Wombo (stylized as WOMBO) is a Canadian tech startup centered on AI. Its flagship product is an app called Dream, released in 2021, whose features include creating a deepfake of a person from a provided selfie, text-to-image generation, and more.
Dream is an image and video generation app powered by Stable Diffusion. It can be used to create images from text using a variety of style presets. It can also generate a deepfake using 5-10 images of source material. The app includes a premium tier, which gives users priority processing time and no in-app ads. Wombo processes images in the cloud. CEO Ben-Zion Benkhin says that all user data is deleted after 24 hours.
$$L_{\theta,\phi}=\mathbb{E}_{x\sim\mathbb{P}^{real}}\left[\|x-D_{\theta}(E_{\phi}(x))\|_{2}^{2}\right]+d\left(\mu(dz),E_{\phi}\sharp\mathbb{P}^{real}\right)^{2}.$$ The statistical distance $d$ requires special properties: for instance, it has to possess a formula expressible as an expectation, so that it can be estimated from mini-batches and minimized by stochastic optimization.
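For illustration, a minimal sketch (an assumption for this article, not code from any cited implementation) using the kernel maximum mean discrepancy (MMD) as the statistical distance $d$; MMD satisfies the requirement because its square is a sum of expectations of a kernel, so it can be estimated on mini-batches:

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    # Gaussian RBF kernel between two batches of latent codes.
    d2 = torch.cdist(a, b).pow(2)
    return torch.exp(-d2 / (2 * sigma**2))

def mmd2(z_enc, z_prior, sigma=1.0):
    # Simple (biased) estimate of squared MMD between encoder samples
    # and prior samples; every term is a plain expectation.
    k_ee = rbf_kernel(z_enc, z_enc, sigma).mean()
    k_pp = rbf_kernel(z_prior, z_prior, sigma).mean()
    k_ep = rbf_kernel(z_enc, z_prior, sigma).mean()
    return k_ee + k_pp - 2 * k_ep

def loss(x, encoder, decoder):
    # L = reconstruction error + d(mu(dz), pushforward of data by encoder)^2
    z = encoder(x)                  # E_phi(x)
    x_hat = decoder(z)              # D_theta(E_phi(x))
    recon = (x - x_hat).pow(2).flatten(1).sum(-1).mean()
    z_prior = torch.randn_like(z)   # samples from mu(dz) = N(0, I)
    return recon + mmd2(z, z_prior)
```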
A variational autoencoder (VAE) is an artificial neural network architecture introduced by Diederik P. Kingma and Max Welling. It is part of the families of probabilistic graphical models and variational Bayesian methods. In addition to being seen as an autoencoder neural network architecture, variational autoencoders can also be studied within the mathematical formulation of variational Bayesian methods, connecting
a watermark with greater than 80% probability. Final rounds of training additionally dropped 10% of text conditioning to improve classifier-free diffusion guidance. The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000. SD3 was trained at a cost of around $10 million. Stable Diffusion has issues with degradation and inaccuracies in certain scenarios. Initial releases of
a "standard random number generator", and construct $z$ as $z=\mu_{\phi}(x)+L_{\phi}(x)\epsilon$. Here, $L_{\phi}(x)$
a computational donation from Stability AI and training data from non-profit organizations. Stable Diffusion is a latent diffusion model, a kind of deep generative artificial neural network. Its code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 4 GB VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services. Stable Diffusion originated from
a kind of diffusion model (DM), called a latent diffusion model (LDM), developed by the CompVis (Computer Vision & Learning) group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images, which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts:
a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder
a lower scale value, while use cases aiming for more specific outputs may use a higher value. Additional text2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis on keywords by enclosing them in brackets. An alternative method of adjusting the weight of parts of
a minimum of 30 GB of VRAM, which exceeds the resources usually provided in consumer GPUs such as Nvidia's GeForce 30 series, which has only about 12 GB. The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions. As a result, generated images reinforce social biases and are from a Western perspective, as
a more abstract way the operation of the VAE. In these approaches the loss function is composed of two parts: a reconstruction error term, and a statistical distance between the prior and the pushforward of the data distribution through the encoder. We obtain the final formula for the loss: $$L_{\theta,\phi}=\mathbb{E}_{x\sim\mathbb{P}^{real}}\left[\|x-D_{\theta}(E_{\phi}(x))\|_{2}^{2}\right]+d\left(\mu(dz),E_{\phi}\sharp\mathbb{P}^{real}\right)^{2}.$$
a neural encoder network to its decoder through a probabilistic latent space (for example, as a multivariate Gaussian distribution) that corresponds to the parameters of a variational distribution. Thus, the encoder maps each point (such as an image) from a large complex dataset into a distribution within the latent space, rather than to a single point in that space. The decoder has the opposite function, which
a prior is assumed over the latents $z$ result in intractable integrals. Let us find $p_{\theta}(x)$ via marginalizing over $z$: $$p_{\theta}(x)=\int_{z}p_{\theta}(x,z)\,dz,$$ where $p_{\theta}(x,z)$ represents
a problem. In order to customize the model for new use cases that are not included in the dataset, such as generating anime characters ("waifu diffusion"), new data and further training are required. Fine-tuned adaptations of Stable Diffusion created through additional retraining have been used for a variety of different use cases, from medical imaging to algorithmically generated music. However, this fine-tuning process
a process also referred to as personalization. There are three methods in which user-accessible fine-tuning can be applied to a Stable Diffusion model checkpoint: textual inversion "embeddings", hypernetworks, and DreamBooth. The Stable Diffusion model supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. Existing images can be re-drawn by the model to incorporate new elements described by
a project called Latent Diffusion, developed in Germany by researchers at Ludwig Maximilian University in Munich and Heidelberg University. Four of the original five authors (Robin Rombach, Andreas Blattmann, Patrick Esser and Dominik Lorenz) later joined Stability AI and released subsequent versions of Stable Diffusion. The technical license for the model was released by the CompVis group at Ludwig Maximilian University of Munich. Development
a scheme optimizes a lower bound of the data likelihood, which is usually intractable, and in doing so requires the discovery of q-distributions, or variational posteriors. These q-distributions are normally parameterized for each individual data point in a separate optimization process. However, variational autoencoders use a neural network as an amortized approach to jointly optimize across data points. This neural network takes as input
a text prompt (a process known as "guided image synthesis") through its diffusion-denoising mechanism. In addition, the model also allows the use of prompts to partially alter existing images via inpainting and outpainting, when used with an appropriate user interface that supports such features, of which numerous different open source implementations exist. Stable Diffusion is recommended to be run with 10 GB or more VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32, trading some model accuracy for lower VRAM usage.
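For instance, with Hugging Face's diffusers library (one common way to run the model; the model ID below is an example), half-precision weights can be requested at load time:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the weights in float16 instead of the default float32 to
# roughly halve VRAM usage, at some cost in numerical precision.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model ID
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")
```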
a user interface for Stable Diffusion, took place, with the hackers claiming they targeted users who committed "one of our sins", which included AI-art generation, art theft, and promoting cryptocurrency. In January 2023, three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies have infringed
is computed by the probabilistic encoder. Parametrize the encoder as $E_{\phi}$, and the decoder as $D_{\theta}$. As in every deep learning problem, it is necessary to define a differentiable loss function in order to update the network weights through backpropagation. For variational autoencoders,
Dream was developed in Canada and launched in February 2021 after a beta period in January. Wombo CEO Ben-Zion Benkhin says he got the idea for the app in August 2020. The app is available on both the App Store and Google Play Store. Within its first three weeks of release, the app was downloaded over 20 million times, and over 100 million clips were created using
is expanded as $$D_{KL}(q_{\phi}(\cdot|x)\parallel p_{\theta}(\cdot|x))=\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\ln\frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}\right]=\ln p_{\theta}(x)+\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\ln\frac{q_{\phi}(z|x)}{p_{\theta}(x,z)}\right].$$ Now define the evidence lower bound (ELBO): $$L_{\theta,\phi}(x):=\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\ln\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]=\ln p_{\theta}(x)-D_{KL}(q_{\phi}(\cdot|x)\parallel p_{\theta}(\cdot|x)).$$
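Since the KL divergence is non-negative, this identity immediately exhibits the ELBO as a lower bound on the evidence (a standard one-line consequence, added here for completeness): $$\ln p_{\theta}(x)=L_{\theta,\phi}(x)+D_{KL}(q_{\phi}(\cdot|x)\parallel p_{\theta}(\cdot|x))\geq L_{\theta,\phi}(x),$$ with equality exactly when $q_{\phi}(\cdot|x)=p_{\theta}(\cdot|x)$.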
is implemented as $-\frac{1}{2}\|x-D_{\theta}(z)\|_{2}^{2}$, since that is, up to an additive constant, what $x\sim\mathcal{N}(D_{\theta}(z),I)$ yields. That is, we model
is named "multimodal diffusion transformer" (MMDiT), where "multimodal" means that it mixes text and image encodings inside its operations. This differs from previous versions of DiT, where the text encoding affects the image encoding, but not vice versa. Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from
is normally distributed, as $\mathcal{N}(\mu_{\phi}(x),\Sigma_{\phi}(x))$. This can be reparametrized by letting $\varepsilon\sim\mathcal{N}(0,I)$ be
is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, in comparison to other commercial products based on generative AI. Addressing the concerns that the model may be used for abusive purposes, CEO of Stability AI Emad Mostaque argues that "[it is] peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology", and that putting
is obtained by the Cholesky decomposition: $$\Sigma_{\phi}(x)=L_{\phi}(x)L_{\phi}(x)^{T}.$$ Then we have $$\nabla_{\phi}\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\ln\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]=\mathbb{E}_{\epsilon}\left[\nabla_{\phi}\ln\frac{p_{\theta}(x,\mu_{\phi}(x)+L_{\phi}(x)\epsilon)}{q_{\phi}(\mu_{\phi}(x)+L_{\phi}(x)\epsilon|x)}\right],$$ and so we obtain an unbiased estimator of the gradient, allowing stochastic gradient descent.
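In an automatic differentiation framework, this estimator is what one gets by sampling through the reparameterization; a small PyTorch sketch (assumed shapes, diagonal covariance for brevity):

```python
import torch

# Placeholder encoder outputs standing in for mu_phi(x) and a diagonal
# L_phi(x); gradients must flow through both.
mu = torch.randn(8, 2, requires_grad=True)
log_sigma = torch.zeros(8, 2, requires_grad=True)

# z = mu + sigma * eps with eps ~ N(0, I): the randomness is moved into
# eps, so the sample is a differentiable function of (mu, sigma).
eps = torch.randn(8, 2)
z = mu + log_sigma.exp() * eps

# Equivalently, torch.distributions' rsample() implements this trick;
# sample() would detach the graph and block backpropagation.
z_alt = torch.distributions.Normal(mu, log_sigma.exp()).rsample()
```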
is reflected by the fact that any restrictions Stability AI places on the content that users may generate can easily be bypassed due to the availability of the source code. Controversy around photorealistic sexualized depictions of underage characters has been brought up, due to such images generated by Stable Diffusion being shared on websites such as Pixiv. In June 2024, a hack on an extension of ComfyUI,
is sensitive to the quality of new data; low-resolution images or resolutions different from the original data can not only fail to teach the model the new task but degrade its overall performance. Even when the model is additionally trained on high-quality images, it is difficult for individuals to run models on consumer electronics. For example, the training process for waifu-diffusion requires
is the Jacobian matrix of $z$ with respect to $\epsilon$. Since $z=\mu_{\phi}(x)+L_{\phi}(x)\epsilon$, this is $$\ln q_{\phi}(z|x)=-\frac{1}{2}\|\epsilon\|^{2}-\ln|\det L_{\phi}(x)|-\frac{n}{2}\ln(2\pi).$$ Many variational autoencoder applications and extensions have been used to adapt
is the dimension of $z$. For a more detailed derivation and more interpretations of ELBO and its maximization, see its main page. To efficiently search for $$\theta^{*},\phi^{*}=\underset{\theta,\phi}{\operatorname{argmax}}\,L_{\theta,\phi}(x),$$
is the most popular and offers extra features; Fooocus, which aims to decrease the amount of prompting needed from the user; and ComfyUI, which has a node-based user interface, essentially a visual programming language akin to many 3D modeling applications. Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from
is the premier product of Stability AI and is considered to be a part of the ongoing artificial intelligence boom. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. Its development involved researchers from the CompVis Group at Ludwig Maximilian University of Munich and Runway with
is the remainder of the free energy expression, and requires a sampling approximation to compute its expectation value. More recent approaches replace the Kullback-Leibler divergence (KL-D) with various statistical distances; see the section "Statistical distance VAE variants" below. From the point of view of probabilistic modeling, one wants to maximize the likelihood of the data $x$ by their chosen parameterized probability distribution $p_{\theta}(x)=p(x|\theta)$. This distribution
is to map from the latent space to the input space, again according to a distribution (although in practice, noise is rarely added during the decoding stage). By mapping a point to a distribution instead of a single point, the network can avoid overfitting the training data. Both networks are typically trained together with the usage of the reparameterization trick, although the variance of the noise model can be learned separately. Although this type of model
is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs. The name diffusion takes inspiration from thermodynamic diffusion, and an important link was made between this purely physical field and deep learning in 2015. With 860 million parameters in the U-Net and 123 million in
is usually chosen to be a Gaussian $N(x|\mu,\sigma)$, which is parameterized by $\mu$ and $\sigma$ respectively, and as a member of the exponential family it is easy to work with as a noise distribution. Simple distributions are easy enough to maximize; however, distributions where
is usually taken to be a finite-dimensional vector of real numbers, and $p_{\theta}(x|z)$ to be a Gaussian distribution. Then $p_{\theta}(x)$ is a mixture of Gaussian distributions. It is now possible to define the set of
the $\nabla_{\phi}$ inside the expectation, since $\phi$ appears in the probability distribution itself. The reparameterization trick (also known as stochastic backpropagation) bypasses this difficulty. The most important example is when $z\sim q_{\phi}(\cdot|x)$
the joint distribution under $p_{\theta}$ of the observable data $x$ and its latent representation or encoding $z$. According to the chain rule, the equation can be rewritten as $$p_{\theta}(x)=\int_{z}p_{\theta}(x|z)\,p_{\theta}(z)\,dz.$$ In the vanilla variational autoencoder, $z$
the variational autoencoder (VAE), the U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output from forward diffusion backwards to obtain a latent representation.
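A schematic sketch of this pipeline at generation time (all function and method names here are hypothetical placeholders; real implementations differ in scheduler and conditioning details):

```python
import torch

def generate(unet, vae_decoder, text_encoder, scheduler, prompt, steps=50):
    # Text conditioning: encode the prompt into embeddings the U-Net
    # attends to via cross-attention.
    cond = text_encoder(prompt)

    # Start from pure Gaussian noise in the low-dimensional latent space.
    latents = torch.randn(1, 4, 64, 64)

    # Reverse diffusion: the U-Net iteratively predicts the noise to
    # remove, and the scheduler applies the corresponding update.
    for t in scheduler.timesteps(steps):
        noise_pred = unet(latents, t, cond)
        latents = scheduler.step(noise_pred, t, latents)

    # The VAE decoder maps the denoised latent back to pixel space.
    return vae_decoder(latents)
```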
Maximizing the ELBO, $$\theta^{*},\phi^{*}=\underset{\theta,\phi}{\operatorname{argmax}}\,L_{\theta,\phi}(x),$$ is equivalent to simultaneously maximizing $\ln p_{\theta}(x)$ and minimizing $D_{KL}(q_{\phi}(z|x)\parallel p_{\theta}(z|x))$. That is, maximizing
the English High Court, alleging significant infringement of its intellectual property rights. Getty Images claims that Stability AI "scraped" millions of images from Getty's websites without consent and used these images to train and develop its deep-learning Stable Diffusion model. Key points of the lawsuit include: The trial is expected to take place in summer 2025 and has significant implications for UK copyright law and
the LAION database. The model is insufficiently trained to understand human limbs and faces due to the lack of representative features in the database, and prompting the model to generate images of such type can confound the model. Stable Diffusion XL (SDXL) version 1.0, released in July 2023, introduced native 1024×1024 resolution and improved generation for limbs and text. Accessibility for individual developers can also be
the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided layer mask, which fills the masked space with newly generated content based on the provided prompt. A dedicated model specifically fine-tuned for inpainting use cases was created by Stability AI alongside the release of Stable Diffusion 2.0. Conversely, outpainting extends an image beyond its original dimensions, filling
the amount of noise added to the output image. A higher strength value produces more variation within the image but may produce an image that is not semantically consistent with the prompt provided. There are different methods for performing img2img. The main method is SDEdit, which first adds noise to an image, then denoises it as usual in text2img. The ability of img2img to add noise to the original image makes it potentially useful for data anonymization and data augmentation, in which
the app. The sudden boom in deepfake technology has been described as "a cultural tipping point we aren't ready for", as it is now possible to create a deepfake from any picture on social media in a very short amount of time. It shut down on May 3, 2023. Stable Diffusion Stable Diffusion is a deep learning, text-to-image model released in 2022 based on diffusion techniques. The generative artificial intelligence technology
the architecture to other domains and improve its performance. $\beta$-VAE is an implementation with a weighted Kullback-Leibler divergence term to automatically discover and interpret factorised latent representations. With this implementation, it is possible to force manifold disentanglement for $\beta$ values greater than one. This architecture can discover disentangled latent factors without supervision. The conditional VAE (CVAE) inserts label information in the latent space.
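The $\beta$-VAE modification amounts to a one-line change to the standard loss; a sketch with assumed variable names (PyTorch tensors):

```python
def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    # Reconstruction term (Gaussian decoder => squared error).
    recon = (x - x_hat).pow(2).flatten(1).sum(-1).mean()
    # Closed-form KL to the standard normal prior, weighted by beta.
    # beta > 1 pressures the posterior toward the factorised prior,
    # encouraging disentangled latent factors.
    kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
    return recon + beta * kl
```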
the assumed prior of the data. For example, a standard VAE task such as IMAGENET is typically assumed to have Gaussian-distributed noise; however, tasks such as binarized MNIST require a Bernoulli noise. The KL-D from the free energy expression maximizes the probability mass of the q-distribution that overlaps with the p-distribution, which unfortunately can result in mode-seeking behaviour. The "reconstruction" term
the capabilities of Stable Diffusion into the hands of the public would result in the technology providing a net benefit, in spite of the potential negative consequences. In addition, Mostaque argues that the intention behind the open availability of Stable Diffusion is to end corporate control and dominance over such technologies by companies that have previously only developed closed AI systems for image synthesis. This
the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts written in English than for those written in other languages, with Western or white cultures often being the default representation. To address the limitations of the model's initial training, end users may opt to implement additional training to fine-tune generation outputs to match more specific use cases,
the data points themselves, and outputs parameters for the variational distribution. As it maps from a known input space to the low-dimensional latent space, it is called the encoder. The decoder is the second neural network of this model. It is a function that maps from the latent space to the input space, e.g. as the means of the noise distribution. It is possible to use another neural network that maps to
the distribution of $x$ conditional on $z$ to be a Gaussian distribution centered on $D_{\theta}(z)$. The distributions $q_{\phi}(z|x)$ and $p_{\theta}(z)$ are often also chosen to be Gaussians, as $z|x\sim\mathcal{N}(E_{\phi}(x),\sigma_{\phi}(x)^{2}I)$ and $z\sim\mathcal{N}(0,I)$, with which we obtain by
the following, equivalent form, is: $$L_{\theta,\phi}(x)=\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\ln p_{\theta}(x|z)\right]-D_{KL}(q_{\phi}(\cdot|x)\parallel p_{\theta}(\cdot)),$$ where $\ln p_{\theta}(x|z)$
the formula for KL divergence of Gaussians: $$L_{\theta,\phi}(x)=-\frac{1}{2}\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\|x-D_{\theta}(z)\|_{2}^{2}\right]-\frac{1}{2}\left(N\sigma_{\phi}(x)^{2}+\|E_{\phi}(x)\|_{2}^{2}-2N\ln\sigma_{\phi}(x)\right)+Const.$$ Here $N$ is the dimension of $z$.
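Putting the pieces together, a minimal PyTorch sketch of this Gaussian VAE objective (an illustrative architecture, not from the original papers; it uses a per-dimension variance, a common variant of the scalar $\sigma_{\phi}(x)$ above):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, z_dim)         # E_phi(x)
        self.log_sigma_head = nn.Linear(256, z_dim)  # ln sigma_phi(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))  # D_theta(z)

    def forward(self, x):
        h = self.enc(x)
        mu, log_sigma = self.mu_head(h), self.log_sigma_head(h)
        z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterization
        x_hat = self.dec(z)
        # Negative ELBO: reconstruction error plus closed-form Gaussian KL.
        recon = 0.5 * (x - x_hat).pow(2).sum(-1)
        kl = 0.5 * (log_sigma.exp().pow(2) + mu.pow(2)
                    - 2 * log_sigma - 1).sum(-1)
        return (recon + kl).mean()
```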
the generated output. ControlNet is a neural network architecture designed to manage diffusion models by incorporating additional conditions. It duplicates the weights of neural network blocks into a "locked" copy and a "trainable" copy. The "trainable" copy learns the desired condition, while the "locked" copy preserves the original model. This approach ensures that training with small datasets of image pairs does not compromise the integrity of production-ready diffusion models.
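A greatly simplified sketch of this duplication scheme, joined by the zero-initialized 1×1 "zero convolution" described below (hypothetical module; the real ControlNet wires the trainable copy into the U-Net's decoder at multiple resolutions):

```python
import copy
import torch.nn as nn

class ControlNetBlock(nn.Module):
    def __init__(self, block, channels):
        super().__init__()
        self.locked = block                    # frozen original weights
        self.trainable = copy.deepcopy(block)  # copy that learns the condition
        for p in self.locked.parameters():
            p.requires_grad = False
        # "Zero convolution": 1x1 conv with weight and bias initialized
        # to zero, so the block initially behaves exactly like the original.
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, x, condition):
        # condition is assumed already embedded to x's shape.
        return self.locked(x) + self.zero_conv(self.trainable(x + condition))
```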
the gradient, allowing stochastic gradient descent. Since we reparametrized $z$, we need to find $q_{\phi}(z|x)$. Let $q_{0}$ be the probability density function for $\epsilon$. Then $$\ln q_{\phi}(z|x)=\ln q_{0}(\epsilon)-\ln|\det(\partial_{\epsilon}z)|,$$ where $\partial_{\epsilon}z$
the idea is to jointly optimize the generative model parameters $\theta$ to reduce the reconstruction error between the input and the output, and $\phi$ to make $q_{\phi}(z|x)$ as close as possible to $p_{\theta}(z|x)$. As reconstruction loss, mean squared error and cross entropy are often used. As distance loss between
the integrity of production-ready diffusion models. The "zero convolution" is a 1×1 convolution with both weight and bias initialized to zero. Before training, all zero convolutions produce zero output, preventing any distortion caused by ControlNet. No layer is trained from scratch; the process is still fine-tuning, keeping the original model secure. This method enables training on small-scale or even personal devices. Stability provides an online image generation service called DreamStudio. The company also released an open source version of DreamStudio called StableStudio. In addition to Stability's interfaces, many third-party open source interfaces exist, such as AUTOMATIC1111 Stable Diffusion Web UI, which
the latent space to force a deterministic constrained representation of the learned data. Some structures directly deal with the quality of the generated samples or implement more than one latent space to further improve the representation learning. Some architectures mix VAE and generative adversarial networks to obtain hybrid models. After the initial work of Diederik P. Kingma and Max Welling, several procedures were proposed to formulate in
the licensing of AI-generated content. Unlike models like DALL-E, Stable Diffusion makes its source code available, along with the model (pretrained weights). Prior to Stable Diffusion 3, it applied the Creative ML OpenRAIL-M license, a form of Responsible AI License (RAIL), to the model (M). The license prohibits certain use cases, including crime, libel, harassment, doxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories". The user owns
the log-likelihood of the observed data, and minimizing the divergence of the approximate posterior $q_{\phi}(\cdot|x)$ from the exact posterior $p_{\theta}(\cdot|x)$. The form given is not very convenient for maximization, but
the model provided that the image content is not illegal or harmful to individuals. The images Stable Diffusion was trained on were filtered without human input, leading to some harmful images and large amounts of private and sensitive information appearing in the training data. More traditional visual artists have expressed concern that widespread usage of image synthesis software such as Stable Diffusion may eventually lead to human artists, along with photographers, models, cinematographers, and actors, gradually losing commercial viability against AI-based competitors. Stable Diffusion
the model were trained on a dataset that consists of 512×512 resolution images, meaning that the quality of generated images noticeably degrades when user specifications deviate from its "expected" 512×512 resolution; the version 2.0 update of the Stable Diffusion model later introduced the ability to natively generate images at 768×768 resolution. Another challenge is in generating human limbs due to poor data quality of limbs in
the model's training data identified that out of a smaller subset of 12 million images taken from the original wider dataset used, approximately 47% of the sample size of images came from 100 different domains, with Pinterest taking up 8.5% of the subset, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt and Wikimedia Commons. An investigation by Bayerischer Rundfunk showed that LAION's datasets, hosted on Hugging Face, contain large amounts of private and sensitive data. The model
the previously empty space with content generated based on the provided prompt. A depth-guided model, named "depth2img", was introduced with the release of Stable Diffusion 2.0 on November 24, 2022; this model infers the depth of the provided input image and generates a new output image based on both the text prompt and the depth information, which allows the coherence and depth of the original input image to be maintained in
the problem is to find a good probabilistic autoencoder, in which the conditional likelihood distribution $p_{\theta}(x|z)$ is computed by the probabilistic decoder, and the approximated posterior distribution $q_{\phi}(z|x)$
the prompt are "negative prompts". Negative prompts are a feature included in some front-end implementations, including Stability AI's own DreamStudio cloud service, and allow the user to specify prompts which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise be present within image outputs due to the positive prompts provided by
the prompt. Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion, although this watermark loses its efficacy if the image is resized or rotated. Each txt2img generation will involve a specific seed value which affects the output image. Users may opt to randomize the seed in order to explore different generated outputs, or use
the relationships between the input data and its latent representation as: the prior $p_{\theta}(z)$, the likelihood $p_{\theta}(x|z)$, and the posterior $p_{\theta}(z|x)$. Unfortunately, the computation of $p_{\theta}(z|x)$ is expensive and in most cases intractable. To speed up the calculus and make it feasible, it is necessary to introduce a further function to approximate the posterior distribution as $$q_{\phi}(z|x)\approx p_{\theta}(z|x),$$ with $\phi$ defined as
the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists. In July 2023, U.S. District Judge William Orrick inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz but allowed them to file a new complaint, providing them an opportunity to reframe their arguments. In January 2023, Getty Images initiated legal proceedings against Stability AI in
the rights to their generated output images, and is free to use them commercially. Stable Diffusion 3.5 applies the permissive Stability AI Community License, while commercial enterprises with revenue exceeding $1 million need the Stability AI Enterprise License. As with the OpenRAIL-M license, the user retains the rights to their generated output images and is free to use them commercially. Variational autoencoder In machine learning,
the same seed to obtain the same image output as a previously generated image. Users are also able to adjust the number of inference steps for the sampler; a higher value takes a longer duration of time, but a smaller value may result in visual defects. Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt. More experimentative use cases may opt for a lower scale value.
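Classifier-free guidance combines a conditional and an unconditional noise prediction at every denoising step; a sketch of the combination rule (assumed variable names):

```python
def guided_noise(unet, latents, t, cond_emb, uncond_emb, guidance_scale=7.5):
    # Run the U-Net twice: once conditioned on the prompt embedding and
    # once on an empty (or negative) prompt embedding.
    noise_uncond = unet(latents, t, uncond_emb)
    noise_cond = unet(latents, t, cond_emb)
    # Move the prediction away from the unconditional output and toward
    # the conditional one; larger scales follow the prompt more closely.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```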
the set of real values that parametrize $q$. This is sometimes called amortized inference, since by "investing" in finding a good $q_{\phi}$, one can later infer $z$ from $x$ quickly without doing any integrals. In this way,
the square aspect ratio, like previous versions). The SD XL Refiner, released at the same time, has the same architecture as SD XL, but it was trained for adding fine details to preexisting images via text-conditional img2img. The 3.0 version completely changes the backbone: not a U-Net, but a Rectified Flow Transformer, which implements the rectified flow method with a Transformer. The Transformer architecture used for SD 3.0 has three "tracks", for original text encoding, transformed text encoding, and image encoding (in latent space). The transformed text encoding and image encoding are mixed during each transformer block. The architecture
the text encoder, Stable Diffusion is considered relatively lightweight by 2022 standards, and unlike other diffusion models, it can run on consumer GPUs, and even CPU-only if using the OpenVINO version of Stable Diffusion. The XL version uses the same LDM architecture as previous versions, except larger: a larger U-Net backbone, a larger cross-attention context, two text encoders instead of one, and training on multiple aspect ratios (not just
the two distributions, the Kullback-Leibler divergence $D_{KL}(q_{\phi}(z|x)\parallel p_{\theta}(z|x))$ is a good choice to squeeze $q_{\phi}(z|x)$ under $p_{\theta}(z|x)$. The distance loss just defined
the typical method is gradient ascent. It is straightforward to find $$\nabla_{\theta}\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\ln\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]=\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\nabla_{\theta}\ln\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right].$$ However, $$\nabla_{\phi}\mathbb{E}_{z\sim q_{\phi}(\cdot|x)}\left[\ln\frac{p_{\theta}(x,z)}{q_{\phi}(z|x)}\right]$$ does not allow one to put
the user, or due to how the model was originally trained, with mangled human hands being a common example. Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements provided within the text prompt. The strength value denotes the amount of noise added to the original image.
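A sketch of the SDEdit-style procedure mentioned above (hypothetical helper names): the strength decides how far into the noise schedule the input image is pushed before being denoised back.

```python
def img2img(unet, vae, scheduler, image, cond, strength=0.75, steps=50):
    # Encode the input image into the latent space.
    latents = vae.encode(image)
    # Strength picks the starting point in the noise schedule:
    # strength 1.0 starts from almost pure noise (input largely ignored),
    # strength 0.0 adds no noise (output stays close to the input).
    start = int(steps * strength)
    latents = scheduler.add_noise(latents, timestep=start)
    # Denoise from that intermediate timestep, as in txt2img.
    for t in reversed(range(start)):
        noise_pred = unet(latents, t, cond)
        latents = scheduler.step(noise_pred, t, latents)
    return vae.decode(latents)
```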
the variance; however, this can be omitted for simplicity. In such a case, the variance can be optimized with gradient descent. To optimize this model, one needs to know two terms: the "reconstruction error", and the Kullback-Leibler divergence (KL-D). Both terms are derived from the free energy expression of the probabilistic model, and therefore differ depending on the noise distribution and
the visual features of image data are changed and anonymized. The same process may also be useful for image upscaling, in which the resolution of an image is increased, with more detail potentially being added to the image. Additionally, Stable Diffusion has been experimented with as a tool for image compression. Compared to JPEG and WebP, the recent methods used for image compression in Stable Diffusion face limitations in preserving small text and faces. Additional use cases for image modification via img2img are offered by numerous front-end implementations of
the web, where 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution, predicted likelihood of containing a watermark, and predicted "aesthetic" score (e.g. subjective visual quality). The dataset was created by LAION, a German non-profit which receives funding from Stability AI. The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. A third-party analysis of
the weights in float16 precision instead of the default float32 to trade off model performance for lower VRAM usage. The text-to-image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt in addition to assorted option parameters covering sampling types, output image dimensions, and seed values. The script outputs an image file based on the model's interpretation of the prompt.
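As an illustration of these parameters with the diffusers library (the pipeline API shown is real, but the model ID and values are examples):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

# A fixed seed makes the output reproducible; change it to explore
# different generations of the same prompt.
generator = torch.Generator("cuda").manual_seed(42)

image = pipe(
    "a photograph of an astronaut riding a horse",
    negative_prompt="blurry, low quality",  # features to steer away from
    num_inference_steps=50,   # more steps: slower, usually fewer defects
    guidance_scale=7.5,       # classifier-free guidance scale
    generator=generator,
).images[0]
image.save("astronaut.png")
```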
was initially designed for unsupervised learning, its effectiveness has been proven for semi-supervised learning and supervised learning. A variational autoencoder is a generative model with a prior and noise distribution, respectively. Usually such models are trained using the expectation-maximization meta-algorithm (e.g. probabilistic PCA, (spike & slab) sparse coding). Such
was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them. The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying
was led by Patrick Esser of Runway and Robin Rombach of CompVis, who were among the researchers who had earlier invented the latent diffusion model architecture used by Stable Diffusion. Stability AI also credited EleutherAI and LAION (a German nonprofit which assembled the dataset on which Stable Diffusion was trained) as supporters of the project. Models in the Stable Diffusion series before SD 3 all used