by diffusers
Open source · 208k downloads · 364 likes
Stable Diffusion XL 1.0 Inpainting 0.1 is an AI-powered image generation model capable of creating photorealistic visuals from text descriptions, featuring advanced targeted retouching functionality. Using a masking system, it allows for the modification or completion of specific areas within an image while preserving the rest of the content, ensuring high precision in adjustments. Ideal for artists, designers, or content creators, it excels at altering elements such as backgrounds, objects, or details without disrupting the overall composition. The model stands out for its ability to seamlessly integrate text-suggested modifications while maintaining visual coherence. Its applications span artistic creation, professional image editing, and visual experimentation, though it does not guarantee absolute accuracy or realism.
license: openrail++
base_model: stabilityai/stable-diffusion-xl-base-1.0
tags:

SD-XL Inpainting 0.1 is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input, with the extra capability of inpainting the pictures by using a mask.
SD-XL Inpainting 0.1 was initialized with the stable-diffusion-xl-base-1.0 weights. The model was trained for 40k steps at resolution 1024x1024, with 5% dropping of the text-conditioning to improve classifier-free guidance sampling. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and, in 25% of cases, mask everything.
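
The channel expansion described above can be illustrated with a minimal PyTorch sketch. It is not the training code for this model: the helper `expand_conv_in`, the stand-in layer sizes, and the channel order used when concatenating the inputs are assumptions made for illustration. The point it shows is that zero-initialized weights for the new channels leave the pretrained behavior unchanged at the start of fine-tuning.

```python
import torch
from torch import nn


def expand_conv_in(conv: nn.Conv2d, extra_channels: int) -> nn.Conv2d:
    """Return a copy of `conv` that accepts `extra_channels` more input channels.

    Hypothetical helper: pretrained weights are kept for the original channels,
    and the weights for the new channels start at zero.
    """
    new_conv = nn.Conv2d(
        conv.in_channels + extra_channels,
        conv.out_channels,
        kernel_size=conv.kernel_size,
        stride=conv.stride,
        padding=conv.padding,
        bias=conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        # Copy the pretrained kernel into the first (original) input channels.
        new_conv.weight[:, : conv.in_channels] = conv.weight
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv


# Stand-in for the first convolution of a 4-channel latent UNet (sizes are illustrative).
base_conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)
inpaint_conv = expand_conv_in(base_conv, extra_channels=5)

# Small spatial size to keep the sketch light.
latents = torch.randn(1, 4, 32, 32)               # noisy latents
mask = torch.rand(1, 1, 32, 32)                   # mask at latent resolution
masked_image_latents = torch.randn(1, 4, 32, 32)  # encoded masked image

# Assumed channel order: latents, then mask, then masked-image latents (9 channels total).
unet_input = torch.cat([latents, mask, masked_image_latents], dim=1)

with torch.no_grad():
    out_new = inpaint_conv(unet_input)
    out_old = base_conv(latents)
```

Because the five extra channels are multiplied by zero weights, `out_new` matches `out_old` up to floating-point error, so the expanded layer starts from the non-inpainting checkpoint's behavior and only learns to use the mask and masked image during training.
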
```python
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch

# Load the inpainting pipeline in half precision and move it to the GPU
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

# Resize both inputs to the 1024x1024 training resolution
image = load_image(img_url).resize((1024, 1024))
mask_image = load_image(mask_url).resize((1024, 1024))

prompt = "a tiger sitting on a park bench"
generator = torch.Generator(device="cuda").manual_seed(0)  # fixed seed for reproducibility

image = pipe(
    prompt=prompt,
    image=image,
    mask_image=mask_image,
    guidance_scale=8.0,
    num_inference_steps=20,  # steps between 15 and 30 work well for us
    strength=0.99,  # make sure to use `strength` below 1.0
    generator=generator,
).images[0]
```
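
To make the interaction between `num_inference_steps` and `strength` concrete, here is a small arithmetic sketch. It assumes the pipeline follows the usual diffusers image-to-image convention, where `strength` scales how many of the scheduled steps are actually run; the helper name `effective_denoising_steps` is hypothetical and not part of the library.

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps run for a given strength.

    Assumption: the pipeline truncates num_inference_steps * strength to an
    integer and caps it at num_inference_steps.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)


print(effective_denoising_steps(20, 0.99))  # 19
print(effective_denoising_steps(20, 0.5))   # 10
print(effective_denoising_steps(20, 1.0))   # 20
```

Under this assumption, the settings in the example above run 19 of the 20 scheduled steps, so the input image still contributes to the starting latents, whereas a `strength` of 1.0 would start from pure noise.
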
How it works:
| image | mask_image |
|---|---|
| (image not available) | (image not available) |

| prompt | Output |
|---|---|
| a tiger sitting on a park bench | (image not available) |
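
If no ready-made mask is available, one can be drawn by hand. The sketch below uses Pillow to build a rectangular mask, relying on the diffusers convention that white pixels are repainted and black pixels are preserved. The helper `make_box_mask` and the box coordinates are hypothetical, chosen only for illustration.

```python
from PIL import Image, ImageDraw


def make_box_mask(size, box):
    """Return a grayscale mask: white (255) inside `box`, black (0) elsewhere.

    `size` is (width, height); `box` is (left, top, right, bottom) in pixels.
    """
    mask = Image.new("L", size, 0)                  # start fully black (keep everything)
    ImageDraw.Draw(mask).rectangle(box, fill=255)   # white region will be repainted
    return mask


# Illustrative region in a 1024x1024 image; adjust to the area you want to replace.
mask_image = make_box_mask((1024, 1024), box=(256, 384, 768, 896))
```

The resulting `mask_image` can be passed as the `mask_image` argument in place of the downloaded mask, provided it has the same size as the input image.
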
The model is intended for research purposes only. Possible research areas and tasks include
Excluded uses are described below.
The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.