params = Parameters().from_json('../config.json')
Building a Diffusers Pipeline
Learn how to build a Stable Diffusion pipeline from the individual building blocks of the diffusers library
Load the necessary modules
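The import cell is not shown above; here is a plausible set of imports inferred from the code used below (Parameters and device_by_name are the author's local helpers, so their module path is only a guess)
import torch
from tqdm.auto import tqdm
from PIL import Image
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, UniPCMultistepScheduler

# local helper utilities; the module name below is a placeholder, not a known path
from utils import Parameters, device_by_name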
Set up the notebook’s configuration parameters
params.gpu = device_by_name("Tesla")
params.height = 512  # default height of Stable Diffusion
params.width = 512  # default width of Stable Diffusion
params.num_inference_steps = 25  # Number of denoising steps
params.guidance_scale = 7.5  # Scale for classifier-free guidance
params.seed = 4356
We will use the GPU
device = params.gpu
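If the device_by_name helper is unavailable, a common fallback (not part of the original notebook) is to let torch pick the device
# illustrative fallback if the device_by_name helper is not available
# device = "cuda" if torch.cuda.is_available() else "cpu"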
Load and define the diffusers building blocks: the VAE, the tokenizer, the text encoder, and the UNet
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to(device)
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder").to(device)
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet").to(device)
Define the scheduler
scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
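Because the pipeline is assembled from individual components, the scheduler can be swapped for any compatible one; as an illustration (not from the original notebook), DDIM could be loaded from the same repository
# illustrative alternative, not used below: swap in a different scheduler such as DDIM
# from diffusers import DDIMScheduler
# scheduler = DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")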
Set up the prompt
= ["a photograph of a cute puppy"]
prompt = torch.manual_seed(params.seed) # Seed generator to create the inital latent noise
generator = len(prompt) batch_size
Calculate the embeddings for the prompt
text_input = tokenizer(
    prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
)
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(device))[0]
Calculate the embeddings for an empty (unconditional) prompt
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0]
The model’s input is a concatenation of the prompt’s and the null prompt’s embeddings
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
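As a quick sanity check (not part of the original notebook), the concatenated tensor stacks the unconditional and conditional embeddings along the batch dimension; for Stable Diffusion v1-4's CLIP text encoder this should give a shape of (2 * batch_size, 77, 768)
# illustrative sanity check: both embeddings stacked along the batch axis
assert text_embeddings.shape[0] == 2 * batch_size
print(text_embeddings.shape)  # expected: torch.Size([2, 77, 768]) for a single prompt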
Generate some initial random noise as a starting point for the diffusion process in the latent space
latents = torch.randn(
    (batch_size, unet.in_channels, params.height // 8, params.width // 8),
    generator=generator,
)
latents = latents.to(device)
Start by scaling the input by sigma, the scheduler’s initial noise scale value
latents = latents * scheduler.init_noise_sigma
Here is the denoising loop, where the magic happens
scheduler.set_timesteps(params.num_inference_steps)
for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)

    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + params.guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
The final step is to use the VAE to decode the latent representation into an image
# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
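The constant 0.18215 is the latent scaling factor the VAE was trained with; recent diffusers releases also expose it on the model config, so the magic number could be avoided (a sketch, assuming a version where the attribute exists)
# alternative to the hard-coded constant, assuming vae.config exposes scaling_factor (0.18215 for SD v1)
# latents = latents / vae.config.scaling_factor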
Lastly, convert the image to a PIL image
image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0]
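Beyond displaying the result in the notebook, the generated PIL image can be saved to disk; the file name here is just an example
pil_images[0].save("puppy.png")  # file name is illustrative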