

sd vae ft mse flax

by enterprise-explorers

Open source · 6k downloads · 2 likes

0.6 (2 reviews) · Image · API & Local
About

This model is an enhanced version of the variational autoencoder (VAE) used in Stable Diffusion, optimized for improved image reconstruction. It is a refined variant of the original VAE, specifically fine-tuned on a mix of aesthetic images and human portraits to enhance the quality of faces and fine details. Two versions are available: ft-EMA, which balances accuracy and perceptual quality, and ft-MSE, which emphasizes MSE reconstruction for smoother outputs. These models serve as drop-in replacements for Stable Diffusion's standard autoencoder, delivering sharper and more realistic results, particularly in complex images or portraits. They are especially useful for applications requiring high visual fidelity, such as AI-generated or edited images.

Documentation

Improved Autoencoders

Usage

These weights are intended to be used with the 🧨 diffusers library. If you are looking for the model to use with the original CompVis Stable Diffusion codebase, it is available here.

This is a Flax version of the original weights.
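A minimal sketch of swapping this VAE into a Flax Stable Diffusion pipeline with diffusers. The repo identifiers below are assumptions (check the model page for the exact ids), and the import is done lazily so the sketch can be read without diffusers installed:

```python
def load_pipeline_with_finetuned_vae(
    sd_repo="runwayml/stable-diffusion-v1-5",  # assumed base pipeline repo
    vae_repo="stabilityai/sd-vae-ft-mse",      # assumed repo hosting the Flax VAE weights
):
    # Imported lazily so the sketch can be inspected without diffusers installed.
    from diffusers import FlaxAutoencoderKL, FlaxStableDiffusionPipeline

    # Flax from_pretrained returns the module and its parameters separately.
    vae, vae_params = FlaxAutoencoderKL.from_pretrained(vae_repo)

    # Load the pipeline with the fine-tuned VAE module swapped in...
    pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
        sd_repo, vae=vae, revision="flax"
    )
    # ...and replace the VAE parameters as well (only the decoder was
    # fine-tuned, so this is a drop-in replacement).
    params["vae"] = vae_params
    return pipeline, params
```

Since only the decoder changed, latents produced by the original encoder decode correctly through this VAE.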

Decoder Finetuning

We publish two kl-f8 autoencoder versions, finetuned from the original kl-f8 autoencoder on a 1:1 ratio of LAION-Aesthetics and LAION-Humans, an unreleased subset containing only SFW images of humans. The intent was to fine-tune on the Stable Diffusion training set (the autoencoder was originally trained on OpenImages) but also to enrich the dataset with images of humans to improve the reconstruction of faces.

The first, ft-EMA, was resumed from the original checkpoint, trained for 313198 steps, and uses EMA weights. It uses the same loss configuration as the original checkpoint (L1 + LPIPS). The second, ft-MSE, was resumed from ft-EMA, also uses EMA weights, and was trained for another 280k steps with a different loss that places more emphasis on MSE reconstruction (MSE + 0.1 * LPIPS). It produces somewhat "smoother" outputs.

The batch size for both versions was 192 (16 A100s, batch size 12 per GPU). To keep compatibility with existing models, only the decoder part was finetuned; the checkpoints can be used as a drop-in replacement for the existing autoencoder.
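The two fine-tunes differ only in the reconstruction term of the loss. A toy sketch of the two configurations, using a stand-in for the perceptual distance (real training runs both images through the pretrained LPIPS network, not this stub):

```python
def l1(x, y):
    # Mean absolute error over flattened pixel values.
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def mse(x, y):
    # Mean squared error over flattened pixel values.
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def lpips_stub(x, y):
    # Hypothetical stand-in for the learned LPIPS perceptual distance.
    return l1(x, y)

def ft_ema_loss(x, x_hat):
    # ft-EMA reconstruction loss: L1 + LPIPS (same as the original checkpoint).
    return l1(x, x_hat) + lpips_stub(x, x_hat)

def ft_mse_loss(x, x_hat):
    # ft-MSE reconstruction loss: MSE + 0.1 * LPIPS; down-weighting the
    # perceptual term favors pixel-wise accuracy and "smoother" outputs.
    return mse(x, x_hat) + 0.1 * lpips_stub(x, x_hat)

x = [0.0] * 16          # toy "image"
x_hat = [0.5] * 16      # toy "reconstruction", off by 0.5 everywhere
print(ft_ema_loss(x, x_hat))  # 1.0  (L1 0.5 + stub 0.5)
print(ft_mse_loss(x, x_hat))  # 0.3  (MSE 0.25 + 0.1 * 0.5)
```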

Original kl-f8 VAE vs f8-ft-EMA vs f8-ft-MSE

Evaluation

COCO 2017 (256x256, val, 5000 images)

| Model | Train steps | rFID | PSNR | SSIM | PSIM | Link | Comments |
|---|---|---|---|---|---|---|---|
| original | 246803 | 4.99 | 23.4 +/- 3.8 | 0.69 +/- 0.14 | 1.01 +/- 0.28 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | as used in SD |
| ft-EMA | 560001 | 4.42 | 23.8 +/- 3.9 | 0.69 +/- 0.13 | 0.96 +/- 0.27 | https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt | slightly better overall, with EMA |
| ft-MSE | 840001 | 4.70 | 24.5 +/- 3.7 | 0.71 +/- 0.13 | 0.92 +/- 0.27 | https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 * LPIPS), smoother outputs |

LAION-Aesthetics 5+ (256x256, subset, 10000 images)

| Model | Train steps | rFID | PSNR | SSIM | PSIM | Link | Comments |
|---|---|---|---|---|---|---|---|
| original | 246803 | 2.61 | 26.0 +/- 4.4 | 0.81 +/- 0.12 | 0.75 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | as used in SD |
| ft-EMA | 560001 | 1.77 | 26.7 +/- 4.8 | 0.82 +/- 0.12 | 0.67 +/- 0.34 | https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt | slightly better overall, with EMA |
| ft-MSE | 840001 | 1.88 | 27.3 +/- 4.7 | 0.83 +/- 0.11 | 0.65 +/- 0.34 | https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 * LPIPS), smoother outputs |
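Of the metrics in the tables, PSNR is the simplest pixel-level fidelity score (higher is better, in dB). A minimal computation using the standard definition — this is an illustration, not the evaluation script used for the numbers above:

```python
import math

def psnr(x, y, peak=1.0):
    # Peak signal-to-noise ratio in dB for images with values in [0, peak]:
    # PSNR = 10 * log10(peak^2 / MSE).
    err = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    return 10.0 * math.log10(peak ** 2 / err)

# A reconstruction off by 0.1 at every pixel -> MSE 0.01 -> 20 dB.
x = [0.5] * 16
y = [0.6] * 16
print(round(psnr(x, y), 1))  # 20.0
```

The ~24 dB (COCO) and ~27 dB (LAION-Aesthetics) averages above thus correspond to per-pixel errors well under 0.1 on a [0, 1] scale.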

Visual

Visualization of reconstructions on 256x256 images from the COCO2017 validation dataset.


256x256: ft-EMA (left), ft-MSE (middle), original (right)

Capabilities & Tags
transformers · stable-diffusion · stable-diffusion-diffusers · text-to-image
Specifications
Category: Image
Access: API & Local
License: Open Source
Pricing: Open Source
