by enterprise-explorers
This model is an improved version of the variational autoencoder (VAE) used by Stable Diffusion, optimized for better image reconstruction. It is a fine-tuned variant of the original VAE, trained on a mix of aesthetic images and human portraits to improve the quality of faces and fine details. Two versions are available: one balancing reconstruction accuracy and perceptual quality (ft-EMA), the other optimized for smoother outputs and more faithful reconstruction (ft-MSE). Both serve as a drop-in replacement for the standard Stable Diffusion autoencoder, producing sharper and more realistic results, especially on complex images and faces. They are particularly useful for applications requiring high visual fidelity, such as AI image generation and editing.
These weights are intended to be used with the 🧨 diffusers library. If you are looking for the model to use with the original CompVis Stable Diffusion codebase, use the checkpoint links in the tables below.
This is a Flax version of the original weights.
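As a sketch of the drop-in usage with diffusers (the repository ids below are assumptions based on the published checkpoints, and any Stable Diffusion 1.x pipeline accepting a `vae` argument would work the same way):

```python
# Hedged sketch: swap the fine-tuned VAE into a Stable Diffusion pipeline.
# Model ids are illustrative assumptions; downloading weights needs network access.
from diffusers import AutoencoderKL, StableDiffusionPipeline

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # any SD 1.x checkpoint
    vae=vae,                          # drop-in replacement for the stock autoencoder
)
image = pipe("a portrait photo of an astronaut").images[0]
```

Because only the decoder was fine-tuned, the latent space is unchanged and no other pipeline component needs to be modified.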
We publish two kl-f8 autoencoder versions, fine-tuned from the original kl-f8 autoencoder on a 1:1 ratio of LAION-Aesthetics and LAION-Humans, an unreleased subset containing only SFW images of humans. The intent was to fine-tune on the Stable Diffusion training set (the autoencoder was originally trained on OpenImages) while also enriching the dataset with images of humans to improve the reconstruction of faces.

The first version, ft-EMA, was resumed from the original checkpoint, trained for 313,198 steps, and uses EMA weights. It uses the same loss configuration as the original checkpoint (L1 + LPIPS). The second, ft-MSE, was resumed from ft-EMA, also uses EMA weights, and was trained for another 280k steps with a different loss that puts more emphasis on MSE reconstruction (MSE + 0.1 * LPIPS); it produces somewhat "smoother" outputs. The batch size for both versions was 192 (16 A100s, batch size 12 per GPU).

To keep compatibility with existing models, only the decoder part was fine-tuned; the checkpoints can be used as a drop-in replacement for the existing autoencoder.
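The ft-MSE objective described above can be sketched as follows. `lpips_fn` here is a placeholder for a real perceptual-similarity implementation (e.g. the `lpips` package), not part of the original training code:

```python
import numpy as np

def ft_mse_loss(x, x_hat, lpips_fn, lpips_weight=0.1):
    """Sketch of the ft-MSE reconstruction loss: MSE + 0.1 * LPIPS.

    lpips_fn is a stand-in for a perceptual metric; any callable
    returning a scalar distance between the two images works here.
    """
    mse = float(np.mean((x - x_hat) ** 2))
    return mse + lpips_weight * lpips_fn(x, x_hat)
```

For ft-EMA the analogous combination is L1 + LPIPS, i.e. the original checkpoint's loss configuration.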
Original kl-f8 VAE vs f8-ft-EMA vs f8-ft-MSE
Evaluation on 256x256 images from the COCO 2017 validation dataset:

| Model | train steps | rFID | PSNR | SSIM | PSIM | Link | Comments |
|---|---|---|---|---|---|---|---|
| original | 246803 | 4.99 | 23.4 +/- 3.8 | 0.69 +/- 0.14 | 1.01 +/- 0.28 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | as used in SD |
| ft-EMA | 560001 | 4.42 | 23.8 +/- 3.9 | 0.69 +/- 0.13 | 0.96 +/- 0.27 | https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt | slightly better overall, with EMA |
| ft-MSE | 840001 | 4.70 | 24.5 +/- 3.7 | 0.71 +/- 0.13 | 0.92 +/- 0.27 | https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 * LPIPS), smoother outputs |
Evaluation on 256x256 images from the LAION-Aesthetics 5+ dataset:

| Model | train steps | rFID | PSNR | SSIM | PSIM | Link | Comments |
|---|---|---|---|---|---|---|---|
| original | 246803 | 2.61 | 26.0 +/- 4.4 | 0.81 +/- 0.12 | 0.75 +/- 0.36 | https://ommer-lab.com/files/latent-diffusion/kl-f8.zip | as used in SD |
| ft-EMA | 560001 | 1.77 | 26.7 +/- 4.8 | 0.82 +/- 0.12 | 0.67 +/- 0.34 | https://huggingface.co/stabilityai/sd-vae-ft-ema-original/resolve/main/vae-ft-ema-560000-ema-pruned.ckpt | slightly better overall, with EMA |
| ft-MSE | 840001 | 1.88 | 27.3 +/- 4.7 | 0.83 +/- 0.11 | 0.65 +/- 0.34 | https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt | resumed with EMA from ft-EMA, emphasis on MSE (rec. loss = MSE + 0.1 * LPIPS), smoother outputs |
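For reference, the PSNR column in the tables follows the standard definition for 8-bit images; a minimal numpy sketch (the exact evaluation pipeline is not specified here):

```python
import numpy as np

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images (higher is better)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher PSNR and SSIM indicate a more faithful reconstruction, while lower rFID and PSIM indicate better perceptual quality, which is why ft-EMA leads on rFID but ft-MSE leads on PSNR/SSIM.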
Visualization of reconstructions on 256x256 images from the COCO2017 validation dataset.
256x256: ft-EMA (left), ft-MSE (middle), original (right)