AI Explorer

Find and compare the best artificial intelligence tools for your projects.

© 2026 AI Explorer·All rights reserved.

Zamba2 1.2B instruct

by Zyphra

Open source · 93k downloads · 30 likes

Rating: 1.9 (30 reviews) · Chat · API & Local

About

Zamba2 1.2B Instruct is a language model tuned to follow instructions and hold natural conversations. Its hybrid architecture combines Mamba2 state-space blocks with shared transformer blocks, which keeps the model compact while delivering strong text generation. According to its model card, it offers low inference latency and a small memory footprint, and it outscores Gemma2-2B-Instruct, a model more than twice its size, on Aggregate MT-Bench. It suits applications that need responsive interaction, such as conversational assistants or writing-assistance tools.

Documentation

Model Card for Zamba2-1.2B

Zamba2-1.2B-instruct is obtained from Zamba2-1.2B by fine-tuning on instruction-following and chat datasets. Specifically:

  1. Supervised fine-tuning (SFT) of the base Zamba2-1.2B model on ultrachat_200k and Infinity-Instruct
  2. Direct preference optimization (DPO) of the SFT checkpoint on ultrafeedback_binarized, orca_dpo_pairs, and OpenHermesPreferences
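
For readers unfamiliar with the second step, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not Zyphra's training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for illustration, and the model card does not state the hyperparameters actually used.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference (SFT) model.
    beta=0.1 is an illustrative default, not a value from the model card.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# The loss is lower when the policy favors the chosen response more than
# the reference model does.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
print(dpo_loss(-5.0, -1.0, -2.0, -2.0))
```

In this formulation the SFT checkpoint from step 1 would typically serve as the reference model.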

Zamba2-1.2B-Instruct is a hybrid model composed of state-space (Mamba2) and transformer blocks.

Quick start

Prerequisites

To download Zamba2-1.2B-instruct, install transformers from source:

  1. git clone https://github.com/huggingface/transformers.git
  2. cd transformers && pip install .

To install dependencies necessary to run Mamba2 kernels, install mamba-ssm from source (due to compatibility issues with PyTorch) as well as causal-conv1d:

  1. git clone https://github.com/state-spaces/mamba.git
  2. cd mamba && git checkout v2.1.0 && pip install .
  3. pip install causal-conv1d

You can run the model without using the optimized Mamba2 kernels, but it is not recommended as it will result in significantly higher latency and memory usage.
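
Before loading the model, it can help to confirm that the two kernel packages are importable. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the check only tells you whether the packages are installed, not whether their kernels are compatible with your GPU or PyTorch build.

```python
import importlib.util


def missing_packages(names=("mamba_ssm", "causal_conv1d")):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    # Per the note above, the model can still run without the optimized
    # kernels, at significantly higher latency and memory usage.
    print("Optimized Mamba2 kernels unavailable; missing:", ", ".join(missing))
else:
    print("mamba_ssm and causal_conv1d are installed.")
```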

Inference

Python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Instantiate model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct")
model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-instruct", device_map="cuda", torch_dtype=torch.bfloat16)

# Format the input as a chat template
prompt = "What factors contributed to the fall of the Roman Empire?"
sample = [{'role': 'user', 'content': prompt}]
chat_sample = tokenizer.apply_chat_template(sample, tokenize=False)

# Tokenize input and generate output
input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False)
print(tokenizer.decode(outputs[0]))
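
For a causal language model, the sequence returned by `generate` starts with the prompt tokens, so the decoded string above repeats the formatted prompt. If only the reply is wanted, one option is to slice off the prompt before decoding. The helper below is a sketch; `new_token_ids` is a hypothetical name, and it works on any sliceable sequence, including a 1-D tensor.

```python
def new_token_ids(output_ids, prompt_length):
    """Return only the token ids generated after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Toy example: the first three ids stand in for the prompt.
print(new_token_ids([101, 102, 103, 7, 8, 9], 3))

# With the inference snippet above, usage would look like this
# (left commented out because it needs the model and a GPU):
#   prompt_length = input_ids["input_ids"].shape[1]
#   reply_ids = new_token_ids(outputs[0], prompt_length)
#   print(tokenizer.decode(reply_ids, skip_special_tokens=True))
```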

Performance

Zamba2-1.2B-Instruct achieves leading instruction-following and multi-turn chat performance for a model of its size and is competitive with significantly larger models. For instance, it outperforms Gemma2-2B-Instruct, a strong model over 2x its size (2.7B parameters), on Aggregate MT-Bench (59.53 vs. 51.69), while trailing it slightly on IFEval (41.45 vs. 42.20).

Model                   Size   Aggregate MT-Bench   IFEval
Zamba2-1.2B-Instruct    1.2B   59.53                41.45
Gemma2-2B-Instruct      2.7B   51.69                42.20
H2O-Danube-1.8B-Chat    1.6B   49.78                27.95
StableLM-1.6B-Chat      1.6B   49.87                33.77
SmolLM-1.7B-Instruct    1.7B   43.37                16.53
Qwen2-1.5B-Instruct     1.5B   N/A                  34.68
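
To make the comparison with Gemma2-2B-Instruct concrete, the short calculation below uses only the figures from the table; the variable names are illustrative.

```python
# (size in billions of parameters, Aggregate MT-Bench, IFEval), from the table.
zamba = (1.2, 59.53, 41.45)
gemma = (2.7, 51.69, 42.20)

size_ratio = gemma[0] / zamba[0]      # how much larger Gemma2-2B-Instruct is
mt_bench_gap = zamba[1] - gemma[1]    # positive means Zamba2 scores higher
ifeval_gap = zamba[2] - gemma[2]      # negative means Zamba2 scores lower

print(f"size ratio: {size_ratio:.2f}x")
print(f"Aggregate MT-Bench gap: {mt_bench_gap:+.2f}")
print(f"IFEval gap: {ifeval_gap:+.2f}")
```

On these numbers Gemma2-2B-Instruct is about 2.25x larger, and Zamba2-1.2B-Instruct leads by 7.84 points on Aggregate MT-Bench and trails by 0.75 points on IFEval.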

Moreover, due to its unique hybrid SSM architecture, Zamba2-1.2B-Instruct achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.

[Figure: Zamba performance. Panels: Time to First Token (TTFT); Output Generation]

And memory overhead:

[Figure: Zamba inference and memory cost]
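
One reason a hybrid SSM model can use less memory at inference time is that only its attention layers keep a key/value cache that grows with sequence length, whereas Mamba2 layers carry a fixed-size state. The sketch below estimates KV-cache size with the usual formula. All dimensions are hypothetical and chosen for illustration; they are not Zamba2-1.2B's actual configuration, and `kv_cache_bytes` is an assumed helper name.

```python
def kv_cache_bytes(num_attention_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV-cache size in bytes (bytes_per_value=2 matches bfloat16)."""
    # Factor 2: one cached tensor for keys and one for values per layer.
    return (2 * num_attention_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical comparison: attention at all 24 depths versus attention
# applied at only 4 depths, with the remaining layers being SSM layers
# whose state does not grow with sequence length.
for seq_len in (1024, 4096, 16384):
    dense = kv_cache_bytes(24, 16, 128, seq_len)
    hybrid = kv_cache_bytes(4, 16, 128, seq_len)
    print(f"{seq_len:>6} tokens: dense {dense / 2**20:.0f} MiB, "
          f"hybrid {hybrid / 2**20:.0f} MiB")
```

Note that sharing attention weights does not share the cache: each position where the shared block is applied still stores its own keys and values.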

Model Details

Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba2 layers interleaved with one or more shared attention layers. This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared transformer blocks to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
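
The parameter saving from weight sharing plus per-position LoRA can be illustrated with a rough count. The sketch below considers a single square projection matrix only, and the dimensions (`d_model=2048`, six positions, rank 128) are hypothetical values for illustration, not Zamba2-1.2B's real configuration.

```python
def shared_block_params(d_model, num_positions, lora_rank):
    """Parameters for one shared d_model x d_model projection plus a
    separate low-rank (LoRA) adapter at each position where it is applied."""
    shared = d_model * d_model                        # stored once, reused everywhere
    lora = num_positions * 2 * d_model * lora_rank    # A (d x r) and B (r x d) per position
    return shared + lora


def unshared_block_params(d_model, num_positions):
    """Parameters if every position had its own full projection matrix."""
    return num_positions * d_model * d_model


d_model, num_positions, lora_rank = 2048, 6, 128     # hypothetical values
print("shared + LoRA:", shared_block_params(d_model, num_positions, lora_rank))
print("unshared:     ", unshared_block_params(d_model, num_positions))
```

Under these assumed dimensions the shared-plus-LoRA variant needs roughly 7.3M parameters against about 25.2M for fully separate matrices, which is the sense in which each block can specialize while the overhead stays small.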

[Figure: Zamba architecture]

Note: this is a temporary HuggingFace implementation of Zamba2-1.2B. It may not yet be fully compatible with all frameworks and tools intended to interface with HuggingFace models.

A standalone PyTorch implementation of Zamba2-1.2B may be found here.

Capabilities & Tags

transformers · safetensors · zamba2 · text-generation · conversational · endpoints_compatible

Specifications

  • ›Category: Chat
  • ›Access: API & Local
  • ›License: Open Source
  • ›Pricing: Open Source
  • ›Parameters: 1.2B