
LFM2.5 Audio 1.5B ONNX

by LiquidAI

Open source · 249 downloads · 16 likes

Rating: 1.5 (16 reviews) · Audio · API & Local
About

LFM2.5 Audio 1.5B ONNX is a multimodal model optimized for generative AI, capable of processing both text and audio. It excels in three core tasks: automatic speech recognition (ASR) for transcribing audio into text, text-to-speech synthesis (TTS) for generating audio from text, and a mixed mode enabling seamless interactions alternating between text and audio. Designed for high-performance inference across diverse platforms, it stands out for its versatility and efficiency, particularly through optimized variants for WebGPU or server environments. Ideal for applications requiring fluid audio-text comprehension and production, it adapts equally well to local deployments and cloud-based solutions. Its lightweight architecture and autoregressive capabilities make it a powerful tool for interactive or automated solutions.

Documentation
Liquid AI · Try LFM · Documentation · LEAP

LFM2.5-Audio-1.5B-ONNX

ONNX export of LFM2.5-Audio-1.5B for cross-platform inference.

LFM2.5-Audio is a multimodal model supporting three modes:

  • ASR (Automatic Speech Recognition): Audio → Text
  • TTS (Text-to-Speech): Text → Audio
  • Interleaved: Mixed text and audio input/output

Recommended Variants

Decoder | Vocoder | Size   | Platform       | Use Case
--------|---------|--------|----------------|---------------------------
Q4      | Q4      | ~1.5GB | WebGPU, Server | Recommended for most uses
FP16    | FP16    | ~3.2GB | Server         | Higher quality

  • WebGPU: Use Q4 decoder + Q4 vocoder (Q8 not supported)
  • Server: Q4 for efficiency, FP16 for quality
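As a quick illustration, the variant choice above can be encoded as a small helper. This is a sketch, not part of any shipped API; the platform labels are ours:

```javascript
// Hypothetical helper encoding the variant table above.
// Platform labels ("webgpu", "server", "server-quality") are illustrative only.
function pickVariant(platform) {
  if (platform === "webgpu") {
    return { decoder: "q4", vocoder: "q4" }; // Q8 is not supported on WebGPU
  }
  if (platform === "server-quality") {
    return { decoder: "fp16", vocoder: "fp16" }; // higher quality, ~3.2GB
  }
  return { decoder: "q4", vocoder: "q4" }; // efficient default, ~1.5GB
}
```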

Model Files

Text
onnx/
├── decoder.onnx                    # LFM2 backbone (FP32)
├── decoder.onnx_data*
├── decoder_fp16.onnx               # LFM2 backbone (FP16)
├── decoder_fp16.onnx_data*
├── decoder_q4.onnx                 # LFM2 backbone (Q4, recommended)
├── decoder_q4.onnx_data
├── audio_encoder.onnx              # Conformer encoder for ASR (FP32)
├── audio_encoder.onnx_data
├── audio_encoder_fp16.onnx         # Conformer encoder (FP16)
├── audio_encoder_fp16.onnx_data
├── audio_encoder_q4.onnx           # Conformer encoder (Q4)
├── audio_encoder_q4.onnx_data
├── audio_embedding.onnx            # Audio code embeddings (FP32)
├── audio_embedding_fp16.onnx       # Audio code embeddings (FP16)
├── audio_embedding_q4.onnx         # Audio code embeddings (Q4)
├── audio_detokenizer.onnx          # Neural vocoder STFT (FP32)
├── audio_detokenizer.onnx_data
├── audio_detokenizer_fp16.onnx     # Neural vocoder (FP16)
├── audio_detokenizer_fp16.onnx_data
├── audio_detokenizer_q4.onnx       # Neural vocoder (Q4)
├── audio_detokenizer_q4.onnx_data
├── vocoder_depthformer.onnx        # Audio codebook prediction (FP32)
├── vocoder_depthformer.onnx_data
├── vocoder_depthformer_fp16.onnx   # Audio codebook prediction (FP16)
├── vocoder_depthformer_fp16.onnx_data
├── vocoder_depthformer_q4.onnx     # Audio codebook prediction (Q4)
├── vocoder_depthformer_q4.onnx_data
├── embed_tokens.bin                # Text embeddings (binary)
├── embed_tokens.json               # Text embeddings metadata
├── audio_embedding.bin             # Audio embeddings (binary, for direct lookup)
├── audio_embedding.json            # Audio embeddings metadata
└── mel_config.json                 # Mel spectrogram configuration

* Large models (>2GB) split weights across multiple files:
  decoder.onnx_data, decoder.onnx_data_1, decoder.onnx_data_2, etc.
  All data files must be in the same directory as the .onnx file.
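That naming scheme can be sketched as a small JavaScript helper (the function name is ours; the `loadSession` function in the WebGPU section builds the same list inline):

```javascript
// Sketch: enumerate the external-data files for a split model,
// following the naming scheme described above
// (name.onnx_data, name.onnx_data_1, name.onnx_data_2, ...).
function externalDataFiles(baseName, count) {
  const files = [];
  for (let i = 0; i < count; i++) {
    files.push(i === 0 ? `${baseName}.onnx_data` : `${baseName}.onnx_data_${i}`);
  }
  return files;
}
```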

Python

Use the onnx-export repository for inference.

Installation

Bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

ASR (Speech Recognition)

Transcribe audio to text:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode asr \
    --audio input.wav \
    --precision q4

TTS (Text-to-Speech)

Generate audio from text:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode tts \
    --prompt "Hello, this is a test of text to speech synthesis." \
    --output output.wav \
    --precision q4

Options:

  • --system "Perform TTS. Use the UK female voice." - Custom system prompt
  • --audio-temperature 0.8 - Audio sampling temperature
  • --audio-top-k 64 - Top-k sampling for audio

Interleaved (Mixed Audio/Text)

Generate interleaved text and audio response from audio input:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode interleaved \
    --audio input.wav \
    --output output.wav \
    --precision q4

Or from text prompt:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode interleaved \
    --prompt "Respond with audio" \
    --output output.wav \
    --precision q4

CLI Options

Bash
uv run lfm2-audio-infer --help
Option              | Description
--------------------|--------------------------------
--mode              | asr, tts, or interleaved
--precision         | fp16, q4, or q8 (default: fp32)
--audio             | Input audio file (WAV)
--output            | Output audio file (WAV)
--prompt            | Text prompt
--system            | System prompt
--max-tokens        | Maximum tokens to generate
--temperature       | Text sampling temperature
--audio-temperature | Audio sampling temperature
--audio-top-k       | Top-k sampling for audio
--seed              | Random seed for reproducibility

WebGPU (Browser)

Installation

Bash
npm install onnxruntime-web @huggingface/transformers

Enable WebGPU

WebGPU is required for browser inference. To enable:

  1. Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
  2. Verify: Check chrome://gpu for "WebGPU" status
  3. Test: Run navigator.gpu.requestAdapter() in DevTools console
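The checks above can be wrapped in one probe that runs before any model loading. This is a minimal sketch; the function name and return shape are ours, and the `gpu` parameter (pass `navigator.gpu` in the browser) just makes the check testable:

```javascript
// Minimal sketch: probe WebGPU support before loading models.
// In the browser, call checkWebGPU(navigator.gpu).
async function checkWebGPU(gpu) {
  if (!gpu) {
    return { ok: false, reason: "navigator.gpu is undefined (WebGPU disabled?)" };
  }
  // requestAdapter resolves to null when no suitable adapter exists.
  const adapter = await gpu.requestAdapter();
  if (!adapter) {
    return { ok: false, reason: "no WebGPU adapter available" };
  }
  return { ok: true, adapter };
}
```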

Inference

JavaScript
import * as ort from "onnxruntime-web/webgpu";
import { AutoTokenizer } from "@huggingface/transformers";

// Check WebGPU availability
if (!navigator.gpu) {
  throw new Error("WebGPU not available. Enable at chrome://flags/#enable-unsafe-webgpu");
}

ort.env.wasm.numThreads = 1;

const modelId = "LiquidAI/LFM2.5-Audio-1.5B-ONNX";
const modelBase = `https://huggingface.co/${modelId}/resolve/main`;

// Load tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);

// Load ONNX sessions
async function loadSession(name, dataFiles = 1) {
  const onnxPath = `${modelBase}/onnx/${name}.onnx`;
  const externalData = [];
  for (let i = 0; i < dataFiles; i++) {
    const suffix = i === 0 ? "" : `_${i}`;
    const fileName = `${name}.onnx_data${suffix}`;
    externalData.push({ path: fileName, data: `${modelBase}/onnx/${fileName}` });
  }
  return ort.InferenceSession.create(onnxPath, {
    executionProviders: ["webgpu"],
    externalData,
  });
}

// Load models (Q4 recommended for WebGPU)
const decoder = await loadSession("decoder_q4");
const audioEmbedding = await loadSession("audio_embedding_q4");
const detokenizer = await loadSession("audio_detokenizer_q4");
const depthformer = await loadSession("vocoder_depthformer_q4");

// Load text embeddings binary
const embedResponse = await fetch(`${modelBase}/onnx/embed_tokens.bin`);
const embedBuffer = await embedResponse.arrayBuffer();
const embedMetaResponse = await fetch(`${modelBase}/onnx/embed_tokens.json`);
const embedMeta = await embedMetaResponse.json();
const embedWeight = new Float32Array(embedBuffer);

function getTextEmbeddings(ids) {
  const hiddenSize = embedMeta.hidden_size;
  const embeds = new Float32Array(ids.length * hiddenSize);
  for (let i = 0; i < ids.length; i++) {
    const offset = ids[i] * hiddenSize;
    embeds.set(embedWeight.subarray(offset, offset + hiddenSize), i * hiddenSize);
  }
  return new ort.Tensor("float32", embeds, [1, ids.length, hiddenSize]);
}

// Model config
const hiddenSize = 2048;
const numCodebooks = 8;
const codebookVocab = 2049;

// TTS example
const text = "Hello, this is a test.";
const prompt = `<|startoftext|><|im_start|>system
Perform TTS. Use the UK female voice.<|im_end|>
<|im_start|>user
${text}<|im_end|>
<|im_start|>assistant
`;

const inputIds = tokenizer.encode(prompt);
let embeds = getTextEmbeddings(inputIds);

// Initialize KV cache
const cache = {};
for (const name of decoder.inputNames) {
  if (name.startsWith("past_conv")) {
    cache[name] = new ort.Tensor("float32", new Float32Array(hiddenSize * 3), [1, hiddenSize, 3]);
  } else if (name.startsWith("past_key_values")) {
    cache[name] = new ort.Tensor("float32", new Float32Array(0), [1, 8, 0, 64]);
  }
}

// Generation loop
const audioCodes = [];
let inAudioMode = false;
let curLen = inputIds.length;

for (let step = 0; step < 1024; step++) {
  const attentionMask = new ort.Tensor("int64", new BigInt64Array(curLen).fill(1n), [1, curLen]);
  const outputs = await decoder.run({ inputs_embeds: embeds, attention_mask: attentionMask, ...cache });

  // Update cache
  for (const [name, tensor] of Object.entries(outputs)) {
    if (name.startsWith("present_conv")) {
      cache[name.replace("present_conv", "past_conv")] = tensor;
    } else if (name.startsWith("present.")) {
      cache[name.replace("present.", "past_key_values.")] = tensor;
    }
  }

  if (inAudioMode) {
    // Use depthformer to generate audio codes.
    // extractLastPosition (like generateAudioFrame, sumEmbeddings, argmax, and
    // decodeAudio below) is a helper left to the reader; here it slices the
    // final position's vector out of the [1, seqLen, hiddenSize] hidden states.
    const hiddenStates = outputs.hidden_states;
    const lastHidden = extractLastPosition(hiddenStates);

    // Autoregressive codebook generation (8 steps per frame)
    const frameCodes = await generateAudioFrame(depthformer, lastHidden);

    if (frameCodes[0] === 2048) {
      // End of audio
      break;
    }

    audioCodes.push(frameCodes);

    // Get audio embeddings for feedback
    const audioTokens = frameCodes.map((code, cb) => cb * codebookVocab + code);
    const audioEmbedsResult = await audioEmbedding.run({
      audio_codes: new ort.Tensor("int64", new BigInt64Array(audioTokens.map(BigInt)), [1, 8])
    });
    // Sum embeddings across codebooks
    embeds = sumEmbeddings(audioEmbedsResult.audio_embeds);
  } else {
    // Text generation
    const logits = outputs.logits;
    const nextToken = argmax(logits);

    if (nextToken === 128) {
      // <|audio_start|> - switch to audio mode
      inAudioMode = true;
    }

    embeds = getTextEmbeddings([nextToken]);
  }

  curLen++;
}

// Decode audio codes to waveform using detokenizer + ISTFT
const waveform = await decodeAudio(detokenizer, audioCodes);

WebGPU Notes

  • Recommended: Q4 models for all components
  • Audio generation is autoregressive: 8 depthformer calls per audio frame
  • Each audio frame = 80ms of audio (24kHz, 320 hop length, 6x upsampling)
  • End-of-audio token is 2048 in any codebook
  • Large models (>2GB) split weights across multiple files
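The 80ms figure follows directly from the hop length and upsampling factor; a quick sketch of the arithmetic:

```javascript
// Audio timing math from the numbers above.
const sampleRate = 24000;   // TTS output rate (Hz)
const hopLength = 320;      // STFT hop
const upsampling = 6;       // detokenizer temporal upsampling
const samplesPerFrame = hopLength * upsampling;            // 1920 samples
const msPerFrame = (samplesPerFrame / sampleRate) * 1000;  // 80 ms

// Convert a generated frame count to seconds of audio.
function framesToSeconds(nFrames) {
  return (nFrames * samplesPerFrame) / sampleRate;
}
```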

Audio Processing Details

Input (ASR)

  • Sample rate: 16kHz
  • Mel spectrogram: 128 bins, 512 FFT, 160 hop, 400 window
  • Pre-emphasis: 0.97
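For a sense of scale, a 160-sample hop at 16kHz means one mel frame every 10ms. A sketch of the frame-count estimate (the exact count depends on the padding convention in mel_config.json, so this is an approximation):

```javascript
// Sketch: approximate mel frame count for an ASR input at 16 kHz.
const melHop = 160; // samples per hop -> 10 ms per frame at 16 kHz

function approxMelFrames(numSamples) {
  // Assumes the common center-padded STFT convention:
  // floor(n / hop) + 1 frames. Check mel_config.json for the exact scheme.
  return Math.floor(numSamples / melHop) + 1;
}
```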

Output (TTS)

  • Sample rate: 24kHz
  • 8 codebooks with 2049 tokens each (0-2047 audio, 2048 end-of-audio)
  • STFT reconstruction: 1280 FFT, 320 hop
  • Detokenizer provides 6x temporal upsampling
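The feedback step in the WebGPU example flattens each frame's (codebook, code) pair into a single index over the 8 × 2049 embedding table; as a sketch:

```javascript
// Flatten per-codebook codes into embedding-table indices
// (8 codebooks x 2049 tokens each; 2048 = end-of-audio).
const CODEBOOK_VOCAB = 2049;

function flattenCodes(frameCodes) {
  // Codebook cb with code c maps to index cb * 2049 + c.
  return frameCodes.map((code, cb) => cb * CODEBOOK_VOCAB + code);
}
```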

License

This model is released under the LFM 1.0 License.

Capabilities & Tags
onnx · liquid · edge · lfm2.5-audio · lfm2.5 · onnxruntime · webgpu · tts · asr · speech
Specifications

Category: Audio
Access: API & Local
License: Open Source
Pricing: Open Source
Parameters: 1.5B
Rating: 1.5
