
LFM2.5 Audio 1.5B ONNX

by LiquidAI

Open source · 339 downloads · 16 likes

1.5 (16 reviews) · Audio · API & Local
About

LFM2.5 Audio 1.5B ONNX is a multimodal generative model that processes both text and audio. It covers three main tasks: speech recognition (ASR) to transcribe audio into text, text-to-speech (TTS) to generate audio from text, and an interleaved mode that alternates between text and audio in a single exchange. Built for efficient inference across platforms, it offers variants optimized for WebGPU or for servers, and suits applications that need fluid audio-and-text understanding and generation, whether deployed locally or in the cloud. Its lightweight architecture and autoregressive generation make it well suited to interactive or automated solutions.

Documentation
Liquid AI
Try LFM • Documentation • LEAP

LFM2.5-Audio-1.5B-ONNX

ONNX export of LFM2.5-Audio-1.5B for cross-platform inference.

LFM2.5-Audio is a multimodal model supporting three modes:

  • ASR (Automatic Speech Recognition): Audio → Text
  • TTS (Text-to-Speech): Text → Audio
  • Interleaved: Mixed text and audio input/output

Recommended Variants

Decoder   Vocoder   Size      Platform          Use Case
Q4        Q4        ~1.5GB    WebGPU, Server    Recommended for most uses
FP16      FP16      ~3.2GB    Server            Higher quality
  • WebGPU: Use Q4 decoder + Q4 vocoder (Q8 not supported)
  • Server: Q4 for efficiency, FP16 for quality
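
As an illustration of how this choice maps onto the file names listed under Model Files below, here is a small sketch (selectVariant and the platform labels are hypothetical helpers, not part of the repository):

JavaScript
// Hypothetical helper: pick decoder/vocoder files for a target platform.
function selectVariant(platform) {
  if (platform === "webgpu") {
    // Q8 is not supported on WebGPU, so stay on Q4 for both components.
    return { decoder: "decoder_q4.onnx", vocoder: "vocoder_depthformer_q4.onnx" };
  }
  // On a server, choose Q4 for efficiency or FP16 for quality.
  return { decoder: "decoder_fp16.onnx", vocoder: "vocoder_depthformer_fp16.onnx" };
}

console.log(selectVariant("webgpu")); // { decoder: "decoder_q4.onnx", ... }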

Model Files

Text
onnx/
├── decoder.onnx                    # LFM2 backbone (FP32)
├── decoder.onnx_data*
├── decoder_fp16.onnx               # LFM2 backbone (FP16)
├── decoder_fp16.onnx_data*
├── decoder_q4.onnx                 # LFM2 backbone (Q4, recommended)
├── decoder_q4.onnx_data
├── audio_encoder.onnx              # Conformer encoder for ASR (FP32)
├── audio_encoder.onnx_data
├── audio_encoder_fp16.onnx         # Conformer encoder (FP16)
├── audio_encoder_fp16.onnx_data
├── audio_encoder_q4.onnx           # Conformer encoder (Q4)
├── audio_encoder_q4.onnx_data
├── audio_embedding.onnx            # Audio code embeddings (FP32)
├── audio_embedding_fp16.onnx       # Audio code embeddings (FP16)
├── audio_embedding_q4.onnx         # Audio code embeddings (Q4)
├── audio_detokenizer.onnx          # Neural vocoder STFT (FP32)
├── audio_detokenizer.onnx_data
├── audio_detokenizer_fp16.onnx     # Neural vocoder (FP16)
├── audio_detokenizer_fp16.onnx_data
├── audio_detokenizer_q4.onnx       # Neural vocoder (Q4)
├── audio_detokenizer_q4.onnx_data
├── vocoder_depthformer.onnx        # Audio codebook prediction (FP32)
├── vocoder_depthformer.onnx_data
├── vocoder_depthformer_fp16.onnx   # Audio codebook prediction (FP16)
├── vocoder_depthformer_fp16.onnx_data
├── vocoder_depthformer_q4.onnx     # Audio codebook prediction (Q4)
├── vocoder_depthformer_q4.onnx_data
├── embed_tokens.bin                # Text embeddings (binary)
├── embed_tokens.json               # Text embeddings metadata
├── audio_embedding.bin             # Audio embeddings (binary, for direct lookup)
├── audio_embedding.json            # Audio embeddings metadata
└── mel_config.json                 # Mel spectrogram configuration

* Large models (>2GB) split weights across multiple files:
  decoder.onnx_data, decoder.onnx_data_1, decoder.onnx_data_2, etc.
  All data files must be in the same directory as the .onnx file.

Python

Use the onnx-export repository for inference.

Installation

Bash
git clone https://github.com/Liquid4All/onnx-export.git
cd onnx-export
uv sync

ASR (Speech Recognition)

Transcribe audio to text:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode asr \
    --audio input.wav \
    --precision q4

TTS (Text-to-Speech)

Generate audio from text:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode tts \
    --prompt "Hello, this is a test of text to speech synthesis." \
    --output output.wav \
    --precision q4

Options:

  • --system "Perform TTS. Use the UK female voice." - Custom system prompt
  • --audio-temperature 0.8 - Audio sampling temperature
  • --audio-top-k 64 - Top-k sampling for audio

Interleaved (Mixed Audio/Text)

Generate interleaved text and audio response from audio input:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode interleaved \
    --audio input.wav \
    --output output.wav \
    --precision q4

Or from text prompt:

Bash
uv run lfm2-audio-infer /path/to/LFM2.5-Audio-1.5B-ONNX \
    --mode interleaved \
    --prompt "Respond with audio" \
    --output output.wav \
    --precision q4

CLI Options

Bash
uv run lfm2-audio-infer --help
Option                 Description
--mode                 asr, tts, or interleaved
--precision            fp16, q4, or q8 (default: fp32)
--audio                Input audio file (WAV)
--output               Output audio file (WAV)
--prompt               Text prompt
--system               System prompt
--max-tokens           Maximum tokens to generate
--temperature          Text sampling temperature
--audio-temperature    Audio sampling temperature
--audio-top-k          Top-k sampling for audio
--seed                 Random seed for reproducibility

WebGPU (Browser)

Installation

Bash
npm install onnxruntime-web @huggingface/transformers

Enable WebGPU

WebGPU is required for browser inference. To enable:

  1. Chrome/Edge: Navigate to chrome://flags/#enable-unsafe-webgpu, enable, and restart
  2. Verify: Check chrome://gpu for "WebGPU" status
  3. Test: Run navigator.gpu.requestAdapter() in DevTools console
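
A quick sanity check for steps 2 and 3, using only the standard WebGPU API (nothing here is specific to this model):

JavaScript
// Confirm the browser exposes WebGPU and can provide an adapter.
if (!navigator.gpu) {
  throw new Error("WebGPU not available in this browser.");
}
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) {
  throw new Error("No WebGPU adapter found (check chrome://gpu).");
}
console.log("shader-f16 supported:", adapter.features.has("shader-f16"));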

Inference

JavaScript
import * as ort from "onnxruntime-web/webgpu";
import { AutoTokenizer } from "@huggingface/transformers";

// Check WebGPU availability
if (!navigator.gpu) {
  throw new Error("WebGPU not available. Enable at chrome://flags/#enable-unsafe-webgpu");
}

ort.env.wasm.numThreads = 1;

const modelId = "LiquidAI/LFM2.5-Audio-1.5B-ONNX";
const modelBase = `https://huggingface.co/${modelId}/resolve/main`;

// Load tokenizer
const tokenizer = await AutoTokenizer.from_pretrained(modelId);

// Load ONNX sessions
async function loadSession(name, dataFiles = 1) {
  const onnxPath = `${modelBase}/onnx/${name}.onnx`;
  const externalData = [];
  for (let i = 0; i < dataFiles; i++) {
    const suffix = i === 0 ? "" : `_${i}`;
    const fileName = `${name}.onnx_data${suffix}`;
    externalData.push({ path: fileName, data: `${modelBase}/onnx/${fileName}` });
  }
  return ort.InferenceSession.create(onnxPath, {
    executionProviders: ["webgpu"],
    externalData,
  });
}

// Load models (Q4 recommended for WebGPU)
const decoder = await loadSession("decoder_q4");
const audioEmbedding = await loadSession("audio_embedding_q4");
const detokenizer = await loadSession("audio_detokenizer_q4");
const depthformer = await loadSession("vocoder_depthformer_q4");

// Load text embeddings binary
const embedResponse = await fetch(`${modelBase}/onnx/embed_tokens.bin`);
const embedBuffer = await embedResponse.arrayBuffer();
const embedMetaResponse = await fetch(`${modelBase}/onnx/embed_tokens.json`);
const embedMeta = await embedMetaResponse.json();
const embedWeight = new Float32Array(embedBuffer);

function getTextEmbeddings(ids) {
  const hiddenSize = embedMeta.hidden_size;
  const embeds = new Float32Array(ids.length * hiddenSize);
  for (let i = 0; i < ids.length; i++) {
    const offset = ids[i] * hiddenSize;
    embeds.set(embedWeight.subarray(offset, offset + hiddenSize), i * hiddenSize);
  }
  return new ort.Tensor("float32", embeds, [1, ids.length, hiddenSize]);
}

// Model config
const hiddenSize = 2048;
const numCodebooks = 8;
const codebookVocab = 2049;

// TTS example
const text = "Hello, this is a test.";
const prompt = `<|startoftext|><|im_start|>system
Perform TTS. Use the UK female voice.<|im_end|>
<|im_start|>user
${text}<|im_end|>
<|im_start|>assistant
`;

const inputIds = tokenizer.encode(prompt);
let embeds = getTextEmbeddings(inputIds);

// Initialize KV cache
const cache = {};
for (const name of decoder.inputNames) {
  if (name.startsWith("past_conv")) {
    cache[name] = new ort.Tensor("float32", new Float32Array(hiddenSize * 3), [1, hiddenSize, 3]);
  } else if (name.startsWith("past_key_values")) {
    cache[name] = new ort.Tensor("float32", new Float32Array(0), [1, 8, 0, 64]);
  }
}

// Generation loop
const audioCodes = [];
let inAudioMode = false;
let curLen = inputIds.length;

for (let step = 0; step < 1024; step++) {
  const attentionMask = new ort.Tensor("int64", new BigInt64Array(curLen).fill(1n), [1, curLen]);
  const outputs = await decoder.run({ inputs_embeds: embeds, attention_mask: attentionMask, ...cache });

  // Update cache
  for (const [name, tensor] of Object.entries(outputs)) {
    if (name.startsWith("present_conv")) {
      cache[name.replace("present_conv", "past_conv")] = tensor;
    } else if (name.startsWith("present.")) {
      cache[name.replace("present.", "past_key_values.")] = tensor;
    }
  }

  if (inAudioMode) {
    // Use depthformer to generate audio codes
    const hiddenStates = outputs.hidden_states;
    // Extract the hidden state at the last sequence position (shape [1, seqLen, hiddenSize])
    const seqLen = hiddenStates.dims[1];
    const lastHidden = hiddenStates.data.slice((seqLen - 1) * hiddenSize, seqLen * hiddenSize);

    // Autoregressive codebook generation (8 steps per frame)
    const frameCodes = await generateAudioFrame(depthformer, lastHidden);

    if (frameCodes[0] === 2048) {
      // End of audio
      break;
    }

    audioCodes.push(frameCodes);

    // Get audio embeddings for feedback
    const audioTokens = frameCodes.map((code, cb) => cb * codebookVocab + code);
    const audioEmbedsResult = await audioEmbedding.run({
      audio_codes: new ort.Tensor("int64", new BigInt64Array(audioTokens.map(BigInt)), [1, 8])
    });
    // Sum embeddings across codebooks
    embeds = sumEmbeddings(audioEmbedsResult.audio_embeds);
  } else {
    // Text generation
    const logits = outputs.logits;
    const nextToken = argmax(logits);

    if (nextToken === 128) {
      // <|audio_start|> - switch to audio mode
      inAudioMode = true;
    }

    embeds = getTextEmbeddings([nextToken]);
  }

  curLen++;
}

// Decode audio codes to waveform using detokenizer + ISTFT
const waveform = await decodeAudio(detokenizer, audioCodes);
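
The loop above calls several helpers (generateAudioFrame, sumEmbeddings, argmax, decodeAudio) that the card does not define. Below is a minimal sketch of the two simplest ones, assuming logits has shape [1, seqLen, vocabSize] and audio_embeds has shape [1, numCodebooks, hiddenSize]; these shapes are assumptions, not confirmed by the card.

JavaScript
// Greedy pick of the most likely token at the last sequence position.
// Assumes logits is an ort.Tensor with dims [1, seqLen, vocabSize].
function argmax(logits) {
  const [, seqLen, vocabSize] = logits.dims;
  const offset = (seqLen - 1) * vocabSize;
  let best = 0;
  for (let i = 1; i < vocabSize; i++) {
    if (logits.data[offset + i] > logits.data[offset + best]) best = i;
  }
  return best;
}

// Sum the per-codebook audio embeddings into a single [1, 1, hiddenSize] tensor
// to feed back into the decoder. Assumes audio_embeds has dims [1, numCodebooks, hiddenSize].
function sumEmbeddings(audioEmbeds) {
  const [, numCodebooks, hiddenSize] = audioEmbeds.dims;
  const out = new Float32Array(hiddenSize);
  for (let cb = 0; cb < numCodebooks; cb++) {
    for (let h = 0; h < hiddenSize; h++) {
      out[h] += audioEmbeds.data[cb * hiddenSize + h];
    }
  }
  return new ort.Tensor("float32", out, [1, 1, hiddenSize]);
}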

WebGPU Notes

  • Recommended: Q4 models for all components
  • Audio generation is autoregressive: 8 depthformer calls per audio frame
  • Each audio frame = 80ms of audio (24kHz, 320 hop length, 6x upsampling)
  • End-of-audio token is 2048 in any codebook
  • Large models (>2GB) split weights across multiple files

Audio Processing Details

Input (ASR)

  • Sample rate: 16kHz
  • Mel spectrogram: 128 bins, 512 FFT, 160 hop, 400 window
  • Pre-emphasis: 0.97

Output (TTS)

  • Sample rate: 24kHz
  • 8 codebooks with 2049 tokens each (0-2047 audio, 2048 end-of-audio)
  • STFT reconstruction: 1280 FFT, 320 hop
  • Detokenizer provides 6x temporal upsampling
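
A quick check of the frame timing implied by these numbers (the constants come from the notes above; only samplesPerFrame and frameMs are derived):

JavaScript
// One generated frame covers hop * upsampling output samples at 24kHz.
const sampleRate = 24000;  // Hz
const hopLength = 320;     // STFT hop
const upsampling = 6;      // detokenizer temporal upsampling
const samplesPerFrame = hopLength * upsampling;         // 1920 samples
const frameMs = (samplesPerFrame / sampleRate) * 1000;  // 80 ms per frame
console.log(samplesPerFrame, frameMs); // 1920 80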

License

This model is released under the LFM 1.0 License.

Links & Resources

Specifications

  • Category: Audio
  • Access: API & Local
  • License: Open Source
  • Pricing: Open Source
  • Parameters: 1.5B
  • Rating: 1.5
