HyperCLOVAX SEED Omni 8B

by naver-hyperclovax

Open source · 293k downloads · 186 likes

Rating: 2.8 (186 reviews) · Chat · API & Local

About

HyperCLOVAX SEED Omni 8B is a unified multimodal AI model that combines text, image, and speech processing in a single architecture. It supports bidirectional interaction between these modalities, including text-driven image generation and editing, speech recognition and translation, and speech synthesis, all within a 32K-token context window. Designed as an early step toward Korean-centric "Any-to-Any" intelligence, it is particularly strong at multilingual and multimodal tasks. Its use cases include visual content creation, voice assistance, image analysis, and audio transcription, which makes it suitable for both professional and creative settings. What sets it apart is its unified design and the semantic alignment between modalities, which keeps behavior consistent across them.

Documentation


Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, based on an auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities, including established text capabilities as well as vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech, within a 32K context window. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.


Technical Report

  • HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)

Basic Information

  • Architecture: Transformer-based omni-model architecture (Dense Model)
  • Parameters: 8B
  • Input Format: Text / Image / Video / Audio (Speech)
  • Output Format: Text / Image / Audio (Speech)
  • Context Length: 32K
  • Knowledge Cutoff: May 2025
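
As a rough sanity check on these figures, the size of the weights can be estimated from the parameter count. The sketch below assumes 16-bit (2-byte) weights, which is an assumption and not something stated in this model card; `weight_size_gb` is a hypothetical helper name.

```python
def weight_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# 8B parameters at an assumed 2 bytes each (16-bit weights)
print(weight_size_gb(8e9))  # 16.0
```

Under that assumption the estimate is about 16 GB. It covers weights only and ignores activations, the KV cache, and the separate encoder and decoder components.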

Benchmarks


  • Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
  • Vision-to-Text: SEED-IMG, AI2D, K-MMBench
  • Text-to-Vision: GenEval, ImgEdit
  • Audio-to-Text: LibriSpeech, KsponSpeech
  • Audio-to-Audio: FLEURS en2ko, FLEURS ko2en

Examples

Text-to-Image Generation

[Image: hf_img01]

Text-based Image Editing

[Images: hf_img02, hf_img03, hf_img04]


Inference

We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.

Capabilities

  • Inputs: Text, Image, Audio, Video
  • Outputs: Text, Image, Audio (no video generation)

Requirements

  • 4x NVIDIA A100 80GB
  • Docker & Docker Compose
  • NVIDIA Driver 525+, CUDA 12.1+
  • S3-compatible storage (for image/audio output)
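
Before installing, it can help to confirm that the command-line tools implied by this list are present. The sketch below only checks that `docker` and `nvidia-smi` can be found on `PATH`; it does not verify driver or CUDA versions, GPU count, or S3 access. `missing_tools` and `REQUIRED_TOOLS` are hypothetical names.

```python
import shutil

# Executables assumed to be on PATH for the requirements listed above.
REQUIRED_TOOLS = ("docker", "nvidia-smi")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


missing = missing_tools()
print("Missing tools:", ", ".join(missing) if missing else "none")
```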

Installation

Bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
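
Since model loading takes several minutes, a client script may want to wait for the service instead of watching the logs by hand. The following is a minimal polling sketch: `wait_until_ready` is a hypothetical helper, and the readiness probe is passed in by the caller because this document does not specify a health endpoint.

```python
import time


def wait_until_ready(probe, timeout_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `probe()` until it returns a truthy value or `timeout_s` elapses.

    `probe` is any zero-argument callable. Exceptions it raises are treated
    as "not ready yet". Returns True if the service became ready in time.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            pass  # e.g. connection refused while the model is still loading
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example probe (assumption: the server answers the OpenAI-style model listing):
# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8000/b/v1", api_key="not-needed")
# ready = wait_until_ready(lambda: bool(client.models.list().data))
```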

Basic Usage

Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
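
The message layout in this example (a list of content parts, media first and then text) recurs in most of the requests in this document. A small helper can build it; `user_message` is a hypothetical name, and the sketch only constructs the dictionary without calling the server.

```python
def user_message(text, image_urls=()):
    """Build one user message: image_url parts first, then a single text part."""
    parts = [{"type": "image_url", "image_url": {"url": url}} for url in image_urls]
    parts.append({"type": "text", "text": text})
    return {"role": "user", "content": parts}


message = user_message("What is in this image?", ["https://example.com/image.jpg"])
print(message["content"][0]["type"], message["content"][1]["type"])  # image_url text
```

The result can be passed as an element of `messages` in `client.chat.completions.create(...)`.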

More Examples

Text to Image
Python
import json

SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
Text to Audio
Python
import base64

# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
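
As the example shows, `message.audio.data` holds a base64-encoded URL, not raw audio bytes. The sketch below wraps the decoding step and rejects payloads that do not look like a URL. The `http(s)` check is an assumption about the storage endpoint, and `audio_url_from_data` is a hypothetical helper name.

```python
import base64


def audio_url_from_data(data_b64: str) -> str:
    """Decode the base64 payload from `message.audio.data` into a URL string."""
    url = base64.b64decode(data_b64).decode("utf-8")
    # Assumption: generated audio is served over http(s) from S3-compatible storage.
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"decoded audio payload is not an http(s) URL: {url[:40]!r}")
    return url


# Round trip with a made-up URL standing in for a real response payload
sample = base64.b64encode(b"https://example.com/output.wav").decode()
print(audio_url_from_data(sample))  # https://example.com/output.wav
```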
Audio Input
Python
import base64

audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
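
Note that the `data` field here carries the audio URL encoded as base64, whereas the standard OpenAI `input_audio` part carries base64-encoded audio bytes. A small builder keeps that detail in one place; `audio_part_from_url` is a hypothetical helper name.

```python
import base64


def audio_part_from_url(url: str, fmt: str = "mp3") -> dict:
    """Build an `input_audio` content part whose data is the base64-encoded URL."""
    data = base64.b64encode(url.encode("utf-8")).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": data, "format": fmt}}


part = audio_part_from_url("https://example.com/audio.mp3")
print(part["input_audio"]["format"])  # mp3
```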
Video Input
Python
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
Image to Image
Python
import json

SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
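
The text-to-image and image-to-image examples use the same tool definition, so it can be defined once and reused. The sketch below also reads the tool-call result defensively, covering the cases of no tool call or malformed JSON arguments. `T2I_TOOL` and `image_ref_from_message` are hypothetical names, and the stand-in message at the end is fabricated purely to exercise the helper.

```python
import json
from types import SimpleNamespace

# Same tool schema as in the examples above, defined once for reuse.
T2I_TOOL = {
    "type": "function",
    "function": {
        "name": "t2i_model_generation",
        "description": "Generates an RGB image based on the provided discrete image representation.",
        "parameters": {
            "type": "object",
            "required": ["discrete_image_token"],
            "properties": {
                "discrete_image_token": {
                    "type": "string",
                    "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                    "minLength": 1,
                }
            },
        },
    },
}


def image_ref_from_message(message):
    """Return the `discrete_image_token` argument of the first t2i tool call, or None."""
    for call in getattr(message, "tool_calls", None) or []:
        if call.function.name != T2I_TOOL["function"]["name"]:
            continue
        try:
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            return None  # arguments were cut off or otherwise malformed
        return args.get("discrete_image_token") if isinstance(args, dict) else None
    return None


# Stand-in for response.choices[0].message, for illustration only
fake = SimpleNamespace(tool_calls=[SimpleNamespace(function=SimpleNamespace(
    name="t2i_model_generation",
    arguments=json.dumps({"discrete_image_token": "placeholder-value"}),
))])
print(image_ref_from_message(fake))  # placeholder-value
```

With this in place, both requests can pass `tools=[T2I_TOOL]` and call `image_ref_from_message(response.choices[0].message)`.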
Audio to Audio
Python
import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
Using curl
Bash
# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "chat_template_kwargs": {"skip_reasoning": true}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "chat_template_kwargs": {"skip_reasoning": true}
  }'
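
The OpenAI Python SDK sends the keys of `extra_body` as top-level fields of the JSON request body, so a raw HTTP client reproduces the Python examples by placing `chat_template_kwargs` at the top level. Below is a minimal standard-library sketch of that payload. `build_chat_request` is a hypothetical helper, and the request is only built, not sent.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens, extra_body=None):
    """Build (but do not send) a POST request for the chat completions route."""
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    body.update(extra_body or {})  # merged at the top level, as the SDK does
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/b/v1",
    "track_b_model",
    [{"role": "user", "content": "Say hello"}],
    1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
)
print(req.full_url)  # http://localhost:8000/b/v1/chat/completions
# Sending is left to the caller, e.g. urllib.request.urlopen(req)
```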

Architecture

                         User Request
                    (Image/Audio/Video/Text)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            OmniServe                                    │
│                  POST /b/v1/chat/completions                            │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                     [1] INPUT ENCODING                           │   │
│  │                                                                  │   │
│  │    ┌─────────────────┐               ┌─────────────────┐         │   │
│  │    │  Vision Encoder │               │  Audio Encoder  │         │   │
│  │    └────────┬────────┘               └────────┬────────┘         │   │
│  │             │                                 │                  │   │
│  │             └────────────┬────────────────────┘                  │   │
│  │                          │ embeddings                            │   │
│  └──────────────────────────┼───────────────────────────────────────┘   │
│                             ▼                                           │
│                     ┌──────────────┐                                    │
│                     │   LLM (8B)   │◀──── text                          │
│                     └──────┬───────┘                                    │
│                            │                                            │
│  ┌─────────────────────────┼────────────────────────────────────────┐   │
│  │                  [2] OUTPUT DECODING                             │   │
│  │                         │                                        │   │
│  │          ┌──────────────┼──────────────┐                         │   │
│  │          ▼              ▼              ▼                         │   │
│  │    ┌───────────┐  ┌───────────┐  ┌───────────┐                   │   │
│  │    │   Text    │  │  Vision   │  │   Audio   │                   │   │
│  │    │           │  │  Decoder  │  │  Decoder  │                   │   │
│  │    └───────────┘  └─────┬─────┘  └─────┬─────┘                   │   │
│  │                         │              │                         │   │
│  │                         ▼              ▼                         │   │
│  │                    Image URL      Audio URL                      │   │
│  │                      (S3)           (S3)                         │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                         Response
                   (Text / Image URL / Audio URL)
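
On the client side, the three output paths in the diagram show up in different fields of the response message in the examples above: text in `content`, images through `tool_calls`, and audio in `audio`. The sketch below classifies a message on that basis. `output_kind` is a hypothetical helper, it treats any tool call as an image result, and the stand-in messages are for illustration only.

```python
from types import SimpleNamespace


def output_kind(message) -> str:
    """Classify a chat message as 'image', 'audio', or 'text' by its populated field."""
    if getattr(message, "tool_calls", None):
        return "image"  # image results arrive as a tool call in the examples
    if getattr(message, "audio", None):
        return "audio"  # audio results arrive in message.audio
    return "text"


# Stand-ins for response.choices[0].message
text_msg = SimpleNamespace(content="A cat on a sofa.", tool_calls=None, audio=None)
audio_msg = SimpleNamespace(content=None, tool_calls=None, audio=SimpleNamespace(data="..."))
print(output_kind(text_msg), output_kind(audio_msg))  # text audio
```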

Hardware Requirements

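| Component      | GPU      | VRAM  |
|----------------|----------|-------|
| Vision Encoder | 1x       | ~8GB  |
| Audio Encoder  | (shared) | ~4GB  |
| LLM (8B)       | 1x       | ~16GB |
| Vision Decoder | 1x       | ~16GB |
| Audio Decoder  | (shared) | ~4GB  |
| Total          | 3x       | ~48GB |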
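
The total follows directly from the per-component figures. A short check, with the values copied from the table (all approximate):

```python
# Approximate VRAM per component in GB, copied from the table above
VRAM_GB = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

total = sum(VRAM_GB.values())
print(total)  # 48
```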

Key Parameters

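| Parameter                           | Description                   | Default |
|-------------------------------------|-------------------------------|---------|
| chat_template_kwargs.skip_reasoning | Skip reasoning                | true    |
| max_tokens                          | Max output tokens             | -       |
| temperature                         | Sampling temperature          | 0.7     |
| tools                               | Required for image generation | -       |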
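
To keep these settings consistent across calls, the keyword arguments can be assembled in one place. The sketch below spells out the defaults listed in the table explicitly. `request_params` is a hypothetical helper, and passing `skip_reasoning` through `extra_body` follows the Python examples in this document.

```python
def request_params(max_tokens, temperature=0.7, skip_reasoning=True, tools=None):
    """Return keyword arguments for client.chat.completions.create()."""
    params = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": skip_reasoning}},
    }
    if tools:
        params["tools"] = tools  # only needed for image generation
    return params


params = request_params(256)
print(sorted(params))  # ['extra_body', 'max_tokens', 'temperature']
```

A call then looks like `client.chat.completions.create(model="track_b_model", messages=messages, **request_params(256))`.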

S3 Configuration

Required for image/audio generation:

Bash
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
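
A missing or empty value among these four variables is easy to overlook until the first image or audio request fails. The sketch below parses simple `KEY=VALUE` text and reports which of the variables are absent. `parse_env_text` and `missing_s3_vars` are hypothetical helpers, and the parser is deliberately minimal (no quoting or variable expansion), so it is not a full `.env` implementation.

```python
REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def missing_s3_vars(env):
    """Return the required S3 variable names that are absent or empty in `env`."""
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


example = parse_env_text("NCP_S3_ENDPOINT=https://your-s3-endpoint.com\nNCP_S3_BUCKET_NAME=\n")
print(missing_s3_vars(example))
# ['NCP_S3_ACCESS_KEY', 'NCP_S3_SECRET_KEY', 'NCP_S3_BUCKET_NAME']
```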

For more details, see OmniServe documentation.


Citation

To be updated (Technical Report)


Questions

For any other questions, please feel free to contact us at [email protected].


License

The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.
