
HyperCLOVAX SEED Omni 8B

by naver-hyperclovax

Open source · 293k downloads · 186 likes

Rating: 2.8 (186 reviews) · Chat · API & Local
About

HyperCLOVAX SEED Omni 8B is a unified multimodal model that handles text, image, and speech processing within a single architecture. It supports bidirectional interaction between these modalities, including text-to-image generation and editing, speech recognition and translation, and text-to-speech synthesis, all within a 32K-token context window. Positioned as an early step toward Korean-centered "Any-to-Any" intelligence, it targets multilingual and multimodal tasks. Use cases include visual content creation, voice assistance, image analysis, and audio transcription. Its distinguishing feature is a unified design that aligns text, vision, and audio representations in a shared semantic space.

Documentation


Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, based on an auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities, including established text capabilities as well as vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech, within a 32K context window. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.


Technical Report

  • HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)

Basic Information

  • Architecture: Transformer-based omni-model architecture (Dense Model)
  • Parameters: 8B
  • Input Format: Text/Image/Video/Audio (Speech)
  • Output Format: Text/Image/Audio (Speech)
  • Context Length: 32K
  • Knowledge Cutoff: May 2025

Benchmarks

[Figure: Technical Report 05_2 (benchmark chart)]

  • Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
  • Vision-to-Text: SEED-IMG, AI2D, K-MMBench
  • Text-to-Vision: GenEval, ImgEdit
  • Audio-to-Text: LibriSpeech, KsponSpeech
  • Audio-to-Audio: FLEURS en2ko, FLEURS ko2en

Examples

Text-to-Image Generation

[Image: hf_img01]

Text-based Image Editing

[Images: hf_img02, hf_img03, hf_img04]


Inference

We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.

Capabilities

  • Inputs: Text, Image, Audio, Video
  • Outputs: Text, Image, Audio (no video generation)

Requirements

  • 4x NVIDIA A100 80GB
  • Docker & Docker Compose
  • NVIDIA Driver 525+, CUDA 12.1+
  • S3-compatible storage (for image/audio output)

Installation

Bash
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
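
Because the model takes several minutes to load, a script that calls the API right after docker compose up may want to poll until the server answers. The following is a minimal sketch under that assumption; wait_until_ready and the probe callable are illustrative names and are not part of OmniServe.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Connection errors are expected while the containers start.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A probe could be any function that sends a small request to the endpoint and returns True on success, for example a one-token chat completion.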

Basic Usage

Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
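
The request above mixes an image part and a text part in one user message. A small helper can keep that structure consistent across calls; this is a sketch only, and user_message is a hypothetical name rather than something OmniServe provides.

```python
from typing import Any, Dict, List, Optional


def user_message(text: str, image_url: Optional[str] = None) -> Dict[str, Any]:
    """Build a user message, placing the image part before the text part."""
    if image_url is None:
        # Text-only requests can pass the content as a plain string.
        return {"role": "user", "content": text}
    parts: List[Dict[str, Any]] = [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": text},
    ]
    return {"role": "user", "content": parts}
```

The result can be passed directly in the messages list, for example messages=[user_message("What is in this image?", "https://example.com/image.jpg")].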

More Examples

The examples below reuse the client object created in Basic Usage.

Text to Image
Python
import json

SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
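
The snippet above assumes the first tool call is present and that its arguments are valid JSON. A more defensive reading might look like the sketch below; first_tool_arguments is a hypothetical helper, and the SimpleNamespace object only stands in for a real response so the block can run without a server.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Optional


def first_tool_arguments(response: Any, tool_name: str) -> Optional[Dict[str, Any]]:
    """Return the parsed arguments of the first call to tool_name, or None."""
    message = response.choices[0].message
    for call in getattr(message, "tool_calls", None) or []:
        if call.function.name != tool_name:
            continue
        try:
            parsed = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            # The arguments may be truncated, e.g. if max_tokens was too low.
            return None
        return parsed if isinstance(parsed, dict) else None
    return None


# Stand-in for a real response object, for illustration only.
fake_call = SimpleNamespace(
    function=SimpleNamespace(
        name="t2i_model_generation",
        arguments='{"discrete_image_token": "example-value"}',
    )
)
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(tool_calls=[fake_call]))]
)
print(first_tool_arguments(fake_response, "t2i_model_generation"))
```
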
Text to Audio
Python
import base64

# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
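
In the example above, the audio data field is base64-decoded into a URL string. The sketch below adds a basic sanity check on the decoded value before it is used; decode_audio_url is a hypothetical helper, and the assumption that the value is always an http(s) URL comes from this example rather than from a documented guarantee.

```python
import base64
import binascii
from typing import Optional
from urllib.parse import urlparse


def decode_audio_url(data: str) -> Optional[str]:
    """Decode a base64 string; return it only if it looks like an http(s) URL."""
    try:
        text = base64.b64decode(data, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        return None
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return text
    return None
```
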
Audio Input
Python
import base64

audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
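
As the example shows, the data field of an input_audio part carries the base64-encoded URL of the audio file, not the raw audio bytes. A small helper can make that explicit; audio_part is a hypothetical name used only for this sketch.

```python
import base64
from typing import Any, Dict


def audio_part(url: str, fmt: str = "mp3") -> Dict[str, Any]:
    """Build an input_audio content part carrying a base64-encoded URL."""
    encoded = base64.b64encode(url.encode("utf-8")).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": encoded, "format": fmt}}
```
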
Video Input
Python
# Note: the video URL is passed in an image_url content part.
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
Image to Image
Python
import json

SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
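
The Text to Image and Image to Image examples pass the same tool definition. One way to avoid repeating it is to define it once and reuse it; the sketch below copies the definition from the examples, and the constant names T2I_TOOL_NAME, T2I_TOKEN_DESCRIPTION, and T2I_TOOL are illustrative.

```python
from typing import Any, Dict

T2I_TOOL_NAME = "t2i_model_generation"

# Description text copied from the examples in this document.
T2I_TOKEN_DESCRIPTION = (
    "A serialized string of discrete vision tokens, encapsulated by special tokens. "
    "The format must be strictly followed: "
    "<|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|>"
    "<|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|>"
    "<|discrete_image_end|>."
)

T2I_TOOL: Dict[str, Any] = {
    "type": "function",
    "function": {
        "name": T2I_TOOL_NAME,
        "description": "Generates an RGB image based on the provided discrete image representation.",
        "parameters": {
            "type": "object",
            "required": ["discrete_image_token"],
            "properties": {
                "discrete_image_token": {
                    "type": "string",
                    "description": T2I_TOKEN_DESCRIPTION,
                    "minLength": 1,
                }
            },
        },
    },
}
```

Both image examples could then pass tools=[T2I_TOOL].
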
Audio to Audio
Python
import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
Using curl
Bash
# Note: extra_body is an argument of the Python SDK whose keys are merged into
# the request body, so chat_template_kwargs is sent as a top-level field here.

# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "chat_template_kwargs": {"skip_reasoning": true}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "chat_template_kwargs": {"skip_reasoning": true}
  }'
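
In the Python examples, extra_body is an option of the OpenAI SDK that adds extra JSON properties to the request body. When writing raw HTTP requests by hand, it can help to see what the resulting body looks like. The sketch below mimics that merge; build_request_body is a hypothetical helper, and it does not establish which fields a given server accepts.

```python
import json
from typing import Any, Dict, List, Optional


def build_request_body(
    model: str,
    messages: List[Dict[str, Any]],
    max_tokens: int,
    extra_body: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge extra_body keys into the top level of the request body."""
    body: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
    }
    body.update(extra_body or {})
    return body


body = build_request_body(
    "track_b_model",
    [{"role": "user", "content": "Say hello"}],
    1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}},
)
print(json.dumps(body, indent=2))
```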

Architecture

Text
                         User Request
                    (Image/Audio/Video/Text)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            OmniServe                                    │
│                  POST /b/v1/chat/completions                            │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                     [1] INPUT ENCODING                           │   │
│  │                                                                  │   │
│  │    ┌─────────────────┐               ┌─────────────────┐         │   │
│  │    │  Vision Encoder │               │  Audio Encoder  │         │   │
│  │    └────────┬────────┘               └────────┬────────┘         │   │
│  │             │                                 │                  │   │
│  │             └────────────┬────────────────────┘                  │   │
│  │                          │ embeddings                            │   │
│  └──────────────────────────┼───────────────────────────────────────┘   │
│                             ▼                                           │
│                     ┌──────────────┐                                    │
│                     │   LLM (8B)   │◀──── text                          │
│                     └──────┬───────┘                                    │
│                            │                                            │
│  ┌─────────────────────────┼────────────────────────────────────────┐   │
│  │                  [2] OUTPUT DECODING                             │   │
│  │                         │                                        │   │
│  │          ┌──────────────┼──────────────┐                         │   │
│  │          ▼              ▼              ▼                         │   │
│  │    ┌───────────┐  ┌───────────┐  ┌───────────┐                   │   │
│  │    │   Text    │  │  Vision   │  │   Audio   │                   │   │
│  │    │           │  │  Decoder  │  │  Decoder  │                   │   │
│  │    └───────────┘  └─────┬─────┘  └─────┬─────┘                   │   │
│  │                         │              │                         │   │
│  │                         ▼              ▼                         │   │
│  │                    Image URL      Audio URL                      │   │
│  │                      (S3)           (S3)                         │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                         Response
                   (Text / Image URL / Audio URL)
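
One way to read the input-encoding stage of the diagram is as a mapping from request content parts to encoders. The sketch below is an illustration of that reading, not OmniServe's implementation: the mapping is an assumption inferred from the diagram and the request examples, and encoders_needed is a hypothetical name.

```python
from typing import Any, Dict, List, Set, Union

# Assumed mapping, read off the diagram and the request examples:
# image_url parts (image and video URLs) go to the vision encoder,
# input_audio parts go to the audio encoder, and text goes straight to the LLM.
ENCODER_FOR_PART = {
    "image_url": "vision_encoder",
    "input_audio": "audio_encoder",
}


def encoders_needed(content: Union[str, List[Dict[str, Any]]]) -> Set[str]:
    """Return the encoders a message's content would pass through."""
    if isinstance(content, str):
        return set()  # plain text needs no encoder
    return {
        ENCODER_FOR_PART[part["type"]]
        for part in content
        if part.get("type") in ENCODER_FOR_PART
    }
```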

Hardware Requirements

Component        GPU        VRAM
Vision Encoder   1x         ~8GB
Audio Encoder    (shared)   ~4GB
LLM (8B)         1x         ~16GB
Vision Decoder   1x         ~16GB
Audio Decoder    (shared)   ~4GB
Total            3x         ~48GB
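
The totals in the table follow from summing the per-component figures. The short check below reproduces that arithmetic using the approximate values from the table; the variable names are illustrative.

```python
# Approximate VRAM per component in GB, and whether it has a dedicated GPU,
# as listed in the table above.
COMPONENTS = {
    "Vision Encoder": (8, True),
    "Audio Encoder": (4, False),   # shared GPU
    "LLM (8B)": (16, True),
    "Vision Decoder": (16, True),
    "Audio Decoder": (4, False),   # shared GPU
}

total_vram_gb = sum(vram for vram, _ in COMPONENTS.values())
dedicated_gpus = sum(1 for _, dedicated in COMPONENTS.values() if dedicated)

print(total_vram_gb)   # 48
print(dedicated_gpus)  # 3
```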

Key Parameters

Parameter                             Description                     Default
chat_template_kwargs.skip_reasoning   Skip reasoning                  true
max_tokens                            Max output tokens               -
temperature                           Sampling temperature            0.7
tools                                 Required for image generation   -
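
The parameters in the table can be collected into one set of keyword arguments for client.chat.completions.create. The sketch below mirrors the listed defaults explicitly; chat_kwargs is a hypothetical helper, not part of OmniServe or the OpenAI SDK.

```python
from typing import Any, Dict, List, Optional


def chat_kwargs(
    max_tokens: int,
    temperature: float = 0.7,
    skip_reasoning: bool = True,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments using the defaults listed in the table."""
    kwargs: Dict[str, Any] = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": skip_reasoning}},
    }
    if tools:
        # Only needed for image generation requests.
        kwargs["tools"] = tools
    return kwargs
```

A call would then look like client.chat.completions.create(model="track_b_model", messages=messages, **chat_kwargs(256)).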

S3 Configuration

Required for image/audio generation:

Bash
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name

For more details, see the OmniServe documentation.
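
Since image and audio generation depend on the four S3 variables above, a quick check before starting the containers can catch a missing value early. This is a minimal sketch; missing_s3_vars is a hypothetical helper that only tests for presence, not for valid credentials.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required S3 variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name, "").strip()]
```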


Citation

TBU (Technical Report)


Questions

For any other questions, please feel free to contact us at [email protected].


License

The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.

Capabilities & Tags

transformers, diffusers, safetensors, vlm, text-generation, conversational, custom_code, endpoints_compatible

Specifications

  • Category: Chat
  • Access: API & Local
  • License: Open Source
  • Pricing: Open Source
  • Parameters: 8B