Overview

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, based on an auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities, including established text capabilities as well as vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech, within a 32K context window. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.

Technical Report

HyperCLOVAX-SEED-Omni-8B Tech Report (PDF)

Basic Information

Architecture: Transformer-based omni-model architecture (Dense Model)
Parameters: 8B
Input Format: Text/Image/Video/Audio (Speech)
Output Format: Text/Image/Audio (Speech)
Context Length: 32K
Knowledge Cutoff: May 2025

Benchmarks

테크니컬 리포트 05_2@2x

Text-to-Text: MMLU-Pro, GSM8K, KMMLU-Pro, HAERAE 1.0
Vision-to-Text: SEED-IMG, AI2D, K-MMBench
Text-to-Vision: GenEval, ImgEdit
Audio-to-Text: LibriSpeech, KsponSpeech
Audio-to-Audio: FLEURS en2ko, FLEURS ko2en

Examples

Text-to-Image Generation

[Image: hf_img01]

Text-based Image Editing

[Images: hf_img02, hf_img03, hf_img04]

Inference

We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.

Capabilities

Inputs: Text, Image, Audio, Video
Outputs: Text, Image, Audio (no video generation)

Requirements

4x NVIDIA A100 80GB
Docker & Docker Compose
NVIDIA Driver 525+, CUDA 12.1+
S3-compatible storage (for image/audio output)

Installation

Bash

# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe

# Install dependencies
pip install huggingface_hub safetensors torch openai easydict

# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B

# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b

# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials

# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d

# Wait for model loading (~5 minutes)
docker compose logs -f omni

# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
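
Rather than watching the logs by hand, a script can poll until the service answers. The sketch below is a generic retry loop; `wait_until_ready` and its `probe` argument are hypothetical names, and what counts as "ready" (for example, a small chat completion request succeeding) is an assumption, not something documented here.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # Connection errors are expected while the containers are starting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A probe could, for instance, wrap a one-token request made with the OpenAI client and return True when it does not raise.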

Basic Usage

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)

More Examples

The examples below reuse the `client` object created in Basic Usage.

Text to Image

Python

import json

SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
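
The tool-call handling above assumes the arguments are always valid JSON. A slightly more defensive sketch follows; `extract_image_token` is a hypothetical helper, and the stand-in message built with `SimpleNamespace` only mimics the attribute layout used in the example so the snippet runs without a server.

```python
import json
from types import SimpleNamespace
from typing import Optional


def extract_image_token(message) -> Optional[str]:
    """Return the `discrete_image_token` argument of the first
    t2i_model_generation tool call, or None if there is none."""
    for call in getattr(message, "tool_calls", None) or []:
        if call.function.name != "t2i_model_generation":
            continue
        try:
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            # Arguments can be malformed, for example if generation stopped early.
            return None
        if isinstance(args, dict):
            return args.get("discrete_image_token")
        return None
    return None


# Stand-in for response.choices[0].message, for illustration only.
fake_message = SimpleNamespace(
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(
                name="t2i_model_generation",
                arguments=json.dumps({"discrete_image_token": "example-value"}),
            )
        )
    ]
)
print(extract_image_token(fake_message))
```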

Text to Audio

Python

import base64

# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
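
In the example above, `audio.data` is treated as a base64-encoded URL rather than raw audio bytes. A small sketch of that decoding step with error handling; `decode_audio_reference` is a hypothetical name, and returning None on failure is a design choice for this illustration.

```python
import base64
import binascii
from typing import Optional


def decode_audio_reference(data_b64: str) -> Optional[str]:
    """Decode a base64 string into the text it carries (here, an audio URL).

    Returns None if the input is not valid base64 or not valid UTF-8 text.
    """
    try:
        return base64.b64decode(data_b64, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
```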

Audio Input

Python

import base64

audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
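
Note that the `data` field in this example carries the base64 encoding of the audio URL string, not of the audio file contents. The sketch below wraps that step; `audio_url_part` is a hypothetical helper name, and the dictionary layout is copied from the example above.

```python
import base64


def audio_url_part(url: str, fmt: str = "mp3") -> dict:
    """Build an `input_audio` content part whose data is the base64-encoded URL."""
    encoded = base64.b64encode(url.encode("utf-8")).decode("ascii")
    return {"type": "input_audio", "input_audio": {"data": encoded, "format": fmt}}


# Content list for a user message: one audio part followed by a text question.
content = [
    audio_url_part("https://example.com/audio.mp3"),
    {"type": "text", "text": "What is being said?"},
]
```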

Video Input

Python

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)

Image to Image

Python

import json

SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")

Audio to Audio

Python

import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")

Using curl

Bash

# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "chat_template_kwargs": {"skip_reasoning": true}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "chat_template_kwargs": {"skip_reasoning": true}
  }'
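
The Python examples pass `chat_template_kwargs` through `extra_body`, which the OpenAI client merges into the top level of the request JSON. A sketch of building and posting the same body with only the standard library; `build_chat_payload` and `post_chat` are hypothetical helpers, and the URL is the one used in the curl commands.

```python
import json
import urllib.request


def build_chat_payload(messages, max_tokens, skip_reasoning=True, model="track_b_model") -> str:
    """Serialize a request body equivalent to what the Python examples send."""
    body = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        # Top-level key: there is no "extra_body" wrapper on the wire.
        "chat_template_kwargs": {"skip_reasoning": skip_reasoning},
    }
    return json.dumps(body)


def post_chat(payload: str, url: str = "http://localhost:8000/b/v1/chat/completions") -> dict:
    """POST the payload and return the parsed JSON response (requires a running server)."""
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_payload([{"role": "user", "content": "Say hello"}], max_tokens=1000)
```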

Architecture

                         User Request
                    (Image/Audio/Video/Text)
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            OmniServe                                    │
│                  POST /b/v1/chat/completions                            │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                     [1] INPUT ENCODING                           │   │
│  │                                                                  │   │
│  │    ┌─────────────────┐               ┌─────────────────┐         │   │
│  │    │  Vision Encoder │               │  Audio Encoder  │         │   │
│  │    └────────┬────────┘               └────────┬────────┘         │   │
│  │             │                                 │                  │   │
│  │             └────────────┬────────────────────┘                  │   │
│  │                          │ embeddings                            │   │
│  └──────────────────────────┼───────────────────────────────────────┘   │
│                             ▼                                           │
│                     ┌──────────────┐                                    │
│                     │   LLM (8B)   │◀──── text                          │
│                     └──────┬───────┘                                    │
│                            │                                            │
│  ┌─────────────────────────┼────────────────────────────────────────┐   │
│  │                  [2] OUTPUT DECODING                             │   │
│  │                         │                                        │   │
│  │          ┌──────────────┼──────────────┐                         │   │
│  │          ▼              ▼              ▼                         │   │
│  │    ┌───────────┐  ┌───────────┐  ┌───────────┐                   │   │
│  │    │   Text    │  │  Vision   │  │   Audio   │                   │   │
│  │    │           │  │  Decoder  │  │  Decoder  │                   │   │
│  │    └───────────┘  └─────┬─────┘  └─────┬─────┘                   │   │
│  │                         │              │                         │   │
│  │                         ▼              ▼                         │   │
│  │                    Image URL      Audio URL                      │   │
│  │                      (S3)           (S3)                         │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                         Response
                   (Text / Image URL / Audio URL)
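
One way to read the diagram: image and video parts go through the vision encoder, audio parts through the audio encoder, and text reaches the LLM directly. The sketch below only illustrates that reading using the content-part types from the examples above. `encoders_needed` is a hypothetical helper, not part of OmniServe, and routing video through the vision encoder is an assumption based on the Video Input example passing video as `image_url`.

```python
def encoders_needed(content) -> set:
    """Return the input encoders a user message would touch under the reading above."""
    if isinstance(content, str):
        return set()  # plain text goes straight to the LLM
    mapping = {"image_url": "vision_encoder", "input_audio": "audio_encoder"}
    return {mapping[part["type"]] for part in content if part.get("type") in mapping}


example_content = [
    {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
    {"type": "text", "text": "What is in this image?"},
]
print(encoders_needed(example_content))
```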

Hardware Requirements

Component	GPU	VRAM
Vision Encoder	1x	~8GB
Audio Encoder	(shared)	~4GB
LLM (8B)	1x	~16GB
Vision Decoder	1x	~16GB
Audio Decoder	(shared)	~4GB
Total	3x	~48GB
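
The total is the sum of the per-component estimates. A quick check of the arithmetic, with values copied from the table (the variable names are illustrative):

```python
# Approximate VRAM per component in GB, as listed in the table.
vram_gb = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

total_gb = sum(vram_gb.values())
print(f"Total VRAM: ~{total_gb}GB")
```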

Key Parameters

Parameter	Description	Default
`chat_template_kwargs.skip_reasoning`	Skip reasoning	`true`
`max_tokens`	Max output tokens	-
`temperature`	Sampling temperature	0.7
`tools`	Required for image generation	-

S3 Configuration

Required for image/audio generation:

Bash

NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
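
Before starting the containers, it can help to confirm that all four variables are set. A minimal sketch, assuming the values are available as environment variables or as a dictionary parsed from `.env`; `missing_s3_settings` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional

REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_settings(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names of required S3 variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name, "").strip()]


example_env = {
    "NCP_S3_ENDPOINT": "https://your-s3-endpoint.com",
    "NCP_S3_ACCESS_KEY": "your-access-key",
    "NCP_S3_SECRET_KEY": "",
}
print(missing_s3_settings(example_env))
```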

For more details, see the OmniServe documentation.

Citation

To be updated (Technical Report)

Questions

For any other questions, please feel free to contact us at [email protected].

License

The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.
