by naver-hyperclovax
Open source · 293k downloads · 186 likes
HyperCLOVAX SEED Omni 8B is a unified multimodal model that handles text, images, and speech within a single architecture. It supports bidirectional interaction between these modalities, including text-to-image generation and editing, speech recognition and translation, and text-to-speech synthesis, all within a 32,000-token context window. Positioned as an early step toward Korean-centered "Any-to-Any" intelligence, it targets multilingual and multimodal tasks. Its use cases include visual content creation, voice assistance, image analysis, and audio transcription. Its distinguishing feature is a unified design that aligns text, image, and audio representations in a shared semantic space.

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, based on an auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities, including established text capabilities as well as vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech, within a 32K context window. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.



We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe
# Install dependencies
pip install huggingface_hub safetensors torch openai easydict
# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B
# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b
# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials
# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d
# Wait for model loading (~5 minutes)
docker compose logs -f omni
# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
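Because model loading takes roughly five minutes, a client script may want to wait for the server before sending requests. The following is a minimal sketch, not part of OmniServe: `http_probe` and `wait_until_ready` are hypothetical helpers, and the `/b/v1/models` route is an assumption based on what OpenAI-compatible servers commonly expose, so it may need to be replaced with whatever health route OmniServe actually provides.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 10.0,
                     sleep=time.sleep) -> bool:
    """Call `probe()` until it returns True or `attempts` are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example usage (assumed endpoint, adjust to your deployment):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8000/b/v1/models"))
# print("ready" if ready else "timed out")
```

Passing the probe as a callable keeps the retry loop independent of the endpoint, so a different readiness check can be swapped in without touching the loop.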
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
import json

# Text-to-image generation: the system prompt tells the model to call the tool
SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

# The generated image is returned through the tool call arguments
if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
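The image generation example reads the first tool call directly. A slightly more defensive variant is sketched below, assuming only the response shape used in that example (`choices[0].message.tool_calls[i].function.name` and `.arguments`). `first_tool_arguments` is a hypothetical helper, and `fake_response` is a stand-in object built for illustration, not real server output. Returning `None` on invalid JSON is a design choice that covers, for instance, arguments cut off by `max_tokens`.

```python
import json
from types import SimpleNamespace


def first_tool_arguments(response, tool_name="t2i_model_generation"):
    """Return the parsed arguments of the first matching tool call, or None."""
    message = response.choices[0].message
    for call in (message.tool_calls or []):
        if call.function.name == tool_name:
            try:
                return json.loads(call.function.arguments)
            except json.JSONDecodeError:
                # Arguments were not valid JSON (for example, truncated output).
                return None
    return None


# Stand-in shaped like a chat completion response (illustration only).
fake_response = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(
    tool_calls=[SimpleNamespace(function=SimpleNamespace(
        name="t2i_model_generation",
        arguments='{"discrete_image_token": "example-value"}',
    ))]
))])

args = first_tool_arguments(fake_response)
print(args["discrete_image_token"])  # prints: example-value
```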
import base64

# Text-to-speech
# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

# The audio field carries a base64-encoded URL to the generated audio
if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
import base64

# Audio understanding: the audio URL is passed base64-encoded
audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
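In the audio examples above, the base64 fields carry a URL string rather than raw audio bytes: the input URL is encoded before sending, and the returned `audio.data` is decoded back into a URL. The round trip can be sketched with two small helpers. `encode_url` and `decode_url` are hypothetical names introduced here for illustration and are not part of OmniServe or the OpenAI client.

```python
import base64


def encode_url(url: str) -> str:
    """Base64-encode a URL string for the `input_audio.data` field."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def decode_url(data: str) -> str:
    """Decode a base64 string such as `message.audio.data` back into a URL."""
    return base64.b64decode(data).decode("utf-8")


encoded = encode_url("https://example.com/audio.mp3")
print(encoded)
print(decode_url(encoded))  # prints: https://example.com/audio.mp3
```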
# Video understanding: the video URL is passed through the image_url content type
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

print(response.choices[0].message.content)
import json

# Image editing: an input image plus an instruction, returned through the same tool
SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
import base64

# Speech-to-speech
# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)

if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
      {"type": "text", "text": "Describe this image."}
    ]}],
    "max_tokens": 256,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'

# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "track_b_model",
    "messages": [{"role": "user", "content": "Say hello"}],
    "max_tokens": 1000,
    "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
  }'
                               User Request
                         (Image/Audio/Video/Text)
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                OmniServe                                │
│                       POST /b/v1/chat/completions                       │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                        [1] INPUT ENCODING                         │  │
│  │                                                                   │  │
│  │          ┌─────────────────┐         ┌─────────────────┐          │  │
│  │          │  Vision Encoder │         │  Audio Encoder  │          │  │
│  │          └────────┬────────┘         └────────┬────────┘          │  │
│  │                   │                           │                   │  │
│  │                   └─────────────┬─────────────┘                   │  │
│  │                                 │ embeddings                      │  │
│  └─────────────────────────────────┼─────────────────────────────────┘  │
│                                    ▼                                    │
│                             ┌──────────────┐                            │
│                             │   LLM (8B)   │◀──── text                  │
│                             └──────┬───────┘                            │
│                                    │                                    │
│  ┌─────────────────────────────────┼─────────────────────────────────┐  │
│  │                        [2] OUTPUT DECODING                        │  │
│  │                                 │                                 │  │
│  │                  ┌──────────────┼──────────────┐                  │  │
│  │                  ▼              ▼              ▼                  │  │
│  │            ┌───────────┐  ┌───────────┐  ┌───────────┐            │  │
│  │            │   Text    │  │  Vision   │  │   Audio   │            │  │
│  │            │           │  │  Decoder  │  │  Decoder  │            │  │
│  │            └───────────┘  └─────┬─────┘  └─────┬─────┘            │  │
│  │                                 │              │                  │  │
│  │                                 ▼              ▼                  │  │
│  │                             Image URL      Audio URL              │  │
│  │                               (S3)           (S3)                 │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                                 Response
                      (Text / Image URL / Audio URL)
| Component | GPU | VRAM |
|---|---|---|
| Vision Encoder | 1x | ~8GB |
| Audio Encoder | (shared) | ~4GB |
| LLM (8B) | 1x | ~16GB |
| Vision Decoder | 1x | ~16GB |
| Audio Decoder | (shared) | ~4GB |
| Total | 3x | ~48GB |
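As a quick sanity check, the total in the last row is the sum of the per-component figures. The snippet below copies the approximate values from the table; it does not model how components are placed on the three GPUs, since the table does not say which GPU the "(shared)" components use.

```python
# Approximate VRAM per component in GB, copied from the table above.
vram_gb = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

total_gb = sum(vram_gb.values())
print(f"Total: ~{total_gb}GB")  # prints: Total: ~48GB
```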
| Parameter | Description | Default |
|---|---|---|
| chat_template_kwargs.skip_reasoning | Skip reasoning | true |
| max_tokens | Max output tokens | - |
| temperature | Sampling temperature | 0.7 |
| tools | Required for image generation | - |
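The parameters in the table can be collected in one place before calling `client.chat.completions.create()`. The sketch below is one way to do that, assuming the defaults listed in the table and the `track_b_model` name used in the examples. `build_request_kwargs` is a hypothetical helper, not part of OmniServe or the OpenAI client.

```python
def build_request_kwargs(messages, max_tokens, temperature=0.7,
                         skip_reasoning=True, tools=None):
    """Assemble keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": "track_b_model",
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        # The OpenAI Python client forwards extra_body fields in the request body.
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": skip_reasoning}},
    }
    if tools is not None:
        # Per the table, tools are only required for image generation.
        kwargs["tools"] = tools
    return kwargs


kwargs = build_request_kwargs(
    [{"role": "user", "content": "Hello"}], max_tokens=256
)
# response = client.chat.completions.create(**kwargs)
```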
Required for image/audio generation:
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
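Since image and audio generation depend on these four settings, it can help to verify them before starting the stack. This is a minimal sketch, assuming the values have been loaded into the process environment (for example, exported from `.env` by your own tooling). `missing_s3_vars` is a hypothetical helper and only checks that each variable is non-empty, not that the credentials are valid.

```python
import os

REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_vars(env=None):
    """Return the names of required S3 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]


missing = missing_s3_vars()
if missing:
    print("Missing S3 settings:", ", ".join(missing))
else:
    print("All S3 settings are present.")
```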
For more details, see the OmniServe documentation.
TBU (Technical Report)
For any other questions, please feel free to contact us at [email protected].
The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.