by naver-hyperclovax
HyperCLOVAX SEED Omni 8B is a unified multimodal AI model that handles text, images, and speech within a single architecture. It supports bidirectional interaction across these modalities, including text-to-image generation and editing, speech recognition and translation, and speech synthesis, all within a 32,000-token context window. Designed as an early step toward Korean-centered "Any-to-Any" intelligence, it targets multilingual and multimodal tasks. Use cases include visual content creation, voice assistance, image analysis, and audio transcription, in both professional and creative settings. Its distinguishing features are the unified design and the semantic alignment between modalities.

HyperCLOVA X SEED 8B Omni is a unified multimodal model that brings text, vision, and speech together, based on an auto-regressive Transformer architecture, enabling consistent multimodal understanding and generation. SEED 8B Omni aligns textual, visual, and audio representations in a shared semantic space and supports bidirectional interactions across modalities, including established text capabilities as well as vision–language QA, text-to-image generation and editing, speech recognition and translation, and text-to-speech, within a 32K context window. As an early pathfinding milestone of HyperCLOVA X toward Any-to-Any-Korean-First intelligence, SEED 8B Omni serves as a practical exploration of unified multimodal modeling and provides a reference point for future development and scaling.



We provide OmniServe, a production-ready multimodal inference system with an OpenAI-compatible API.
# Clone OmniServe
git clone https://github.com/NAVER-Cloud-HyperCLOVA-X/OmniServe.git
cd OmniServe
# Install dependencies
pip install huggingface_hub safetensors torch openai easydict
# Download model (~16GB)
huggingface-cli download naver-hyperclovax/HyperCLOVAX-SEED-Omni-8B \
    --local-dir ./models/HyperCLOVAX-SEED-Omni-8B
# Convert model to component format
python convert_model.py \
    --input ./models/HyperCLOVAX-SEED-Omni-8B \
    --output ./track_b \
    --track b
# Configure environment
cp .env.example .env
# Edit .env with model paths and S3 credentials
# Build and run (Track B only - OMNI model)
docker compose --profile track-b build
docker compose --profile track-b up -d
# Wait for model loading (~5 minutes)
docker compose logs -f omni
# Note: To run both VLM and OMNI models together:
# docker compose --profile track-a --profile track-b up -d
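Instead of watching the logs by hand, a script can poll the server until it answers. The sketch below is a minimal illustration, not part of OmniServe: `http_probe` and `wait_until_ready` are hypothetical helper names, and the `/b/v1/models` route is an assumption based on what OpenAI-compatible servers usually expose.

```python
import time
import urllib.error
import urllib.request

# Assumption: the server lists models at <base_url>/models, as OpenAI-compatible
# servers usually do. Adjust the URL if OmniServe exposes a different route.
MODELS_URL = "http://localhost:8000/b/v1/models"


def http_probe(url=MODELS_URL, timeout=5.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=http_probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True; give up after `attempts` tries."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"server not ready after {attempts} attempts")
```

With the defaults, `wait_until_ready()` retries for roughly five minutes (60 attempts, 5 seconds apart), which matches the loading time noted above.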
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/b/v1",
    api_key="not-needed"
)

# Image understanding
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "What is in this image?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
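The image understanding example above passes a public URL. For a local file, one common approach with OpenAI-style APIs is a base64 `data:` URL. Whether OmniServe accepts `data:` URLs is an assumption that should be verified against its documentation; `to_data_url` and `image_message` are hypothetical helper names.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a data: URL, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def image_message(image_url: str, question: str) -> dict:
    """Build a user message in the same shape as the example above."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ],
    }
```

A call would then look like `image_message(to_data_url(open("photo.png", "rb").read(), "photo.png"), "What is in this image?")`, with the result placed in the `messages` list.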
# Text-to-image generation (the result is returned through a tool call)
import json

SYSTEM_PROMPT = """You are an AI assistant that generates images. When asked to draw or create an image, you MUST use the t2i_model_generation tool to generate the image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Draw a sunset over mountains"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
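The extraction step above assumes the tool call exists and that its arguments are valid JSON. A slightly more defensive version might look like the sketch below. `first_tool_argument` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in that mimics the attributes used here, not a real API response.

```python
import json
from types import SimpleNamespace


def first_tool_argument(response, tool_name, key):
    """Return arguments[key] of the first call to tool_name, or None if absent."""
    message = response.choices[0].message
    for call in getattr(message, "tool_calls", None) or []:
        if call.function.name != tool_name:
            continue
        try:
            args = json.loads(call.function.arguments)
        except json.JSONDecodeError:
            return None  # arguments are not valid JSON (for example, cut off early)
        return args.get(key) if isinstance(args, dict) else None
    return None


# Stand-in response with the same attribute layout as the example above.
fake = SimpleNamespace(choices=[SimpleNamespace(message=SimpleNamespace(
    tool_calls=[SimpleNamespace(function=SimpleNamespace(
        name="t2i_model_generation",
        arguments='{"discrete_image_token": "placeholder"}'))]))])
print(first_tool_argument(fake, "t2i_model_generation", "discrete_image_token"))
```

Returning `None` instead of raising lets the caller decide whether to retry, for instance with a larger `max_tokens`.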
# Text to audio
import base64

# Prompt should explicitly request speech/audio output
response = client.chat.completions.create(
    model="track_b_model",
    messages=[{
        "role": "user",
        "content": "Read this text aloud in a cheerful female voice:\nHello! How are you today?"
    }],
    max_tokens=1000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
# Audio understanding (the audio URL is base64-encoded into the data field)
import base64

audio_url = "https://example.com/audio.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "What is being said?"}
            ]
        }
    ],
    max_tokens=256,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
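In these examples the audio `data` field carries a base64-encoded URL rather than raw audio bytes, in both directions. The round trip can be sketched as follows; `encode_audio_url`, `decode_audio_url`, and `audio_part` are hypothetical helper names introduced for illustration.

```python
import base64


def encode_audio_url(url: str) -> str:
    """Base64-encode an audio URL for the input_audio 'data' field."""
    return base64.b64encode(url.encode("utf-8")).decode("ascii")


def decode_audio_url(data: str) -> str:
    """Reverse of encode_audio_url, for message.audio.data in responses."""
    return base64.b64decode(data).decode("utf-8")


def audio_part(url: str, fmt: str = "mp3") -> dict:
    """Build an input_audio content part in the shape used by the examples."""
    return {
        "type": "input_audio",
        "input_audio": {"data": encode_audio_url(url), "format": fmt},
    }


encoded = encode_audio_url("https://example.com/audio.mp3")
print(decode_audio_url(encoded))
```

Keeping both directions in one place makes it harder to pass raw audio bytes by mistake where the examples expect an encoded URL.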
# Video understanding (the video URL is passed as an image_url part)
response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/video.mp4"}},
                {"type": "text", "text": "Describe this video."}
            ]
        }
    ],
    max_tokens=512,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
print(response.choices[0].message.content)
# Image editing (source image plus instruction; the result is returned through a tool call)
import json

SYSTEM_PROMPT = """You are an AI assistant that transforms images. When asked to transform, edit, or stylize an image, you MUST use the t2i_model_generation tool to generate the new image. Always respond by calling the tool."""

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Transform to watercolor style"}
            ]
        }
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "t2i_model_generation",
            "description": "Generates an RGB image based on the provided discrete image representation.",
            "parameters": {
                "type": "object",
                "required": ["discrete_image_token"],
                "properties": {
                    "discrete_image_token": {
                        "type": "string",
                        "description": "A serialized string of discrete vision tokens, encapsulated by special tokens. The format must be strictly followed: <|discrete_image_start|><|vision_ratio_4:3|><|vision_token|><|visionaaaaa|><|visionbbbbb|>... <|visionzzzzz|><|vision_eol|><|vision_eof|><|discrete_image_end|>.",
                        "minLength": 1
                    }
                }
            }
        }
    }],
    max_tokens=7000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
if response.choices[0].message.tool_calls:
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    print(f"Generated image: {args['discrete_image_token']}")
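Image generation and image editing use the same tool definition and differ only in the system prompt and in whether a source image is attached. One way to avoid repeating the schema is sketched below. `T2I_TOOL` and `image_request` are hypothetical names, and the tool description is shortened here for space; in practice the full description string from the examples above should be copied, since the model may rely on it.

```python
T2I_TOOL = {
    "type": "function",
    "function": {
        "name": "t2i_model_generation",
        "description": "Generates an RGB image based on the provided discrete image representation.",
        "parameters": {
            "type": "object",
            "required": ["discrete_image_token"],
            "properties": {
                "discrete_image_token": {
                    "type": "string",
                    # Shortened for this sketch; use the full format description in real requests.
                    "description": "A serialized string of discrete vision tokens, encapsulated by special tokens.",
                    "minLength": 1,
                }
            },
        },
    },
}


def image_request(system_prompt, prompt, source_image_url=None, max_tokens=7000):
    """Build keyword arguments for generation (no source image) or editing (with one)."""
    if source_image_url is None:
        content = prompt
    else:
        content = [
            {"type": "image_url", "image_url": {"url": source_image_url}},
            {"type": "text", "text": prompt},
        ]
    return {
        "model": "track_b_model",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": content},
        ],
        "tools": [T2I_TOOL],
        "max_tokens": max_tokens,
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": True}},
    }
```

The result can be unpacked directly into the client call, for example `client.chat.completions.create(**image_request(SYSTEM_PROMPT, "Transform to watercolor style", "https://example.com/photo.jpg"))`.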
# Audio to audio
import base64

# Input audio (URL encoded as base64)
audio_url = "https://example.com/input.mp3"
audio_data = base64.b64encode(audio_url.encode()).decode()

response = client.chat.completions.create(
    model="track_b_model",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "input_audio", "input_audio": {"data": audio_data, "format": "mp3"}},
                {"type": "text", "text": "Listen to this and respond with speech"}
            ]
        }
    ],
    max_tokens=2000,
    extra_body={"chat_template_kwargs": {"skip_reasoning": True}}
)
if response.choices[0].message.audio:
    audio_url = base64.b64decode(response.choices[0].message.audio.data).decode()
    print(f"Generated audio: {audio_url}")
# Image understanding
curl -X POST http://localhost:8000/b/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "track_b_model",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image."}
        ]}],
        "max_tokens": 256,
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
    }'
# Text to audio
curl -X POST http://localhost:8000/b/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "track_b_model",
        "messages": [{"role": "user", "content": "Say hello"}],
        "max_tokens": 1000,
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": true}}
    }'
                               User Request
                         (Image/Audio/Video/Text)
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                                OmniServe                                │
│                       POST /b/v1/chat/completions                       │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                        [1] INPUT ENCODING                         │  │
│  │                                                                   │  │
│  │        ┌─────────────────┐             ┌─────────────────┐        │  │
│  │        │ Vision Encoder  │             │  Audio Encoder  │        │  │
│  │        └────────┬────────┘             └────────┬────────┘        │  │
│  │                 │                               │                 │  │
│  │                 └───────────────┬───────────────┘                 │  │
│  │                                 │ embeddings                      │  │
│  └─────────────────────────────────┼─────────────────────────────────┘  │
│                                    ▼                                    │
│                            ┌───────────────┐                            │
│                            │   LLM (8B)    │◀──── text                  │
│                            └───────┬───────┘                            │
│                                    │                                    │
│  ┌─────────────────────────────────┼─────────────────────────────────┐  │
│  │                        [2] OUTPUT DECODING                        │  │
│  │                                 │                                 │  │
│  │                  ┌──────────────┼──────────────┐                  │  │
│  │                  ▼              ▼              ▼                  │  │
│  │            ┌───────────┐  ┌───────────┐  ┌───────────┐            │  │
│  │            │   Text    │  │  Vision   │  │   Audio   │            │  │
│  │            │           │  │  Decoder  │  │  Decoder  │            │  │
│  │            └───────────┘  └─────┬─────┘  └─────┬─────┘            │  │
│  │                                 │              │                  │  │
│  │                                 ▼              ▼                  │  │
│  │                             Image URL      Audio URL              │  │
│  │                               (S3)           (S3)                 │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
                                     │
                                     ▼
                                  Response
                      (Text / Image URL / Audio URL)
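On the client side, the three output paths in the diagram show up in different fields of the response message: text in `content`, images through a `t2i_model_generation` tool call, and audio in `message.audio`. A minimal dispatcher based on the examples in this document might look like this. `response_kind` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins rather than real API responses.

```python
from types import SimpleNamespace


def response_kind(message) -> str:
    """Classify a chat message as 'image', 'audio', or 'text' by which field is set."""
    if getattr(message, "tool_calls", None):
        return "image"  # in the examples here, the only tool is t2i_model_generation
    if getattr(message, "audio", None):
        return "audio"
    return "text"


# Stand-in messages that mimic only the attributes inspected above.
text_msg = SimpleNamespace(content="A cat on a sofa.", tool_calls=None, audio=None)
audio_msg = SimpleNamespace(content=None, tool_calls=None, audio=SimpleNamespace(data="..."))
image_msg = SimpleNamespace(content=None, tool_calls=[SimpleNamespace()], audio=None)

for msg in (text_msg, audio_msg, image_msg):
    print(response_kind(msg))
```

Checking `tool_calls` first is a design choice for this sketch; an application that registers other tools would also need to compare the tool name.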
| Component | GPU | VRAM |
|---|---|---|
| Vision Encoder | 1x | ~8GB |
| Audio Encoder | (shared) | ~4GB |
| LLM (8B) | 1x | ~16GB |
| Vision Decoder | 1x | ~16GB |
| Audio Decoder | (shared) | ~4GB |
| Total | 3x | ~48GB |
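The total in the table is the sum of the per-component estimates, which can be checked with a few lines. The dictionary below simply restates the approximate figures from the table; it is not a measurement.

```python
# Approximate VRAM per component in GB, copied from the table above.
VRAM_GB = {
    "Vision Encoder": 8,
    "Audio Encoder": 4,
    "LLM (8B)": 16,
    "Vision Decoder": 16,
    "Audio Decoder": 4,
}

total_gb = sum(VRAM_GB.values())
print(f"Total VRAM: ~{total_gb}GB")  # Total VRAM: ~48GB
```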
| Parameter | Description | Default |
|---|---|---|
| chat_template_kwargs.skip_reasoning | Skip reasoning | true |
| max_tokens | Max output tokens | - |
| temperature | Sampling temperature | 0.7 |
| tools | Required for image generation | - |
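With the OpenAI Python client used in the examples, `max_tokens`, `temperature`, and `tools` are ordinary keyword arguments, while `skip_reasoning` is nested under `extra_body.chat_template_kwargs`. The sketch below collects them in one place; `chat_options` is a hypothetical helper, and its defaults mirror the table.

```python
def chat_options(max_tokens, temperature=0.7, skip_reasoning=True, tools=None):
    """Return keyword arguments for client.chat.completions.create()."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    options = {
        "max_tokens": max_tokens,
        "temperature": temperature,
        "extra_body": {"chat_template_kwargs": {"skip_reasoning": skip_reasoning}},
    }
    if tools:
        options["tools"] = tools  # only needed for image generation
    return options


print(chat_options(256))
```

A request then becomes `client.chat.completions.create(model="track_b_model", messages=messages, **chat_options(256))`.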
Required for image/audio generation:
NCP_S3_ENDPOINT=https://your-s3-endpoint.com
NCP_S3_ACCESS_KEY=your-access-key
NCP_S3_SECRET_KEY=your-secret-key
NCP_S3_BUCKET_NAME=your-bucket-name
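Before starting the containers, it can help to confirm that all four variables are set. The check below is a small sketch; `missing_s3_vars` is a hypothetical helper that reads the process environment by default and treats blank values as missing.

```python
import os

REQUIRED_S3_VARS = (
    "NCP_S3_ENDPOINT",
    "NCP_S3_ACCESS_KEY",
    "NCP_S3_SECRET_KEY",
    "NCP_S3_BUCKET_NAME",
)


def missing_s3_vars(env=None):
    """Return the names of required S3 variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_S3_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_s3_vars()
    if missing:
        print("Missing S3 settings:", ", ".join(missing))
```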
For more details, see OmniServe documentation.
TBU (Technical Report)
For any other questions, please feel free to contact us at [email protected].
The model is licensed under the HyperCLOVA X SEED 8B Omni Model License Agreement.