by perplexity-ai
The pplx-embed-v1-0.6B model generates dense, contextual text embeddings optimized for large-scale semantic search. It comes in two versions: one for independent embeddings (suited to queries or documents) and one for document chunks in RAG systems, where the surrounding context is crucial. Unlike models that require prefixed instructions, it lets you embed the target text directly, which simplifies indexing pipelines and avoids embedding shifts caused by prompt changes. Its embeddings are unnormalized and int8-quantized, so they must be compared via cosine similarity to get accurate results. The model is positioned as a robust and efficient option for applications such as document retrieval, text classification, and knowledge-base enrichment.
pplx-embed-v1: Diffusion-Pretrained Dense and Contextual Embeddings
pplx-embed-v1 and pplx-embed-context-v1 are state-of-the-art text embedding models optimized for real-world, web-scale retrieval tasks.
- pplx-embed-v1 for independent text embedding (queries, documents, semantic search)
- pplx-embed-context-v1 for document chunks in RAG systems where surrounding context matters

> [!IMPORTANT]
> pplx-embed-v1 and pplx-embed-context-v1 natively produce unnormalized int8-quantized embeddings. Ensure that you compare them via cosine similarity.
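
As a minimal sketch of the comparison described above, the snippet below computes pairwise cosine similarity with NumPy. The helper name `cosine_similarity_matrix` and the toy vectors are illustrative assumptions, not model output; the cast to float32 is there because products of int8 values would otherwise overflow.

```python
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    # Cast first: dot products computed in int8 would overflow.
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Guard against all-zero rows to avoid division by zero.
    a_unit = a / np.maximum(a_norm, 1e-12)
    b_unit = b / np.maximum(b_norm, 1e-12)
    return a_unit @ b_unit.T

# Toy int8 vectors standing in for real embeddings (hypothetical values).
query = np.array([[10, -20, 30, 40]], dtype=np.int8)
docs = np.array(
    [
        [20, -40, 60, 80],    # same direction as the query, larger magnitude
        [-10, 20, -30, -40],  # opposite direction
    ],
    dtype=np.int8,
)

scores = cosine_similarity_matrix(query, docs)
print(scores)
```

Because the vectors are unnormalized, a raw dot product would favor the first document partly because of its larger magnitude; cosine similarity removes that effect.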

| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|---|---|---|---|---|---|---|
| pplx-embed-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
All models are built on Qwen3 base models that received diffusion-based continued pre-training at Perplexity AI.
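
The MRL column in the table can be illustrated with a short sketch. It assumes that MRL refers to Matryoshka Representation Learning, meaning the leading components of an embedding remain usable on their own; the supported truncation sizes are not stated here, so the value 256 and the helper `truncate_embeddings` are hypothetical. The array is random placeholder data, not model output.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` components of each embedding row."""
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError("dim must be between 1 and the full embedding width")
    return embeddings[:, :dim]

# Placeholder int8 data shaped like five 1024-dimensional embeddings.
rng = np.random.default_rng(0)
full = rng.integers(-127, 127, size=(5, 1024), dtype=np.int8)

small = truncate_embeddings(full, 256)
print(full.nbytes, small.nbytes)  # storage in bytes before and after truncation
```

Truncated vectors should still be compared with cosine similarity, since truncation does not normalize them.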
Many modern embedding models rely on instruction tuning, where users prepend an instruction string to the text being embedded. This can yield a 2%-3% lift on benchmarks, but it also introduces prompt-selection overhead and can make indexing pipelines brittle (small instruction changes can shift embedding space). We deliberately avoid this requirement: you can embed the text you want to index directly, without having to choose or maintain an instruction prefix.
```bash
curl -X POST https://api.perplexity.ai/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Scientists explore the universe driven by curiosity.",
      "Children learn through curious exploration.",
      "Historical discoveries began with curious questions.",
      "Animals use curiosity to adapt and survive.",
      "Philosophy examines the nature of curiosity."
    ],
    "model": "pplx-embed-v1-0.6b"
  }'
```
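
The same request can be built from Python. This is a minimal sketch using only the standard library; the URL, headers, and payload fields mirror the cURL call above, while the helper name `build_embeddings_request` is an assumption made for illustration. The response schema is not described here, so the sketch stops at constructing the request.

```python
import json
import urllib.request

API_URL = "https://api.perplexity.ai/v1/embeddings"

def build_embeddings_request(texts, model, api_key):
    """Build (but do not send) a POST request equivalent to the cURL call."""
    payload = json.dumps({"input": list(texts), "model": model}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_embeddings_request(
    ["Scientists explore the universe driven by curiosity."],
    model="pplx-embed-v1-0.6b",
    api_key="YOUR_API_KEY",
)

# To send it (requires a valid key and network access):
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```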
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "perplexity-ai/pplx-embed-v1-0.6B",
    trust_remote_code=True,
)

texts = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
    "Animals use curiosity to adapt and survive.",
    "Philosophy examines the nature of curiosity.",
]

embeddings = model.encode(texts)  # Shape: (5, 1024), quantized to int8
embeddings = model.encode(texts, quantization="binary")  # Shape: (5, 1024), quantized to binary
```
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("perplexity-ai/pplx-embed-v1-0.6b", trust_remote_code=True)
session = ort.InferenceSession("onnx/model.onnx")

texts = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
    "Animals use curiosity to adapt and survive.",
    "Philosophy examines the nature of curiosity.",
]

tokenized = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="np",
)

onnx_inputs = {
    "input_ids": tokenized["input_ids"].astype(np.int64),
    "attention_mask": tokenized["attention_mask"].astype(np.int64),
}

# Run inference
onnx_embeddings = session.run([out.name for out in session.get_outputs()], onnx_inputs)

# ONNX produces both int8 and binary precision embeddings:
int8_embeddings = onnx_embeddings[2]
binary_embeddings = onnx_embeddings[3]

# Pack the binary embeddings into bits: entries equal to -1 become 0, all others 1
packed_embeddings = np.packbits(binary_embeddings != -1, axis=-1)
```
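
Once embeddings are bit-packed as above, they can be compared with Hamming distance (XOR followed by a bit count). The sketch below is illustrative only: Hamming distance is a common choice for binary codes, but this document does not say which metric is recommended for the binary embeddings, and the function name and toy +1/-1 vectors are assumptions.

```python
import numpy as np

def hamming_distance_matrix(a_packed: np.ndarray, b_packed: np.ndarray) -> np.ndarray:
    """Pairwise Hamming distance between rows of two bit-packed uint8 arrays."""
    # XOR every row of a with every row of b: shape (n_a, n_b, n_bytes).
    xor = np.bitwise_xor(a_packed[:, None, :], b_packed[None, :, :])
    # Unpack each byte into bits and count the positions that differ.
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

# Toy +1/-1 vectors standing in for binary embeddings (hypothetical values).
a = np.array([[1, -1, 1, 1, -1, -1, 1, -1]], dtype=np.int8)
b = np.array(
    [
        [1, -1, 1, 1, -1, -1, 1, -1],   # identical to a
        [-1, 1, -1, -1, 1, 1, -1, 1],   # every component flipped
        [1, -1, 1, 1, -1, -1, -1, 1],   # last two components flipped
    ],
    dtype=np.int8,
)

# Same packing convention as the ONNX example: -1 maps to bit 0, otherwise 1.
a_packed = np.packbits(a != -1, axis=-1)
b_packed = np.packbits(b != -1, axis=-1)

distances = hamming_distance_matrix(a_packed, b_packed)
print(distances)
```

A smaller distance means more similar vectors, so ranking by ascending distance plays the role that descending cosine similarity plays for the int8 embeddings.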
> [!NOTE]
> Text Embeddings Inference v1.9.2+ is required.

> [!IMPORTANT]
> Currently, only int8-quantized embeddings are available via TEI. Remember to use cosine similarity with unnormalized int8 embeddings.
```bash
# CPU
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id perplexity-ai/pplx-embed-v1-0.6B --dtype float32

# CPU, loading the onnx-community repository
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id onnx-community/pplx-embed-v1-0.6B --dtype float32

# GPU (CUDA)
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id perplexity-ai/pplx-embed-v1-0.6B --dtype float32
```
If you hit OOM during warmup, lower --max-batch-tokens and --max-client-batch-size. Set --max-batch-tokens to max_sequence_length × batch_size (e.g., 2048 tokens × 8 sequences = 16384).
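
The sizing rule above is a single multiplication; a tiny helper makes it easy to try other combinations. The function name `max_batch_tokens` is hypothetical, and the numbers are the ones from the example in the text.

```python
def max_batch_tokens(max_sequence_length: int, batch_size: int) -> int:
    """Token budget for --max-batch-tokens: sequence length times batch size."""
    return max_sequence_length * batch_size

tokens = max_batch_tokens(2048, 8)
print(f"--max-batch-tokens {tokens}")
```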
Alternatively, when running on CUDA, you can use the container built for your GPU's architecture / compute capability instead of cuda-1.9. The cuda-1.9 image bundles the binaries for Turing, Ampere, Hopper and Blackwell, so a dedicated container (e.g., ampere-1.9) will be lighter.
You can then send requests to the /embed endpoint via cURL:
```bash
curl http://0.0.0.0:8080/embed \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      "Scientists explore the universe driven by curiosity.",
      "Children learn through curious exploration.",
      "Historical discoveries began with curious questions.",
      "Animals use curiosity to adapt and survive.",
      "Philosophy examines the nature of curiosity."
    ],
    "normalize": false
  }'
```
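
A Python client for the same endpoint might look like the sketch below. It uses only the standard library plus NumPy, and it assumes that /embed answers with a JSON array containing one list of numbers per input; that response shape and the helper names are assumptions, so check them against your TEI version.

```python
import json
import urllib.request

import numpy as np

TEI_URL = "http://0.0.0.0:8080/embed"

def build_embed_payload(texts) -> bytes:
    """JSON body matching the cURL request: raw inputs, no normalization."""
    return json.dumps({"inputs": list(texts), "normalize": False}).encode("utf-8")

def parse_embed_response(body: bytes) -> np.ndarray:
    """Assumed response shape: a JSON array with one embedding list per input."""
    return np.asarray(json.loads(body), dtype=np.float32)

def embed(texts, url: str = TEI_URL) -> np.ndarray:
    """Send texts to a running TEI server and return an (n, dim) array."""
    request = urllib.request.Request(
        url,
        data=build_embed_payload(texts),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return parse_embed_response(response.read())

# Usage (requires the container above to be running):
# vectors = embed(["Children learn through curious exploration."])
```

Keeping "normalize" set to false preserves the unnormalized values, so downstream comparisons should still use cosine similarity.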
For comprehensive technical details and evaluation results, see our paper on arXiv: https://arxiv.org/abs/2602.11151.