
pplx embed v1 0.6b

by perplexity-ai

Open source · 601k downloads · 204 likes

Rating: 2.9 (204 reviews) · Embedding · API & Local
About

pplx embed v1 0.6b generates dense, contextual text embeddings optimized for large-scale semantic search. The family comes in two variants: one for standalone embeddings (queries or documents) and one for document chunks in RAG systems, where surrounding context matters. Unlike models that require a prefixed instruction, it embeds the target text directly, which simplifies indexing pipelines and avoids embedding variations caused by prompt changes. Its embeddings are unnormalized and quantized to int8, so they must be compared using cosine similarity. Typical applications include document retrieval, text classification, and knowledge base enrichment.

Documentation


pplx-embed-v1: Diffusion-Pretrained Dense and Contextual Embeddings

pplx-embed-v1 and pplx-embed-context-v1 are state-of-the-art text embedding models optimized for real-world, web-scale retrieval tasks.

  • Use pplx-embed-v1 for independent text embedding (queries, documents, semantic search)
  • Use pplx-embed-context-v1 for document chunks in RAG systems where surrounding context matters

[!IMPORTANT] pplx-embed-v1 and pplx-embed-context-v1 natively produce unnormalized int8-quantized embeddings. Ensure that you compare them via cosine similarity.
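As a minimal sketch of the comparison described in the note above, the snippet below computes pairwise cosine similarity with NumPy. The `cosine_similarity` helper and the toy int8 vectors are illustrative assumptions, not part of the model's API. The cast to float32 matters because arithmetic carried out directly in int8 would wrap around.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a and rows of b."""
    # Cast first: arithmetic performed in int8 would silently wrap around.
    a = a.astype(np.float32)
    b = b.astype(np.float32)
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

# Toy int8 vectors standing in for real model output (illustrative only).
query = np.array([[100, -50, 25, 0]], dtype=np.int8)
docs = np.array(
    [
        [100, -50, 25, 0],   # same direction as the query
        [50, -25, 12, 0],    # nearly the same direction, smaller magnitude
        [-100, 50, -25, 0],  # opposite direction
    ],
    dtype=np.int8,
)

scores = cosine_similarity(query, docs)  # shape: (1, 3)
best = int(np.argmax(scores[0]))         # index of the closest document
```

Because cosine similarity divides out vector length, the second document scores almost as high as the first despite its smaller magnitude, which is why this metric suits unnormalized embeddings.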


Models

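| Model | Dimensions | Context | MRL | Quantization | Instruction | Pooling |
|---|---|---|---|---|---|---|
| pplx-embed-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-0.6B | 1024 | 32K | Yes | INT8/BINARY | No | Mean |
| pplx-embed-context-v1-4B | 2560 | 32K | Yes | INT8/BINARY | No | Mean |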
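
The MRL column marks the models as supporting Matryoshka-style embeddings. Assuming this follows the usual Matryoshka Representation Learning convention, where a leading slice of the vector is itself a usable embedding, truncation could look like the sketch below. The `truncate` helper, the target size of 256, and the random stand-in vectors are assumptions for illustration; the paper is the place to check which sizes the models were actually trained to support.

```python
import numpy as np

def truncate(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep only the first `dims` components of each embedding."""
    if not 0 < dims <= embeddings.shape[1]:
        raise ValueError("dims must be between 1 and the embedding width")
    return embeddings[:, :dims]

# Random int8 vectors standing in for real 1024-dimensional model output.
rng = np.random.default_rng(0)
full = rng.integers(-128, 128, size=(5, 1024), dtype=np.int8)

small = truncate(full, 256)  # shape: (5, 256), still int8
```

Since the embeddings are compared with cosine similarity, the truncated vectors need no separate re-normalization step.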

All models are built on Qwen3 base models that received diffusion-based continued pre-training at Perplexity AI.

Many modern embedding models rely on instruction tuning, where users prepend an instruction string to the text being embedded. This can yield a 2%-3% lift on benchmarks, but it also introduces prompt-selection overhead and can make indexing pipelines brittle, because small instruction changes can shift the embedding space. We deliberately avoid this requirement: you can embed the text you want to index directly, without having to choose or maintain an instruction prefix.

Usage

Via API
Bash
curl -X POST https://api.perplexity.ai/v1/embeddings \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Scientists explore the universe driven by curiosity.",
      "Children learn through curious exploration.",
      "Historical discoveries began with curious questions.",
      "Animals use curiosity to adapt and survive.",
      "Philosophy examines the nature of curiosity."
    ],
    "model": "pplx-embed-v1-0.6b"
  }'
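
The same request can be prepared from Python. The sketch below uses only the standard library and mirrors the cURL call above; the `build_request` helper is illustrative. The response schema is not described here, so the snippet stops at decoding the JSON body rather than assuming field names.

```python
import json
import urllib.request

API_URL = "https://api.perplexity.ai/v1/embeddings"

def build_request(texts: list[str], api_key: str,
                  model: str = "pplx-embed-v1-0.6b") -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    payload = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_request(
    ["Scientists explore the universe driven by curiosity."],
    api_key="YOUR_API_KEY",
)

# Sending requires a valid key and network access:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
```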
Using SentenceTransformers
Python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "perplexity-ai/pplx-embed-v1-0.6B",
    trust_remote_code=True
)

texts = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
    "Animals use curiosity to adapt and survive.",
    "Philosophy examines the nature of curiosity.",
]

int8_embeddings = model.encode(texts)  # Shape: (5, 1024), quantized to int8
binary_embeddings = model.encode(texts, quantization="binary")  # Shape: (5, 1024), quantized to binary
Using ONNX models
Python

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("perplexity-ai/pplx-embed-v1-0.6b", trust_remote_code=True)
# Assumes the ONNX file from the model repository is available at this local path.
session = ort.InferenceSession("onnx/model.onnx")

texts = [
    "Scientists explore the universe driven by curiosity.",
    "Children learn through curious exploration.",
    "Historical discoveries began with curious questions.",
    "Animals use curiosity to adapt and survive.",
    "Philosophy examines the nature of curiosity.",
]

tokenized = tokenizer(
    texts,
    padding=True,
    truncation=True,
    return_tensors="np"
)

onnx_inputs = {
    "input_ids": tokenized["input_ids"].astype(np.int64),
    "attention_mask": tokenized["attention_mask"].astype(np.int64),
}

# Run inference
onnx_embeddings = session.run([out.name for out in session.get_outputs()], onnx_inputs)

# ONNX produces both int8 and binary precision embeddings:
int8_embeddings = onnx_embeddings[2]
binary_embeddings = onnx_embeddings[3]
packed_embeddings = np.packbits(binary_embeddings != -1, axis=-1)
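
The text above does not say how packed binary embeddings should be compared. A common choice, assumed here, is Hamming distance: the number of bit positions at which two vectors differ. The sketch below uses random -1/+1 vectors in place of real `binary_embeddings`, and the `hamming_distance` helper is illustrative.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Count differing bits between matching rows of two packed uint8 arrays."""
    differing = np.bitwise_xor(a, b)  # 1-bits mark positions that disagree
    return np.unpackbits(differing, axis=-1).sum(axis=-1)

# Random -1/+1 vectors standing in for the ONNX binary output.
rng = np.random.default_rng(0)
binary = rng.choice(np.array([-1, 1], dtype=np.int8), size=(2, 1024))

# Same packing step as above: 1024 signs become 128 bytes per vector.
packed = np.packbits(binary != -1, axis=-1)  # shape: (2, 128)

distance = hamming_distance(packed[:1], packed[1:])  # shape: (1,)
```

A smaller distance means more agreeing signs, so ranking by ascending Hamming distance plays the role that descending cosine similarity plays for the int8 vectors.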
Using Text Embeddings Inference (TEI)

[!NOTE] Text Embeddings Inference v1.9.2+ is required.

[!IMPORTANT] Currently, only int8-quantized embeddings are available via TEI. Remember to use cosine similarity with unnormalized int8 embeddings.

  • CPU w/ Candle:
Bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id perplexity-ai/pplx-embed-v1-0.6B --dtype float32
  • CPU w/ ORT (ONNX Runtime):
Bash
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 --model-id onnx-community/pplx-embed-v1-0.6B --dtype float32
  • GPU w/ CUDA:
Bash
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 --model-id perplexity-ai/pplx-embed-v1-0.6B --dtype float32

If you hit OOM during warmup, lower --max-batch-tokens and --max-client-batch-size. Set --max-batch-tokens to max_sequence_length × batch_size (e.g., 2048 tokens × 8 sequences = 16384).
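
The sizing rule above is simple arithmetic, sketched here as a small helper. The `max_batch_tokens` function is hypothetical, and setting `--max-client-batch-size` equal to the batch size is an assumption made for illustration rather than a documented recommendation.

```python
def max_batch_tokens(max_sequence_length: int, batch_size: int) -> int:
    """Token budget for one batch: sequence length times number of sequences."""
    if max_sequence_length <= 0 or batch_size <= 0:
        raise ValueError("both arguments must be positive")
    return max_sequence_length * batch_size

batch_size = 8
budget = max_batch_tokens(2048, batch_size)  # 2048 * 8 = 16384

# Flags to append to the docker commands above (client batch size is an assumption).
flags = f"--max-batch-tokens {budget} --max-client-batch-size {batch_size}"
```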

Alternatively, when running on CUDA, you can use a container built for your GPU's architecture (compute capability) instead of cuda-1.9. The cuda-1.9 image bundles binaries for Turing, Ampere, Hopper, and Blackwell, so a dedicated image such as ampere-1.9 is lighter.

You can then send requests to the /embed endpoint with cURL:

Bash
curl http://localhost:8080/embed \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      "Scientists explore the universe driven by curiosity.",
      "Children learn through curious exploration.",
      "Historical discoveries began with curious questions.",
      "Animals use curiosity to adapt and survive.",
      "Philosophy examines the nature of curiosity."
    ],
    "normalize": false
  }'
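
A Python client for the same endpoint might look like the sketch below. It assumes the server from the docker commands above is reachable on port 8080 and that the /embed response body is a JSON array containing one list of numbers per input; `build_embed_request` and `parse_embeddings` are illustrative helpers, not part of TEI.

```python
import json
import urllib.request

import numpy as np

TEI_URL = "http://localhost:8080/embed"  # port taken from the docker commands

def build_embed_request(texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request matching the cURL example."""
    payload = json.dumps({"inputs": texts, "normalize": False}).encode("utf-8")
    return urllib.request.Request(
        TEI_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(body: bytes) -> np.ndarray:
    """Decode a response body, assumed to be a JSON list of vectors."""
    return np.asarray(json.loads(body), dtype=np.float32)

request = build_embed_request(["Children learn through curious exploration."])

# With a server running:
# with urllib.request.urlopen(request) as response:
#     embeddings = parse_embeddings(response.read())

# Offline demonstration with a hand-written body in the assumed format.
embeddings = parse_embeddings(b"[[1, -2, 3], [4, 5, -6]]")  # shape: (2, 3)
```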

Technical Details

For comprehensive technical details and evaluation results, see our paper on arXiv: https://arxiv.org/abs/2602.11151.

Capabilities & Tags
sentence-transformers, onnx, safetensors, bidirectional_pplx_qwen3, feature-extraction, sentence-similarity, mteb, custom_code, multilingual, text-embeddings-inference
Specifications
Category: Embedding
Access: API & Local
License: Open Source
Pricing: Open Source
Parameters: 0.6B