by Supabase
gte-small is a text embedding model that converts text into 384-dimensional vectors. Developed by Alibaba's DAMO Academy and based on the BERT architecture, it is the smallest model in the GTE family (0.07 GB) while remaining competitive with much larger models on the MTEB benchmark. It is suited to information retrieval, semantic similarity scoring between sentences, and result re-ranking, having been trained on a large corpus of relevance text pairs. It is intended for English text and accepts inputs of up to 512 tokens. ONNX weights make it usable from JavaScript through Transformers.js as well as from Python.
This is a fork of https://huggingface.co/thenlper/gte-small with ONNX weights added for compatibility with Transformers.js. See the JavaScript usage section below.
General Text Embeddings (GTE) model.
The GTE models are trained by Alibaba DAMO Academy. They are based mainly on the BERT framework and currently come in three sizes: GTE-large, GTE-base, and GTE-small. They are trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which lets them be applied to various downstream text embedding tasks, including information retrieval, semantic textual similarity, and text reranking.
The performance of the GTE models was compared with other popular text embedding models on the MTEB benchmark. For more detailed comparison results, please refer to the MTEB leaderboard.
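To make the retrieval and reranking use case concrete, here is a minimal sketch of ranking documents against a query by cosine similarity. The toy 3-dimensional vectors and the helper names `cosine` and `rank` are illustrative assumptions, not part of the model or any library; real gte-small embeddings have 384 dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real embeddings (hypothetical data).
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

results = rank(query, docs, top_k=2)
print(results)
```

In practice the vectors would come from one of the usage examples later in this card, and a vector database would typically replace the linear scan for large collections.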
| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) | Classification (12) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| gte-large | 0.67 | 1024 | 512 | 63.13 | 46.84 | 85.00 | 59.13 | 52.22 | 83.35 | 31.66 | 73.33 |
| gte-base | 0.22 | 768 | 512 | 62.39 | 46.2 | 84.57 | 58.61 | 51.14 | 82.3 | 31.17 | 73.01 |
| e5-large-v2 | 1.34 | 1024 | 512 | 62.25 | 44.49 | 86.03 | 56.61 | 50.56 | 82.05 | 30.19 | 75.24 |
| e5-base-v2 | 0.44 | 768 | 512 | 61.5 | 43.80 | 85.73 | 55.91 | 50.29 | 81.05 | 30.28 | 73.84 |
| gte-small | 0.07 | 384 | 512 | 61.36 | 44.89 | 83.54 | 57.7 | 49.46 | 82.07 | 30.42 | 72.31 |
| text-embedding-ada-002 | - | 1536 | 8192 | 60.99 | 45.9 | 84.89 | 56.32 | 49.25 | 80.97 | 30.8 | 70.93 |
| e5-small-v2 | 0.13 | 384 | 512 | 59.93 | 39.92 | 84.67 | 54.32 | 49.04 | 80.39 | 31.16 | 72.94 |
| sentence-t5-xxl | 9.73 | 768 | 512 | 59.51 | 43.72 | 85.06 | 56.42 | 42.24 | 82.63 | 30.08 | 73.42 |
| all-mpnet-base-v2 | 0.44 | 768 | 514 | 57.78 | 43.69 | 83.04 | 59.36 | 43.81 | 80.28 | 27.49 | 65.07 |
| sgpt-bloom-7b1-msmarco | 28.27 | 4096 | 2048 | 57.59 | 38.93 | 81.9 | 55.65 | 48.22 | 77.74 | 33.6 | 66.19 |
| all-MiniLM-L12-v2 | 0.13 | 384 | 512 | 56.53 | 41.81 | 82.41 | 58.44 | 42.69 | 79.8 | 27.9 | 63.21 |
| all-MiniLM-L6-v2 | 0.09 | 384 | 512 | 56.26 | 42.35 | 82.37 | 58.04 | 41.95 | 78.9 | 30.81 | 63.05 |
| contriever-base-msmarco | 0.44 | 768 | 512 | 56.00 | 41.1 | 82.54 | 53.14 | 41.88 | 76.51 | 30.36 | 66.68 |
| sentence-t5-base | 0.22 | 768 | 512 | 55.27 | 40.21 | 85.18 | 53.09 | 33.63 | 81.14 | 31.39 | 69.81 |
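
The Dimension column also determines how much space stored embeddings take up. As a rough worked example, the sketch below estimates raw vector storage per million embeddings, assuming 4-byte float32 values and ignoring any index or metadata overhead; the helper name `index_size_mb` is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32

def index_size_mb(num_vectors, dimension):
    """Raw storage in megabytes (10^6 bytes) for num_vectors embeddings."""
    return num_vectors * dimension * BYTES_PER_FLOAT32 / 1_000_000

# Dimensions taken from the table above.
models = [
    ("gte-small", 384),
    ("gte-base", 768),
    ("gte-large", 1024),
    ("text-embedding-ada-002", 1536),
]

for name, dim in models:
    size = index_size_mb(1_000_000, dim)
    print(f"{name}: {size:,.0f} MB per million vectors")
```

Under these assumptions, gte-small needs a quarter of the storage of a 1536-dimensional model for the same number of vectors.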
This model can be used with both Python and JavaScript.
Use with Transformers and PyTorch:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

tokenizer = AutoTokenizer.from_pretrained("Supabase/gte-small")
model = AutoModel.from_pretrained("Supabase/gte-small")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
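The `average_pool` step above averages only the token vectors that the attention mask marks as real, so padding does not dilute the embedding. The following dependency-free sketch shows the same idea on a single toy sequence; the 2-dimensional vectors are made up for illustration, and `l2_normalize` is a hypothetical helper standing in for `F.normalize`.

```python
import math

def average_pool(token_vectors, attention_mask):
    """Mean of the token vectors whose mask entry is 1; padding is skipped."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# One sequence with four positions; the last two are padding (toy data).
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = average_pool(hidden, mask)
print(pooled)  # [2.0, 3.0]: the padded positions are ignored

unit = l2_normalize(pooled)
print(unit)
```

Once embeddings have unit length, the dot product between two of them equals their cosine similarity, which is why the example above can score pairs with a plain matrix product.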
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['That is a happy person', 'That is a very happy person']
model = SentenceTransformer('Supabase/gte-small')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
This model can be used with JavaScript via Transformers.js.
Use with Deno or Supabase Edge Functions:
import { serve } from 'https://deno.land/[email protected]/http/server.ts'
import { env, pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/[email protected]'
// Configuration for Deno runtime
env.useBrowserCache = false;
env.allowLocalModels = false;
const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

serve(async (req) => {
  // Extract input string from JSON body
  const { input } = await req.json();

  // Generate the embedding from the user input
  const output = await pipe(input, {
    pooling: 'mean',
    normalize: true,
  });

  // Extract the embedding output
  const embedding = Array.from(output.data);

  // Return the embedding
  return new Response(
    JSON.stringify({ embedding }),
    { headers: { 'Content-Type': 'application/json' } }
  );
});
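The function above accepts a JSON body with an `input` field and responds with an `embedding` array. A client for it might look like the sketch below, written with the Python standard library. The URL is a hypothetical placeholder, and a deployed function may additionally require an authorization header, which is omitted here.

```python
import json
import urllib.request

# Hypothetical URL: replace with wherever the function is actually served.
FUNCTION_URL = "http://localhost:8000/"

def build_request(text, url=FUNCTION_URL):
    """Build a POST request carrying {"input": text} as JSON."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embedding(raw_body):
    """Extract the embedding list from the function's JSON response."""
    return json.loads(raw_body)["embedding"]

def embed(text, url=FUNCTION_URL):
    """Send text to the function and return its embedding as a list of floats."""
    with urllib.request.urlopen(build_request(text, url)) as response:
        return parse_embedding(response.read())
```

Calling `embed("Hello world")` against a running instance should return a list of 384 floats, matching the model's embedding dimension.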
Use within the browser (JavaScript Modules):
<script type="module">
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@xenova/[email protected]';
const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

// Generate the embedding from text
const output = await pipe('Hello world', {
  pooling: 'mean',
  normalize: true,
});

// Extract the embedding output
const embedding = Array.from(output.data);
console.log(embedding);
</script>
Use within Node.js or a web bundler (Webpack, etc.):
import { pipeline } from '@xenova/transformers';
const pipe = await pipeline(
  'feature-extraction',
  'Supabase/gte-small',
);

// Generate the embedding from text
const output = await pipe('Hello world', {
  pooling: 'mean',
  normalize: true,
});

// Extract the embedding output
const embedding = Array.from(output.data);
console.log(embedding);
This model supports English text only, and any input longer than 512 tokens is truncated to 512 tokens.
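One common workaround for the 512-token limit, not described in the original model card, is to split a long document into windows, embed each window separately, and average the results. The sketch below shows only the splitting and averaging steps on plain lists. It assumes the token ids come from the model's tokenizer without special tokens and reserves two positions per window for the `[CLS]` and `[SEP]` tokens that BERT-style tokenizers add; the helper names are hypothetical.

```python
def split_into_windows(token_ids, max_tokens=512, reserved=2, overlap=0):
    """Split token ids into windows that fit the limit once special tokens are added."""
    window = max_tokens - reserved
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if not token_ids:
        return []
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the current window already reaches the end of the input
        start += step
    return windows

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors, e.g. one embedding per window."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# 1200 placeholder token ids standing in for a long tokenized document.
ids = list(range(1200))
windows = split_into_windows(ids)
print([len(w) for w in windows])  # [510, 510, 180]

# Toy per-window embeddings (hypothetical values) combined into one vector.
print(mean_vector([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

Averaging is a heuristic: it keeps every part of the document represented, but it can blur distinct topics, so storing one embedding per window is often the better choice for retrieval.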