by cnmoro
This model, *nomic-embed-text-v2-moe-distilled-high-quality*, is a distilled and optimized version of *nomic-embed-text-v2-moe* that produces high-quality text embeddings. It maps texts to dense 768-dimensional vectors that capture their semantics for tasks such as search, classification, and similarity comparison. The distillation process, trained on 23 million data triplets, preserves performance while reducing model complexity, making the result lighter and more efficient. Typical use cases include information retrieval, document analysis, and comparing textual content, where precise, contextualized representations are a major advantage. What sets it apart is its distillation method, which combines the *Model2Vec* approach with training on a large dataset, balancing performance and efficiency.
This Model2Vec model was created with Tokenlearn, using *nomic-embed-text-v2-moe* as the base model.
The output dimension is 768.
The evaluation in the model card was run with this distilled model, not the original.
This was not a simple Model2Vec distillation: the process involved generating embeddings for 23M triplets (MS MARCO) with the original model, then training the Tokenlearn model on them, with the Nomic model as a base.
Load this model with the `model2vec` library:

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("cnmoro/nomic-embed-text-v2-moe-distilled-high-quality")

# Compute text embeddings
embeddings = model.encode(["Example sentence"])
```
Or with the `sentence-transformers` library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("cnmoro/nomic-embed-text-v2-moe-distilled-high-quality")

# Compute text embeddings
embeddings = model.encode(["Example sentence"])
```
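Once computed, the 768-dimensional embeddings can be compared with cosine similarity for search or similarity tasks. A minimal sketch using NumPy (the random vectors below are stand-ins for real `model.encode(...)` outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in 768-dimensional vectors; in practice these would come from model.encode(...).
rng = np.random.default_rng(0)
query = rng.standard_normal(768)
doc = rng.standard_normal(768)

# Value lies in [-1, 1]; higher means more semantically similar
score = cosine_similarity(query, doc)
```

Ranking documents by this score against a query embedding is the basic retrieval loop these embeddings are designed for.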