EAGLE LLaMA3 Instruct 8B

by yuhuili

Open source · 87k downloads · 5 likes


About

EAGLE LLaMA3 Instruct 8B is an inference model optimized for fast, efficient text generation, based on an approach called EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency). It significantly speeds up generation for large language models (LLMs) while keeping output quality identical to that of standard decoding, by extrapolating the second-top-layer contextual feature vectors. It offers up to 5.6x faster generation than vanilla decoding and is compatible with a range of hardware and software optimization techniques. It suits applications that need fast responses, such as chatbots, virtual assistants, or natural language processing systems, and it runs in resource-constrained environments as well as on high-end infrastructure. The approach has been certified and tested by third parties, which makes it a reliable choice for fast, cost-effective deployments.

Documentation

EAGLE


EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a new baseline for fast decoding of Large Language Models (LLMs) with provable performance maintenance. This approach involves extrapolating the second-top-layer contextual feature vectors of LLMs, enabling a significant boost in generation efficiency.

EAGLE is:
- certified by a third-party evaluation as the fastest speculative method so far.
- achieving a 2x speedup on gpt-fast.
- 3x faster than vanilla decoding (13B).
- 2x faster than Lookahead (13B).
- 1.6x faster than Medusa (13B).
- provably maintaining consistency with vanilla decoding in the distribution of generated texts.
- trainable (within 1-2 days) and testable on 8x RTX 3090 GPUs, so even the GPU-poor can afford it.
- combinable with other parallel techniques such as vLLM, DeepSpeed, Mamba, FlashAttention, quantization, and hardware optimization.
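
The consistency claim in the list above rests on the accept/reject rule of speculative sampling. The following is a minimal sketch of that rule for a single drafted token, not code from the EAGLE repository: the function names `speculative_accept` and `sample_with_draft` are hypothetical, and the distributions are plain Python lists for illustration. A drafted token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities of that token; on rejection, a replacement is drawn from the renormalized residual max(0, p - q).

```python
import random


def speculative_accept(draft_token, p_target, q_draft, rng=random):
    """Decide which token to emit for one drafted position.

    p_target and q_draft are probability lists over the same vocabulary.
    draft_token must have been sampled from q_draft, so q_draft[draft_token] > 0.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, False


def sample_with_draft(p_target, q_draft, rng=random):
    """Draw a draft token from q_draft, then apply the accept/reject rule."""
    draft_token = rng.choices(range(len(q_draft)), weights=q_draft, k=1)[0]
    token, _ = speculative_accept(draft_token, p_target, q_draft, rng)
    return token
```

With `p_target = [0.5, 0.3, 0.2]` and `q_draft = [0.2, 0.2, 0.6]`, repeated calls to `sample_with_draft` emit tokens with frequencies that approach `p_target`, even though every draft comes from `q_draft`. This is the sense in which the output distribution matches vanilla decoding.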

EAGLE-2 uses the confidence scores from the draft model to approximate acceptance rates, dynamically adjusting the draft tree structure, which further enhances performance.

EAGLE-2 is:
- 4x faster than vanilla decoding (13B).
- 1.4x faster than EAGLE-1 (13B).
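
One way to picture the dynamic draft tree is as a selection problem: score every candidate node by the product of the draft confidences along its path, then keep the highest-scoring nodes within a fixed budget. The sketch below illustrates that idea under stated assumptions and is not the EAGLE-2 implementation: the `Node` class, `path_values`, `select_draft_nodes`, the `budget` parameter, and the example tree are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    token: str
    confidence: float  # draft model probability of this token given its prefix
    children: list = field(default_factory=list)


def path_values(root):
    """Yield (value, path) for every node below root.

    value is the product of confidences along the path, used here as a
    stand-in for the chance that the whole path is accepted.
    """
    stack = [(c, c.confidence, (c.token,)) for c in root.children]
    while stack:
        node, value, path = stack.pop()
        yield value, path
        for child in node.children:
            stack.append((child, value * child.confidence, path + (child.token,)))


def select_draft_nodes(root, budget):
    """Keep the `budget` nodes with the highest path value, best first."""
    return heapq.nlargest(budget, path_values(root), key=lambda item: item[0])


def example_tree():
    """A tiny hand-made candidate tree with made-up confidences."""
    return Node("<root>", 1.0, [
        Node("the", 0.6, [Node("cat", 0.7), Node("dog", 0.2)]),
        Node("a", 0.3, [Node("cat", 0.9)]),
        Node("an", 0.1, [Node("owl", 0.9)]),
    ])
```

Calling `select_draft_nodes(example_tree(), 4)` keeps the paths `("the",)`, `("the", "cat")`, `("a",)`, and `("a", "cat")`. Because a confidence is at most 1, a child never scores higher than its parent, so the kept nodes form a connected tree as long as there are no ties at the cutoff. The shape of that tree changes with the confidences rather than being fixed in advance.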

EAGLE-3 removes the feature prediction constraint in EAGLE and simulates this process during training using training-time testing. Considering that top-layer features are limited to next-token prediction, EAGLE-3 replaces them with a fusion of low-, mid-, and high-level semantic features. EAGLE-3 further improves generation speed while ensuring lossless performance.

EAGLE-3 is:
- 5.6x faster than vanilla decoding (13B).
- 1.8x faster than EAGLE-1 (13B).
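
The fusion of low-, mid-, and high-level features described for EAGLE-3 can be illustrated with a small sketch. It assumes that fusion means concatenating three feature vectors of size k and projecting the result back to size k with a linear layer; that reading, the function name `fuse_features`, and the toy weights are assumptions for illustration, not code from the EAGLE repository.

```python
def fuse_features(low, mid, high, weight, bias):
    """Concatenate three k-dim feature vectors and project back to k dims.

    weight is a list of k rows, each of length 3k; bias has length k.
    """
    concat = list(low) + list(mid) + list(high)  # length 3k
    if len(weight) != len(bias):
        raise ValueError("weight and bias must have the same number of rows")
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must have length 3k")
    # One dot product per output dimension, plus the bias term.
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]


# Toy example with k = 2: each output dimension sums the matching
# dimension of the three inputs, then adds a bias.
low, mid, high = [1, 2], [3, 4], [5, 6]
weight = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
]
bias = [1, -1]
fused = fuse_features(low, mid, high, weight, bias)  # [10, 11]
```

In a real draft model the projection weights would be learned during training; the fixed weights here only make the arithmetic easy to follow.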

[Figure: demo animation (GIF)]

In the demo, inference is conducted on 2x RTX 3090 GPUs at fp16 precision using the Vicuna 13B model.

Support

EAGLE has been merged into the following mainstream LLM serving frameworks (listed in alphabetical order).

Reference

For technical details and full experimental results, please check the paper of EAGLE, the paper of EAGLE-2, and the paper of EAGLE-3.


@inproceedings{li2024eagle,
    author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
    title = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
    booktitle = {International Conference on Machine Learning},
    year = {2024}
}
@inproceedings{li2024eagle2,
    author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
    title = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2024}
}
@inproceedings{li2025eagle3,
    author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
    title = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
    booktitle = {Annual Conference on Neural Information Processing Systems},
    year = {2025}
}
