par nvidia
Open source · 563k downloads · 34 likes
Le modèle Llama 3.1 8B Instruct FP8 est une version optimisée et quantifiée du modèle Llama 3.1 8B Instruct, spécialement conçue pour l'inférence efficace. Il s'agit d'un modèle de langage auto-régressif basé sur une architecture transformer, capable de comprendre et de générer du texte de manière fluide et contextuelle. Grâce à sa quantification en FP8, il réduit de moitié l'espace de stockage et les besoins en mémoire GPU tout en offrant des performances accrues, avec un gain de vitesse d'environ 1,3 fois sur des GPU NVIDIA H100. Ses principales capacités incluent la génération de texte, la compréhension de langage, le raisonnement logique et la réponse à des questions complexes, tout en maintenant une haute précision sur des tâches variées. Ce modèle est particulièrement adapté aux déploiements commerciaux ou non commerciaux nécessitant une grande efficacité énergétique et des performances optimisées. Il se distingue par sa compatibilité avec des frameworks comme TensorRT-LLM et vLLM, ainsi que par son support des longues séquences jusqu'à 128 000 tokens, ce qui le rend idéal pour des applications comme les chatbots, l'analyse de documents ou l'assistance automatisée.
The NVIDIA Llama 3.1 8B Instruct FP8 model is the quantized version of the Meta's Llama 3.1 8B Instruct model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check here. The NVIDIA Llama 3.1 8B Instruct FP8 model is quantized with TensorRT Model Optimizer.
This model is ready for commercial and non-commercial use.
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to Non-NVIDIA (Meta-Llama-3.1-8B-Instruct) Model Card.
Architecture Type: Transformers
Network Architecture: Llama3.1
Input Type(s): Text
Input Format(s): String
Input Parameters: Sequences
Other Properties Related to Input: Context length up to 128K
Output Type(s): Text
Output Format: String
Output Parameters: Sequences
Other Properties Related to Output: N/A
Supported Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
Preferred Operating System(s):
The model is quantized with nvidia-modelopt v0.27.0
Engine: Tensor(RT)-LLM or vLLM
Test Hardware: H100
This model was obtained by quantizing the weights and activations of Meta-Llama-3.1-8B-Instruct to FP8 data type, ready for inference with TensorRT-LLM and vLLM. Only the weights and activations of the linear operators within transformers blocks are quantized. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. On H100, we achieved 1.3x speedup.
To deploy the quantized checkpoint with TensorRT-LLM, follow the sample commands below with the TensorRT-LLM GitHub repo:
python examples/llama/convert_checkpoint.py --model_dir Llama-3.1-8B-Instruct-FP8 --output_dir /ckpt --use_fp8
trtllm-build --checkpoint_dir /ckpt --output_dir /engine
Please refer to the TensorRT-LLM benchmarking documentation for details.
| Precision | MMLU | GSM8K (CoT) | ARC Challenge | IFEVAL | TPS |
| BF16 | 69.4 | 84.5 | 83.4 | 80.4 | 8,579.93 |
| FP8 | 68.7 | 83.1 | 83.3 | 81.8 | 11,062.90 |
We benchmarked with tensorrt-llm v0.13 on 8 H100 GPUs, using batch size 1024 for the throughputs with in-flight batching enabled. We achieved ~1.3x speedup with FP8.
To deploy the quantized checkpoint with vLLM, follow the instructions below:
quantization=modelopt flag must be passed into the config while initializing the LLM Engine.Example deployment on H100:
from vllm import LLM, SamplingParams
model_id = "nvidia/Llama-3.1-8B-Instruct-FP8"
sampling_params = SamplingParams(temperature=0.8, top_p=0.9)
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
llm = LLM(model=model_id, quantization="modelopt")
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
This model can be deployed with an OpenAI Compatible Server via the vLLM backend. Instructions here.