by microsoft
The Phi-3 Vision 128K Instruct model is a lightweight, high-performing multimodal solution designed to process both text and images with high accuracy. With a context length extended to 128,000 tokens, it excels at tasks requiring deep understanding of complex visual documents, such as OCR, chart and table analysis, and general image understanding. Optimized for memory- and compute-constrained environments, it is particularly well suited to applications where latency and efficiency are critical, while maintaining strict instruction adherence and strengthened safety measures. Intended for commercial and research use in English, it stands out for its versatility and its ability to serve as a solid building block for generative AI features that incorporate visual data.
🎉 Phi-3.5: [mini-instruct] ; [MoE-instruct] ; [vision-instruct]
The Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data in both text and vision. The model belongs to the Phi-3 model family, and this multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
Resources and Technical Documentation:
| | Short Context | Long Context |
|---|---|---|
| Mini | 4K [HF] ; [ONNX] ; [GGUF] | 128K [HF] ; [ONNX] |
| Small | 8K [HF] ; [ONNX] | 128K [HF] ; [ONNX] |
| Medium | 4K [HF] ; [ONNX] | 128K [HF] ; [ONNX] |
| Vision | | 128K [HF] ; [ONNX] |
Primary use cases
The model is intended for broad commercial and research use in English. It is suitable for general-purpose AI systems and applications with visual and text input capabilities which require:

1. Memory/compute constrained environments
2. Latency-bound scenarios
3. General image understanding
4. OCR
5. Chart and table understanding
Our model is designed to accelerate research on efficient language and multimodal models, for use as a building block for generative AI powered features.
Use case considerations
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
Phi-3-Vision-128K-Instruct has been integrated into the development version (4.40.2) of transformers. Until the official version is released through pip, ensure that you are doing one of the following:

- When loading the model, pass `trust_remote_code=True` as an argument to the `from_pretrained()` function.
- Update your local transformers to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. This command is an alternative to cloning and installing from source.

The current transformers version can be verified with: `pip list | grep transformers`.
Examples of required packages:

```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.40.2
```
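As a quick sanity check, the installed versions can be compared against the pins above. The sketch below uses the standard-library `importlib.metadata`; the `pins` dict shown is an illustrative subset, not the full list:

```python
# Compare installed package versions against the pins listed above.
# The `pins` dict here is an illustrative subset, not the full list.
from importlib import metadata

pins = {"numpy": "1.24.4", "torch": "2.3.0", "transformers": "4.40.2"}

def version_mismatches(pins):
    """Return {name: installed_version_or_None} for every pin not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package missing entirely
        if installed != wanted:
            mismatches[name] = installed
    return mismatches
```

An empty return value means every pinned version matches the local environment.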
Phi-3-Vision-128K-Instruct is also available in Azure AI Studio.
Given the nature of the training data, the Phi-3-Vision-128K-Instruct model is best suited to a single image input, with prompts using the chat format below. You can provide the prompt as a single image with a generic template as follows:
```
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n
```

where the model generates the text after `<|assistant|>`. For a multi-turn conversation, the prompt can be formatted as follows:

```
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
```
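For illustration, the template above can be assembled by hand. `build_prompt` below is a hypothetical helper, not part of the model's API; in practice the processor's `apply_chat_template` does this for you:

```python
# Hypothetical helper that assembles the Phi-3-Vision chat template by hand.
def build_prompt(turns):
    """turns: list of (role, text) pairs, role being 'user' or 'assistant'."""
    parts = [f"<|{role}|>\n{text}<|end|>\n" for role, text in turns]
    parts.append("<|assistant|>\n")  # the model generates after this tag
    return "".join(parts)

prompt = build_prompt([
    ("user", "<|image_1|>\n{prompt_1}"),
    ("assistant", "{response_1}"),
    ("user", "{prompt_2}"),
])
```

The resulting string matches the multi-turn template shown above, ending with the `<|assistant|>` tag that cues generation.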
This code snippet shows how to get started quickly with running the model on a GPU:
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# use _attn_implementation='eager' to disable flash attention
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto", _attn_implementation='flash_attention_2')

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."},
    {"role": "user", "content": "Provide insightful questions to spark discussion."}
]

url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# remove input tokens
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(response)
```
Additional basic examples are provided here.
We recommend users take a look at the Phi-3 CookBook finetuning recipe for Vision.
Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include:
Our training data includes a wide variety of sources, and is a combination of
The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data.
More details can be found in the Phi-3 Technical Report.
To understand the capabilities, we compare Phi-3-Vision-128K-Instruct with a set of models over a variety of zero-shot benchmarks using our internal benchmark platform.
| Benchmark | Phi-3 Vision-128K-In | LLaVA-1.6 Vicuna-7B | QWEN-VL Chat | Llama3-Llava-Next-8B | Claude-3 Haiku | Gemini 1.0 Pro V | GPT-4V-Turbo |
|---|---|---|---|---|---|---|---|
| MMMU | 40.4 | 34.2 | 39.0 | 36.4 | 40.7 | 42.0 | 55.5 |
| MMBench | 80.5 | 76.3 | 75.8 | 79.4 | 62.4 | 80.0 | 86.1 |
| ScienceQA | 90.8 | 70.6 | 67.2 | 73.7 | 72.0 | 79.7 | 75.7 |
| MathVista | 44.5 | 31.5 | 29.4 | 34.8 | 33.2 | 35.0 | 47.5 |
| InterGPS | 38.1 | 20.5 | 22.3 | 24.6 | 32.1 | 28.6 | 41.0 |
| AI2D | 76.7 | 63.1 | 59.8 | 66.9 | 60.3 | 62.8 | 74.7 |
| ChartQA | 81.4 | 55.0 | 50.9 | 65.8 | 59.3 | 58.0 | 62.3 |
| TextVQA | 70.9 | 64.6 | 59.4 | 55.7 | 62.7 | 64.7 | 68.1 |
| POPE | 85.8 | 87.2 | 82.6 | 87.0 | 74.4 | 84.2 | 83.7 |
Note that by default, the Phi-3-Vision-128K model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
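On GPUs that do not support flash attention, the comment in the loading snippet above suggests falling back to `_attn_implementation='eager'`. One way to choose at runtime is to probe for the `flash_attn` package; this is a sketch under that assumption, not an official API:

```python
# Choose the attention implementation based on whether flash-attn is importable.
# Falling back to 'eager' follows the comment in the loading snippet above.
import importlib.util

def pick_attn_implementation():
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"

# e.g. AutoModelForCausalLM.from_pretrained(model_id, ...,
#          _attn_implementation=pick_attn_implementation())
```

Note that importability alone does not guarantee the GPU architecture is supported; verify against the tested hardware list for your deployment.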
The model is licensed under the MIT license.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
Data summary card: https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/blob/main/data_summary_card.md