By Microsoft
The Phi-3 Vision 128K Instruct model is a lightweight yet powerful multimodal solution designed to process both text and images with high precision. With an extended context length of 128,000 tokens, it excels in tasks requiring deep understanding of complex visual documents, such as OCR, chart or table analysis, and general image comprehension. Optimized for memory- and compute-constrained environments, it is particularly well-suited for applications where low latency and efficiency are critical, while maintaining strict adherence to instructions and enhanced security measures. Ideal for commercial or research use in English, it stands out for its versatility and ability to serve as a solid foundation for generative AI features that integrate visual data.
🎉 Phi-3.5: [mini-instruct]; [MoE-instruct] ; [vision-instruct]
The Phi-3-Vision-128K-Instruct is a lightweight, state-of-the-art open multimodal model built upon datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data in both text and vision. The model belongs to the Phi-3 model family, and this multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.
Resources and Technical Documentation:
| | Short Context | Long Context |
|---|---|---|
| Mini | 4K [HF] ; [ONNX] ; [GGUF] | 128K [HF] ; [ONNX] |
| Small | 8K [HF] ; [ONNX] | 128K [HF] ; [ONNX] |
| Medium | 4K [HF] ; [ONNX] | 128K [HF] ; [ONNX] |
| Vision | | 128K [HF] ; [ONNX] |
Primary use cases
The model is intended for broad commercial and research use in English. It is suited for general-purpose AI systems and applications with visual and text input capabilities that require memory- or compute-constrained environments, latency-bound scenarios, general image understanding, OCR, or chart and table understanding.
Our model is designed to accelerate research on efficient language and multimodal models, for use as a building block for generative AI powered features.
Use case considerations
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
Phi-3-Vision-128K-Instruct has been integrated into the development version (4.40.2) of `transformers`. Until the official version is released through pip, ensure that you are doing one of the following:

- When loading the model, pass `trust_remote_code=True` as an argument to the `from_pretrained()` function.
- Update your local `transformers` to the development version: `pip uninstall -y transformers && pip install git+https://github.com/huggingface/transformers`. This command is an alternative to cloning and installing from source.

The currently installed `transformers` version can be verified with `pip list | grep transformers`.
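As a quick sanity check, the minimum-version requirement above can also be verified in code. The helper below is a small sketch using only the standard library; a real project would prefer `packaging.version.parse`, which also handles pre-release suffixes properly:

```python
def version_at_least(installed: str, required: str = "4.40.2") -> bool:
    """Compare dotted version strings numerically (sketch only; pre-release
    suffixes such as '.dev0' are simply ignored)."""
    def to_tuple(v: str):
        parts = []
        for p in v.split(".")[:3]:
            if p.isdigit():
                parts.append(int(p))
        return tuple(parts)
    return to_tuple(installed) >= to_tuple(required)

print(version_at_least("4.41.0"))  # newer than 4.40.2
print(version_at_least("4.39.3"))  # too old
```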
Examples of required packages:

```
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.40.2
```
Phi-3-Vision-128K-Instruct is also available in Azure AI Studio.
Given the nature of the training data, the Phi-3-Vision-128K-Instruct model is best suited for a single image input with prompts using the chat format as follows. You can provide the prompt with a single image using this generic template:

```
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n
```

where the model generates the text after `<|assistant|>`. For a multi-turn conversation, the prompt can be formatted as follows:

```
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
```
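For illustration, the multi-turn template above can be assembled by hand. The `build_prompt` helper below is hypothetical (not part of the model's API); in practice, `processor.tokenizer.apply_chat_template(..., add_generation_prompt=True)` does this for you:

```python
def build_prompt(messages):
    """Assemble a Phi-3-Vision chat prompt string from a list of
    role/content dicts, mirroring the template shown above."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    parts.append("<|assistant|>\n")  # generation prompt: model continues here
    return "".join(parts)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "A bar chart."},
    {"role": "user", "content": "Summarize it."},
]
print(build_prompt(messages))
```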
The following code snippet shows how to quickly get started with running the model on a GPU:

```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# Use _attn_implementation='eager' to disable flash attention
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%."},
    {"role": "user", "content": "Provide insightful questions to spark discussion."},
]

url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Remove input tokens from the output before decoding
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
```
Additional basic examples are provided here.
We recommend users take a look at the Phi-3 CookBook fine-tuning recipe for Vision.
Like other models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
Developers should apply responsible AI best practices and are responsible for ensuring that a specific use case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include:
Our training data includes a wide variety of sources, and is a combination of
The data collection process involved sourcing information from publicly available documents, with a meticulous approach to filtering out undesirable documents and images. To safeguard privacy, we carefully filtered various image and text data sources to remove or scrub any potentially personal data from the training data.
More details can be found in the Phi-3 Technical Report.
To understand the capabilities, we compare Phi-3-Vision-128K-Instruct with a set of models over a variety of zero-shot benchmarks using our internal benchmark platform.
| Benchmark | Phi-3 Vision-128K-In | LLaVA-1.6 Vicuna-7B | QWEN-VL Chat | Llama3-LLaVA-Next-8B | Claude-3 Haiku | Gemini 1.0 Pro V | GPT-4V-Turbo |
|---|---|---|---|---|---|---|---|
| MMMU | 40.4 | 34.2 | 39.0 | 36.4 | 40.7 | 42.0 | 55.5 |
| MMBench | 80.5 | 76.3 | 75.8 | 79.4 | 62.4 | 80.0 | 86.1 |
| ScienceQA | 90.8 | 70.6 | 67.2 | 73.7 | 72.0 | 79.7 | 75.7 |
| MathVista | 44.5 | 31.5 | 29.4 | 34.8 | 33.2 | 35.0 | 47.5 |
| InterGPS | 38.1 | 20.5 | 22.3 | 24.6 | 32.1 | 28.6 | 41.0 |
| AI2D | 76.7 | 63.1 | 59.8 | 66.9 | 60.3 | 62.8 | 74.7 |
| ChartQA | 81.4 | 55.0 | 50.9 | 65.8 | 59.3 | 58.0 | 62.3 |
| TextVQA | 70.9 | 64.6 | 59.4 | 55.7 | 62.7 | 64.7 | 68.1 |
| POPE | 85.8 | 87.2 | 82.6 | 87.0 | 74.4 | 84.2 | 83.7 |
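To make the table easier to interrogate, the snippet below re-derives two summary figures from it: the benchmarks on which Phi-3 Vision outscores GPT-4V-Turbo, and its overall average. The scores are copied from the table above; this is a convenience sketch, not part of the evaluation harness:

```python
# Scores copied from the benchmark table above
phi3_vision = {
    "MMMU": 40.4, "MMBench": 80.5, "ScienceQA": 90.8, "MathVista": 44.5,
    "InterGPS": 38.1, "AI2D": 76.7, "ChartQA": 81.4, "TextVQA": 70.9,
    "POPE": 85.8,
}
gpt4v_turbo = {
    "MMMU": 55.5, "MMBench": 86.1, "ScienceQA": 75.7, "MathVista": 47.5,
    "InterGPS": 41.0, "AI2D": 74.7, "ChartQA": 62.3, "TextVQA": 68.1,
    "POPE": 83.7,
}

# Benchmarks where the smaller Phi-3 Vision model scores higher
wins = sorted(b for b in phi3_vision if phi3_vision[b] > gpt4v_turbo[b])
avg_phi = sum(phi3_vision.values()) / len(phi3_vision)

print(wins)
print(round(avg_phi, 1))
```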
Note that by default, the Phi-3-Vision-128K model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
The model is licensed under the MIT license.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
[Data summary card](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/blob/main/data_summary_card.md)