Developers can grab already-quantized versions of Phi-3-mini (with variants for the 4k and 128k context lengths). They can now also get Phi-3-medium (4k and 128k) and Mistral v0.2. Stay tuned for additional pre-quantized models! We’ve also shipped a Gradio interface to make it easier to test these models with the new ONNX Runtime Generate() API. Learn more.
Be sure to check out our Build sessions to learn more. See below for details.
See here to learn what our hardware vendor partners have to say:
- AMD: https://community.amd.com/t5/ai/reduce-memory-footprint-and-improve-performance-running-llms-on/ba-p/686157
- Intel: https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Accelerating-Language-Models-Intel-and-Microsoft-Collaborate-to/post/1598013
- NVIDIA: https://blogs.nvidia.com/blog/microsoft-build-optimized-ai-developers
What is quantization?
Memory bandwidth is often a bottleneck for getting models to run on entry-level and older hardware, especially when it comes to language models. This means that making language models smaller directly translates to increasing the breadth of devices developers can target.
There’s been a lot of research into reducing model size through quantization, a process that reduces the precision and therefore size of model weights.
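To make the size impact concrete, here is a back-of-the-envelope calculation (a sketch assuming a 3.8-billion-parameter model, roughly Phi-3-mini's size; real on-disk sizes also include quantization scales and metadata):

```python
# Approximate weight memory for a 3.8B-parameter model
# at different weight precisions.
PARAMS = 3.8e9

def weights_gb(bits_per_weight: float) -> float:
    """Approximate weight memory in gigabytes."""
    return PARAMS * bits_per_weight / 8 / 1e9

fp32 = weights_gb(32)   # ~15.2 GB
fp16 = weights_gb(16)   # ~7.6 GB
int4 = weights_gb(4)    # ~1.9 GB

print(f"fp32: {fp32:.1f} GB, fp16: {fp16:.1f} GB, int4: {int4:.1f} GB")
```

Dropping from 16-bit to 4-bit weights is a 4x reduction, which is the difference between a model that only fits on high-end discrete GPUs and one that fits in the memory budget of an integrated GPU.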
Our goal is to ensure scalability while also maintaining model accuracy, so we integrated support for models quantized with Activation-aware Weight Quantization (AWQ). AWQ is a technique that lets us reap the memory savings of quantization with only a minimal impact on accuracy. It achieves this by identifying the top 1% of salient weights that are necessary for maintaining model accuracy and quantizing the remaining 99% of weights. This leads to much less accuracy loss with AWQ compared to other quantization techniques.
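The idea can be illustrated with a toy sketch in Python. This is not the real AWQ algorithm (which selects salient weights from activation statistics and rescales them); here weight magnitude stands in for saliency, and the "salient" 1% are simply kept at full precision while the rest are rounded to a signed 4-bit grid:

```python
import random

def quantize_awq_style(weights, salient_frac=0.01, bits=4):
    """Toy sketch: keep the most salient weights in full precision
    and round the rest to a small signed integer grid.
    Magnitude is a stand-in for saliency here; real AWQ derives
    saliency from activation statistics."""
    n_salient = max(1, int(len(weights) * salient_frac))
    # Indices of the "salient" weights (largest magnitude here).
    salient = set(sorted(range(len(weights)),
                         key=lambda i: abs(weights[i]),
                         reverse=True)[:n_salient])
    levels = 2 ** (bits - 1) - 1              # e.g. 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / levels or 1.0
    out = []
    for i, w in enumerate(weights):
        if i in salient:
            out.append(w)                     # kept in full precision
        else:
            out.append(round(w / scale) * scale)  # snapped to 4-bit grid
    return out

random.seed(0)
w = [random.gauss(0, 1) for _ in range(1000)]
q = quantize_awq_style(w)
err = sum(abs(a - b) for a, b in zip(w, q)) / len(w)
print(f"mean abs error: {err:.4f}")
```

Even in this crude form, protecting the few largest-impact weights keeps the reconstruction error small while 99% of the weights shrink to 4 bits each.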
The average person reads up to 5 words/second. Thanks to the significant memory wins from AWQ, Phi-3-mini runs at this speed or faster on older discrete GPUs and even laptop integrated GPUs. This translates into being able to run Phi-3-mini on hundreds of millions of devices!
Check out our Build talk below to see this in action!
Perplexity measurements
Perplexity is a measure used to quantify how well a model predicts a sample. Without getting into the math of it all, a lower perplexity score means the model is more certain about its predictions and suggests that the model’s probability distribution is closer to the true distribution of the data.
Perplexity can be thought of as a way to quantify the average number of branches in front of a model at each decision point. At each step, a lower perplexity means the model has fewer, more confident choices to make, reflecting a more refined understanding of the topic. A higher perplexity means more, less confident choices, and therefore output that is less predictable, less relevant, and more varied in quality.
As the data below shows, AWQ causes only a small increase in perplexity, and therefore only a small loss in model accuracy. In return, using AWQ means 4x smaller model weights, leading to a dramatic increase in the number of devices that can run Phi-3-mini!
| Model variant | Dataset | Base model perplexity | AWQ perplexity | Difference |
|---|---|---|---|---|
| Phi-3-mini 128k | wikitext2 | 14.42 | 14.81 | 0.39 |
| Phi-3-mini 128k | ptb | 31.39 | 33.63 | 2.24 |
| Phi-3-mini 4k | wikitext2 | 15.83 | 16.52 | 0.69 |
| Phi-3-mini 4k | ptb | 31.98 | 34.3 | 2.32 |
Learn more
Be sure to check out these sessions at Build to learn more:
- BRK240: Bring AI experiences to all your Windows Devices
- BRK247: Create Generative AI experiences using Phi
- LAB371: Test Drive AI on Windows with DirectML, ONNX Runtime, and Olive
Get Started
Check out the ONNX Runtime Generate() API repo to get started today: https://github.com/microsoft/onnxruntime-genai
See here for our chat app with a handy Gradio interface: https://github.com/microsoft/onnxruntime-genai/tree/main/examples/chat_app
This lets developers choose from different types of language models that work best for their specific use case. Stay tuned for more!
Drivers
We recommend upgrading to the latest drivers for the best performance.
- AMD: improved driver acceleration for generative AI including large language models (AMD Software: Adrenalin Edition 23.40.27.06 for DirectML)
- Intel: excited to partner with Microsoft; a publicly available WHQL-certified driver with full support, optimized for these AWQ scenarios across a wide range of hardware, can be downloaded here
- NVIDIA: R555 Game Ready, Studio or NVIDIA RTX Enterprise