Available today: DeepSeek R1 7B & 14B distilled models for Copilot+ PCs via Azure AI Foundry – further expanding AI on the edge

At Microsoft, we believe the future of AI is happening now — spanning from the cloud to the edge. Our vision is bold: to build Windows as the ultimate platform for AI innovation, where intelligence isn’t just in the cloud but seamlessly woven throughout the system, silicon and hardware at the edge. Building on our recent announcement of bringing NPU-optimized versions of DeepSeek-R1 1.5B distilled model directly to Copilot+ PCs, we’re taking the next step forward with the availability of DeepSeek R1 7B & 14B distilled models for Copilot+ PCs via Azure AI Foundry. This milestone reinforces our commitment to delivering cutting-edge AI capabilities that are fast, efficient and built for real-world applications — helping developers, businesses and creators push the boundaries of what’s possible.

Availability starts with Copilot+ PCs powered by Qualcomm Snapdragon X, followed by Intel Core Ultra 200V and AMD Ryzen.

The ability to run 7B and 14B parameter reasoning models on Neural Processing Units (NPUs) is a significant milestone in the democratization and accessibility of artificial intelligence. This progression allows researchers, developers and enthusiasts to leverage the substantial power and functionalities of large-scale machine learning models directly from their Copilot+ PCs. These Copilot+ PCs include an NPU capable of over 40 trillion operations per second (TOPS).

NPUs are purpose-built to run AI models locally on-device with exceptional efficiency 

NPUs like those built into Copilot+ PCs are purpose-built to run AI models with exceptional efficiency, balancing speed and power consumption. They ensure sustained AI compute with minimal impact on battery life, thermal performance and resource usage. This leaves CPUs and GPUs free to perform other tasks, allowing reasoning models to operate longer and deliver superior results — all while keeping your PC running smoothly.

Efficient inferencing has heightened significance due to a new scaling law for language models, which indicates that chain-of-thought reasoning during inference can improve response quality across a wide range of tasks. The longer a model can “think,” the better its answers become. Instead of adding parameters or training data, this approach taps into additional compute at inference time for better outcomes. DeepSeek distilled models exemplify how even small pretrained models can shine with enhanced reasoning capabilities, and when coupled with the NPUs on Copilot+ PCs they unlock exciting new opportunities for innovation.
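Concretely, the distilled R1 models emit their chain of thought between <think> tags before the final answer, so an application can grant the model a larger token budget for reasoning and then separate the trace from the answer. Below is a minimal sketch of that post-processing step; the sample completion text is invented for illustration.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Separate the chain-of-thought trace (inside <think>...</think>)
    from the final answer in a DeepSeek-R1-style completion."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return reasoning, answer

# Invented example output; real completions may contain thousands of
# reasoning tokens when the model is allowed to "think" longer.
sample = "<think>37 * 12 = 37 * 10 + 37 * 2 = 370 + 74 = 444</think>The answer is 444."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```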

Reasoning emerges only in models above a certain scale, and models at that scale must think through a large number of tokens to excel at complex multi-step problems. Although the NPU hardware helps reduce inference cost, it is equally important to keep these models’ memory footprint manageable on consumer PCs, such as those with 16GB of RAM.
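To see why low-bit weights matter here, a rough back-of-envelope estimate (our own arithmetic, not measurements of the shipped models) shows how a 14B-parameter model only becomes practical on a 16GB machine once its weights are stored in roughly 4 bits each.

```python
# Rough, illustrative weight-memory estimate for a 14B-parameter model.
# Back-of-envelope numbers only; activations, KV cache and runtime overhead
# add to these figures.
params = 14e9

def weight_gib(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 2**30

print(f"fp16 weights : ~{weight_gib(16):.1f} GiB")   # ~26 GiB, exceeds 16GB of RAM
print(f"int8 weights : ~{weight_gib(8):.1f} GiB")    # ~13 GiB, leaves little headroom
print(f"int4 weights : ~{weight_gib(4):.1f} GiB")    # ~6.5 GiB, fits comfortably
```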

Pushing the boundaries of what’s possible on Windows

Our research investments have enabled us to push the boundaries of what’s possible on Windows even further, at both the system level and the model level, leading to innovations like Phi Silica. With our work on Phi Silica we were able to create a scalable platform for low-bit inference on NPUs, enabling powerful performance with minimal memory and bandwidth overhead. Combined with the data privacy offered by local compute, this puts advanced scenarios like Retrieval Augmented Generation (RAG) and model fine-tuning at the fingertips of application developers.

To enable the DeepSeek 1.5B release, we reused techniques such as QuaRot, a sliding window for fast first-token responses, and many other optimizations. We used Aqua, an internal automatic quantization tool, to quantize all the DeepSeek model variants to int4 weights with QuaRot, while retaining most of the accuracy. Using the same toolchain we used to optimize Phi Silica, we quickly integrated all of these optimizations into an efficient ONNX QDQ model with low-precision weights.
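For intuition on why a rotation helps low-bit quantization, the sketch below illustrates the core idea behind QuaRot under simplified assumptions: multiplying weights and activations by the same random orthogonal matrix leaves the layer’s output unchanged, while spreading outlier values across channels so they quantize more gracefully. This is a toy demonstration, not Aqua or the production pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4_per_channel(w: np.ndarray) -> np.ndarray:
    """Symmetric int4 per-(output-)channel quantize -> dequantize."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7)
    return q * scale

# A toy weight matrix with a few large outlier columns (common in LLM layers).
w = rng.normal(size=(256, 256)).astype(np.float32)
w[:, :4] *= 50.0
x = rng.normal(size=(8, 256)).astype(np.float32)

# Random orthogonal rotation Q: since Q @ Q.T = I,
# (x @ Q) @ (w @ Q).T equals x @ w.T up to float rounding.
q_mat, _ = np.linalg.qr(rng.normal(size=(256, 256)))

ref = x @ w.T
err_plain   = np.abs(ref - x @ quantize_int4_per_channel(w).T).mean()
err_rotated = np.abs(ref - (x @ q_mat) @ quantize_int4_per_channel(w @ q_mat).T).mean()

print(f"mean abs error, direct int4 quantization : {err_plain:.4f}")
print(f"mean abs error, rotate-then-quantize     : {err_rotated:.4f}")
```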

Like the 1.5B model, the 7B and 14B variants use 4-bit block-wise quantization for the embeddings and language model head and run these memory-access-heavy operations on the CPU. The compute-heavy transformer block, which handles context processing and token iteration, uses int4 per-channel quantization for the weights alongside int16 activations. We already see about 8 tok/sec on the 14B model (the 1.5B model, being very small, demonstrated close to 40 tok/sec), and further optimizations are coming as we apply more advanced techniques. With all this in place, these nimble language models can think longer and harder.
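The difference between the two weight schemes above is mainly how many values share a scale. Here is a minimal sketch of 4-bit block-wise (group) quantization, assuming an illustrative block size of 32 rather than the exact configuration used in the shipped models.

```python
import numpy as np

def quantize_int4_blockwise(w: np.ndarray, block: int = 32):
    """Symmetric 4-bit quantization where every `block` consecutive weights
    in a row share one floating-point scale (block size is illustrative)."""
    rows, cols = w.shape
    assert cols % block == 0
    blocks = w.reshape(rows, cols // block, block)
    scales = np.abs(blocks).max(axis=-1, keepdims=True) / 7.0   # one scale per block
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    rows = q.shape[0]
    return (q.astype(np.float32) * scales).reshape(rows, -1)

w = np.random.default_rng(1).normal(size=(64, 128)).astype(np.float32)
q, scales = quantize_int4_blockwise(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, scales)).max())
```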

This durable path to innovation has made it possible for us to more quickly optimize larger variants of DeepSeek models (7B and 14B) and will continue to enable us to bring more new models to run on Windows efficiently.

Get started today

Developers can access all distilled variants (1.5B, 7B and 14B) of DeepSeek models and run them on Copilot+ PCs by simply downloading the AI Toolkit VS Code extension. The DeepSeek model optimized in the ONNX QDQ format is available in AI Toolkit’s model catalog, pulled directly from Azure AI Foundry. You can download it locally by clicking the “Download” button. Once downloaded, experimenting with the model is as simple as opening the Playground, loading the “deepseek_r1_1_5” model and sending it prompts.
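Beyond the Playground UI, AI Toolkit can also serve downloaded models over a local OpenAI-compatible REST endpoint, so you can script prompts against the same model. The sketch below assumes that local server; the port (5272) and the exact model identifier are assumptions to verify against your AI Toolkit setup.

```python
import requests

# Assumed AI Toolkit local endpoint (OpenAI-compatible); confirm the port
# and model name in the AI Toolkit documentation and model catalog.
URL = "http://127.0.0.1:5272/v1/chat/completions"

payload = {
    "model": "deepseek_r1_1_5",   # the locally downloaded DeepSeek distilled model
    "messages": [
        {"role": "user", "content": "How many r's are in 'strawberry'? Think step by step."}
    ],
    "max_tokens": 2048,           # leave room for the <think> reasoning trace
}

resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```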

Run models across Copilot+ PCs and Azure

Copilot+ PCs offer local compute capabilities that extend the capabilities enabled by Azure, giving developers even more flexibility to train and fine-tune small language models on-device while leveraging the cloud for larger, more intensive workloads. In addition to the ONNX model optimized for Copilot+ PCs, you can also try the cloud-hosted source model in Azure AI Foundry by clicking the “Try in Playground” button under “DeepSeek R1.” AI Toolkit is part of your developer workflow as you experiment with models and get them ready for deployment. With this playground, you can effortlessly test the DeepSeek models available in Azure AI Foundry for local deployment as well. Through this, developers now have access to the most complete set of DeepSeek models available, from cloud to client, through Azure AI Foundry.
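As a complement to the on-device path, here is a minimal sketch of calling the cloud-hosted DeepSeek R1 model from the same developer workflow using the azure-ai-inference Python SDK; the endpoint, key and deployment name ("DeepSeek-R1") are placeholders you would take from your own Azure AI Foundry project.

```python
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint/key from your Azure AI Foundry project; the model
# deployment name is an assumption and should match your own deployment.
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"]),
)

response = client.complete(
    model="DeepSeek-R1",
    messages=[UserMessage(content="Summarize why NPUs help with long reasoning traces.")],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```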

Copilot+ PCs pair efficient local compute with the near-infinite compute Microsoft offers via its Azure services. With reasoning able to span the cloud and the edge, running in sustained loops on the PC and invoking the much larger brains in the cloud as needed, we are entering a new paradigm of continuous compute that creates value for our customers. The future of AI compute just got brighter! We can’t wait to see the new innovations from our developer community taking advantage of these rich capabilities. Please keep the feedback coming!