Meeting the hardware demand for cloud capacity, AI services

As developers and businesses explore the potential for Microsoft Cloud and AI capabilities, demand for those services continues to rise. Ensuring that the right infrastructure is in place at the right locations at the right time and with enough capacity to meet those needs—today and in the future—is a tall order. Our cloud supply chain team is delivering on that challenge, implementing a range of strategies to predict and deliver capacity while minimizing the risk of supply chain disruptions.

The process starts with a demand forecast. Using data science, our cloud supply chain team forecasts cloud customer demand based on millions of customer demand signals, and then plans hardware inventory based on those forecasts. We engage suppliers around the world to source hardware components, manufacture and deliver the components to the system integrators that make the hardware, ship that hardware to our more than 300 datacenters, and sustainably manage hardware end of life.
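
The forecast-then-plan step can be illustrated with a toy calculation. The sketch below is not the team's forecasting system: it assumes simple exponential smoothing as a stand-in for the data science models, and the function names, the 10% buffer, and the sample numbers are all hypothetical.

```python
from math import ceil


def smooth_forecast(history, alpha=0.5):
    """Forecast the next period with simple exponential smoothing.

    `history` is a list of past demand observations (oldest first).
    `alpha` weights recent observations more heavily as it approaches 1.
    """
    if not history:
        raise ValueError("history must not be empty")
    level = history[0]
    for observed in history[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level


def racks_to_order(forecast_units, units_per_rack, on_hand_racks, buffer=0.1):
    """Translate forecast demand into a rack order (hypothetical rule).

    Adds a safety buffer to the forecast, rounds up to whole racks,
    and subtracts racks already on hand.
    """
    needed = ceil(forecast_units * (1 + buffer) / units_per_rack)
    return max(0, needed - on_hand_racks)


# Hypothetical demand signal, in arbitrary capacity units per period.
demand_history = [100, 120, 140, 160]
forecast = smooth_forecast(demand_history)
order = racks_to_order(forecast, units_per_rack=10, on_hand_racks=5)
print(forecast, order)
```

A real planning system would draw on far more signals and account for lead times, but the shape is the same: a forecast feeds an inventory decision.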

Creating faster responses and better results

A key focus of our supply chain team is streamlining operations so we can bring new server capacity live faster in response to demand, giving us more flexibility to meet customer needs.

In the last four years, we have reduced the time it takes to bring new server capacity live by more than 80%, while also significantly increasing the number of servers delivered to our datacenters. In the 2023 fiscal year alone, we delivered 46% more server racks than we did two years earlier. We have also implemented a range of strategies that have reduced risk for our supply chain, including:

  • Transitioning from almost 100% of server manufacturing taking place in China and Taiwan to less than 50% in that region, lessening dependence on a single area of the world.
  • Working with suppliers to reduce reliance on any single country or region for hardware components, further decreasing risk exposure.
  • Focusing on sourcing hardware components from multiple suppliers to increase agility and decrease risk, with almost 95% of hardware components now multi-sourced.
  • Working closely with the Microsoft hardware engineering team to ensure that supply chain considerations (like location of suppliers and the ability to source components from multiple suppliers) are a focus early in the cloud hardware product design process.
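
Two of the measures mentioned above, the share of components that are multi-sourced and the share of manufacturing in one region, are simple ratios. The sketch below shows how such figures could be computed from a supplier list. The data layout, function names, and sample values are hypothetical and do not reflect actual Microsoft data.

```python
def multi_sourced_share(suppliers_by_component):
    """Fraction of components that have at least two distinct suppliers."""
    if not suppliers_by_component:
        return 0.0
    multi = sum(
        1 for suppliers in suppliers_by_component.values()
        if len(set(suppliers)) >= 2
    )
    return multi / len(suppliers_by_component)


def region_share(units_by_region, region):
    """Fraction of manufactured units that come from one region."""
    total = sum(units_by_region.values())
    return units_by_region.get(region, 0) / total if total else 0.0


# Hypothetical component-to-supplier mapping.
components = {
    "dimm": ["supplier-a", "supplier-b"],
    "ssd": ["supplier-a", "supplier-c", "supplier-d"],
    "nic": ["supplier-b"],
    "psu": ["supplier-c", "supplier-d"],
}
# Hypothetical server output by manufacturing region.
output = {"region-1": 40, "region-2": 35, "region-3": 25}

print(multi_sourced_share(components))   # 3 of 4 components are multi-sourced
print(region_share(output, "region-1"))
```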

These changes were critical as we faced both a huge spike in demand and multiple supply chain threats—including chip shortages, global shipping issues, shutdowns, and other challenges.

Long-term strategies for meeting customer needs

A key factor in meeting ever-increasing demand is having more control of our supply chain. Seven years ago, we began a shift toward in-house server design, and last November we announced that we are designing our own CPUs and GPUs that complement our partnerships with silicon providers like AMD, Intel, and NVIDIA. In-house design and manufacturing of hardware and silicon enable our engineers to design servers and technologies specifically for our cloud and AI workloads, and also give our cloud supply chain team more levers they can use to affect resiliency, capacity, and cost.

In the last year, customer demand for AI services and capabilities has grown rapidly. Our supply chain team has been focused on strategies to ensure we can meet today’s demand as well as projected demand in years to come.

For example, decisions made in the design phase of our GPU hardware can affect both the availability and the cost of components. Our supply chain team worked with our hardware design team to ensure we could enable multiple GPU technologies, which provides greater flexibility, and to move from niche technologies to mainstream components that are more widely available.

Expanding the supply chain

As we begin production of our own silicon in 2024 (the Azure Maia AI accelerator series and the Azure Cobalt CPU series), our cloud supply chain team is making that rollout possible. In a short timeframe, we established an entirely new supply chain for GPUs and CPUs. In addition to managing the complexities of working with a range of new suppliers, we established end-to-end processes, from planning through the repair or replacement of defective parts.

The custom silicon not only gives us better control over design and detail but also enables our supply chain to work with more suppliers in the value chain on key elements of the production process. This could result in more opportunities to refine processes and increase efficiency in the production of servers as our silicon production scales.

Innovative approaches

We deliver cloud capacity and AI services to our customers through a globally distributed infrastructure designed to bring applications closer to users, preserve data residency, and offer comprehensive compliance and resiliency options.

However, there are many variables. We have hundreds of hardware variations actively in use at any given time. In addition, every delivery of server racks to our datacenters is complex. Each datacenter has specific power requirements, space constraints, and delivery specifications. The mode of transport used for each leg of the delivery affects timing, cost, and emissions, all of which are critical factors for our supply chain planners. For example, transporting hardware by ocean freight results in much lower emissions than transporting it by air, but it takes longer.
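
The trade-off between transit time, cost, and emissions can be framed as a small selection problem. The sketch below assumes one possible rule: among the modes that arrive in time, prefer the lowest emissions and break ties on cost. The rule itself, the `TransportMode` fields, and every number are hypothetical and chosen only to mirror the ocean-versus-air contrast described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransportMode:
    name: str
    transit_days: int
    cost_per_rack: float      # hypothetical currency units
    kg_co2e_per_rack: float   # hypothetical emissions estimate


# Illustrative values only: ocean is slower but lower in emissions than air.
MODES = [
    TransportMode("ocean", transit_days=30, cost_per_rack=400.0, kg_co2e_per_rack=50.0),
    TransportMode("air", transit_days=3, cost_per_rack=3000.0, kg_co2e_per_rack=1500.0),
]


def pick_mode(modes, days_available):
    """Return the lowest-emission mode that still arrives on time.

    Ties on emissions are broken by cost. Returns None when no mode
    can meet the deadline.
    """
    feasible = [m for m in modes if m.transit_days <= days_available]
    if not feasible:
        return None
    return min(feasible, key=lambda m: (m.kg_co2e_per_rack, m.cost_per_rack))


relaxed = pick_mode(MODES, days_available=45)   # plenty of time: ocean
urgent = pick_mode(MODES, days_available=10)    # tight deadline: air
print(relaxed.name, urgent.name)
```

In practice a planner would weigh these factors per leg of the journey and against datacenter-specific delivery constraints rather than applying a single fixed rule.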

Faster insights through machine learning, large language models

The cloud supply chain team uses machine learning not only to forecast capacity demand but also to constantly monitor demand and supply and make recommendations to ensure the right hardware goes to the right datacenter at the right time to meet customer needs.

In a six-month time span last year, this automated fulfillment planning system optimized more than 1.5 million possibilities daily for 50% of our rack deliveries globally. We are expanding this program with a goal of including 95% of rack deliveries by the end of 2024.
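
To give a sense of what a fulfillment plan looks like, here is a deliberately simple sketch. It is a greedy heuristic, not the machine learning system described above: it serves the most urgent request first and assigns the first available rack that matches the hardware type and fits the power budget. All class names, fields, and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    rack_id: str
    hardware_type: str
    power_kw: float


@dataclass
class Request:
    datacenter: str
    hardware_type: str
    needed_by_day: int
    power_budget_kw: float


def plan_fulfillment(racks, requests):
    """Greedily match racks to requests, most urgent request first.

    Returns a list of (datacenter, rack_id) pairs. Requests that cannot
    be matched are simply left out of the plan.
    """
    available = list(racks)
    plan = []
    for req in sorted(requests, key=lambda r: r.needed_by_day):
        for rack in available:
            fits_type = rack.hardware_type == req.hardware_type
            fits_power = rack.power_kw <= req.power_budget_kw
            if fits_type and fits_power:
                plan.append((req.datacenter, rack.rack_id))
                available.remove(rack)  # safe: we stop iterating right after
                break
    return plan


racks = [Rack("r1", "gpu", 40), Rack("r2", "cpu", 15), Rack("r3", "gpu", 25)]
requests = [
    Request("dc-west", "gpu", needed_by_day=10, power_budget_kw=30),
    Request("dc-east", "gpu", needed_by_day=5, power_budget_kw=45),
    Request("dc-north", "cpu", needed_by_day=7, power_budget_kw=20),
]
plan = plan_fulfillment(racks, requests)
print(plan)
```

A greedy pass like this considers one option at a time. A system that evaluates millions of possibilities would instead search across many candidate assignments and score each against timing, cost, and other constraints.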

Currently, we are exploring large language model solutions in our operations. One pilot is designed to help manage the complexities of meeting cloud demand worldwide, enabling planners to quickly get answers to critical questions and then evaluate and execute alternative solutions. For example, a planner can ask the system for the reasons behind a delivery delay. In the past, answering this kind of question could take 2.5 days and require the engineering team to investigate why the system had flagged a delay. Now, planners can get the information they need in just a few minutes.

Always streamlining, always improving

Our innovations and the progress we’ve made are being recognized in the supply chain industry. In 2023, Microsoft was ranked #7 on the annual Gartner Supply Chain Top 25, which identifies the world’s leading supply chain organizations, highlights trends, and shares best practices. While this is an honor, it’s also given us many opportunities to share what we’ve learned with supply chain leaders in other companies and to learn from their success.

As businesses around the world transform their operations with the cloud and new technologies like generative AI, we’re excited to support them on their journey with agile and resilient infrastructure.

Learn More

Cloud and AI are changing the way we work—and live. Explore the global datacenters that power Microsoft cloud and AI services:

The post Meeting the hardware demand for cloud capacity, AI services appeared first on Microsoft Azure Blog.