Accelerating industry-wide innovations in datacenter infrastructure and security

Rapid technological transformation has never been more crucial to furnishing the cloud infrastructure that the era of AI demands. To deliver for our customers while moving innovation forward, we can learn from technological shifts of the past and see the critical role of community-led innovation and industry standardization. For the past decade, Microsoft has driven this kind of deep collaboration through cross-industry organizations like the Open Compute Project (OCP). As a result, we continue to advance hardware innovation at every layer of the computing stack, from server and rack architecture, networking and storage, and reliability, availability, and serviceability (RAS) designs to new supply chain assessment frameworks that ensure security,[1] sustainability,[2] and reliability[3] across the cloud value chain.

As we continue to innovate in the era of AI, we are excited to return to the OCP Global Summit this year with more contributions to support ecosystem innovation, from new power and cooling solutions that address the changing profile of AI datacenters to new hardware security frameworks that put trust and resiliency at the core of our infrastructure for accelerated computing.

Evolving datacenter cooling with modular systems designed for global deployability

As AI demands grow, we are reimagining our datacenters with a focus on increasing rack density and enhancing cooling efficiency. Last fall, when we announced the Azure Maia 100 system, we also introduced a dedicated liquid cooling “sidekick”, a closed-loop design that uses recirculated fluid to reduce heat. We’ve continued down the path of cooling innovation since then, working with partners to develop new datacenter cooling techniques that can solve for growing AI power profiles while addressing ease of deployability. We’re pleased to be contributing the designs for an advanced liquid cooling heat exchanger unit to OCP so that the whole community can benefit from learnings in liquid cooling and keep the pace of innovation to accommodate rapidly evolving AI systems. For more information, read the Tech Community blog.

Disaggregated power architectures for next-generation systems

The evolution of AI systems has also driven increased power densities in hyperscale datacenters. As these systems grow, we have uncovered new opportunities for flexibility and modularity in system design. While compute and storage systems for cloud typically have power densities below 20 kW, AI systems have driven power densities to hundreds of kW. We are meeting the increased power infrastructure demands of the age of AI with Mt. Diablo, our latest collaboration with Meta. This is a new disaggregated rack design that addresses critical space and power constraints. The solution features a disaggregated 400 VDC (high-voltage direct current) unit that scales from hundreds of kW up to 1 MW, enabling 15% to 35% more AI accelerators in each server rack. This modular approach allows for power adjustments in the disaggregated power rack to meet the changing demands of different inferencing and training SKUs. We are excited to continue our engineering collaboration with Meta on this contribution to the OCP community. Read the Tech Community blog to learn more.
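To make the scale of these figures concrete, the short Python sketch below works through two pieces of arithmetic implied by the paragraph above: the current a DC bus must carry to deliver a given power (I = P / V), and the effect of a 15% to 35% gain on accelerator count. The 48 V comparison bus and the baseline of 64 accelerators per rack are illustrative assumptions rather than Mt. Diablo figures, and the function names are hypothetical.

```python
def bus_current_amps(power_watts: float, bus_volts: float) -> float:
    """Current a DC bus must carry to deliver power_watts (I = P / V)."""
    return power_watts / bus_volts


def accelerators_after_gain(baseline: int, gain_percent: int) -> int:
    """Whole accelerators that fit after a percentage gain, rounded down."""
    return baseline * (100 + gain_percent) // 100


if __name__ == "__main__":
    one_megawatt = 1_000_000  # upper end of the range quoted in the text

    # 400 V is the bus voltage named in the text; 48 V is an assumed
    # lower-voltage bus included only for comparison.
    for volts in (48, 400):
        amps = bus_current_amps(one_megawatt, volts)
        print(f"{volts} V bus at 1 MW: {amps:,.0f} A")

    # 64 accelerators per rack is a hypothetical baseline, not a real SKU.
    for gain in (15, 35):
        print(f"+{gain}%: {accelerators_after_gain(64, gain)} accelerators")
```

Because current falls in proportion to voltage, delivering the same power at 400 V takes roughly one eighth of the current that the assumed 48 V bus would carry. This is one plausible reason a higher-voltage distribution becomes attractive as rack power approaches 1 MW.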

Advancing a secure AI future with new confidential computing solutions

Last month, Microsoft detailed our vision for Trustworthy AI and Azure Confidential Inferencing, where security is rooted in hardware-based Trusted Execution Environments (TEEs) and transparency of the Confidential Trust Boundary. Today, we expand on this vision with new open-source silicon innovation: the Adams Bridge quantum resilient accelerator and its integration into Caliptra 2.0, the next-generation open-source silicon root of trust (RoT).

The growing capabilities of quantum computers present challenges to hardware security, as the classical asymmetric cryptographic algorithms used pervasively throughout hardware security can be easily defeated by a sufficiently powerful quantum computer. Recognizing this risk, the National Institute of Standards and Technology (NIST) has published standards for new quantum resilient algorithms.

These new quantum resilient algorithms are significantly different from their classical counterparts. Hardware device manufacturers need to pay immediate attention to these changes as they impact foundational hardware security capabilities such as immutable root-of-trust anchors for both code integrity and hardware identity. Currently, the challenges facing silicon components are more significant than for software, due to longer development times and the immutability of hardware. Therefore, immediate action is needed for new hardware designs.
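One visible difference is size. The sketch below compares public key and signature sizes for a classical scheme with those of ML-DSA, the lattice-based signature standard in NIST FIPS 204. Choosing ML-DSA and ECDSA P-384 as the representative pair is an assumption made for illustration, since the text above does not name specific algorithms. The ML-DSA sizes are taken from FIPS 204, the ECDSA sizes assume raw (x || y) and (r || s) encodings, and the helper name is hypothetical.

```python
# Sizes in bytes. ECDSA P-384 uses raw encodings: x || y for the public key
# and r || s for the signature. ML-DSA sizes follow NIST FIPS 204.
SIZES = {
    "ECDSA P-384": {"public_key": 96, "signature": 96},
    "ML-DSA-65": {"public_key": 1952, "signature": 3309},
    "ML-DSA-87": {"public_key": 2592, "signature": 4627},
}


def growth_factor(classical: str, post_quantum: str, field: str) -> float:
    """How many times larger the post-quantum object is than the classical one."""
    return SIZES[post_quantum][field] / SIZES[classical][field]


if __name__ == "__main__":
    for name in ("ML-DSA-65", "ML-DSA-87"):
        for field in ("public_key", "signature"):
            factor = growth_factor("ECDSA P-384", name, field)
            print(f"{name} {field}: {SIZES[name][field]} bytes "
                  f"({factor:.1f}x ECDSA P-384)")
```

Growth of this order matters for hardware because root-of-trust anchors are often held in small, immutable storage such as fuses or ROM, so space budgeted for a classical key may not hold a quantum resilient one.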

As part of Microsoft’s commitment to our Secure Future Initiative (SFI), and to accelerate the adoption of quantum resilient algorithms, Microsoft and the Caliptra consortium are open-sourcing Adams Bridge, a new silicon block for accelerating quantum resilient cryptography. For more information about Adams Bridge, and how we make our future quantum safe, please visit the Tech Community blog.

In addition to Caliptra 2.0 and Adams Bridge, Microsoft is taking further steps to advance security in hardware supply chains with the OCP-SAFE (OCP Security Appraisal Framework and Enablement) initiative. Co-founded by Microsoft, OCP-SAFE calls for systematic and consistent security audits of hardware and firmware. Combined with Caliptra, OCP-SAFE advances transparency and security assurance on the path towards hardware Supply Chain Integrity, Transparency, and Trust (SCITT). Read the Tech Community blog for more information.

Bottlenecks to breakthroughs: Optimizations at every layer in the era of AI

For the past few years, Microsoft has been on a journey to expand our supercomputing scale, enabling individuals and organizations all over the world to reap the benefits of generative AI across domains, from education to healthcare to business and beyond. Along the way, we’ve continued to evolve and enhance our infrastructure, building some of the world’s largest supercomputers with our growing fleet of high-performance accelerators for AI workloads of all shapes and sizes. As we’ve encountered increasing demands for AI innovation, we’ve unlocked performance improvements and efficiencies through system-level optimizations, many of which have been contributed back to the open-source community.

Through the development of our own custom silicon and system with Azure Maia, we’ve invested in performance-per-watt efficiency through algorithmic codesign of hardware and software. To achieve this, we invested in low-precision math through an early implementation of the MX data format, a standard we contributed to OCP through our leadership of the Microscaling (MX) Alliance together with AMD, Arm, Intel, Meta, NVIDIA, and Qualcomm.
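The core idea behind microscaling is that a small block of values shares one power-of-two scale, so each element can be stored in very few bits. The Python sketch below is a simplified illustration of that idea using 8-bit integer elements and a block of 32 values, the block size used in the OCP MX specification. It is not the exact MX encoding (the specification defines its own element formats and scale representation), and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 32  # the MX formats share one scale across 32 elements


def quantize_block(values):
    """Return (shared_exponent, int8_elements) for one block of floats."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0] * len(values)
    # Smallest power-of-two scale that keeps every element within +/-127.
    exponent = math.ceil(math.log2(amax / 127))
    scale = 2.0 ** exponent
    elements = [max(-127, min(127, round(v / scale))) for v in values]
    return exponent, elements


def dequantize_block(exponent, elements):
    """Rebuild approximate floats from the shared exponent and elements."""
    scale = 2.0 ** exponent
    return [e * scale for e in elements]


def bits_per_element(block_size=BLOCK_SIZE, element_bits=8, scale_bits=8):
    """Average storage cost once the shared scale is amortized over the block."""
    return (block_size * element_bits + scale_bits) / block_size


if __name__ == "__main__":
    block = [i * 0.1 - 1.6 for i in range(BLOCK_SIZE)]
    exponent, elements = quantize_block(block)
    restored = dequantize_block(exponent, elements)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"shared exponent: {exponent}, worst-case error: {worst:.5f}")
    print(f"bits per element: {bits_per_element()} (vs 32 for FP32)")
```

Amortizing one 8-bit scale over 32 elements adds only a quarter of a bit per value, which is how block scaling keeps storage and bandwidth low while preserving dynamic range across blocks.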

Next, we tackled the challenge of scaling and wide deployment with our liquid-cooled server design. This innovation ensures that our datacenters worldwide can utilize this technology, and we are contributing the design to the industry to enable broader adoption.

Finally, we recognized that traditional Ethernet was not built for AI performance and scaling. By making significant contributions to the Ultra Ethernet Consortium (UEC), we have extended Ethernet into a fabric capable of delivering the necessary performance, scalability, and reliability for AI applications.

Through these efforts, Microsoft continues to drive innovation and contribute to the broader AI and datacenter community, ensuring that our advancements benefit the entire industry.

We welcome attendees of this year’s OCP Global Summit to visit Microsoft at booth #B35 to explore our latest cloud hardware demonstrations featuring contributions with partners in the OCP community.

Connect with Microsoft at the OCP Global Summit 2024 and beyond:


[1] Delivering consistency and transparency for cloud hardware security, Rani Borkar. October 18, 2022.

[2] Learn how Microsoft Azure is accelerating hardware innovations for a sustainable future, Zaid Kahn. November 9, 2021.

[3] Fostering AI infrastructure advancements through standardization, Rani Borkar and Reynold D’Sa. October 17, 2023.

The post Accelerating industry-wide innovations in datacenter infrastructure and security appeared first on Microsoft Azure Blog.