Meta is showcasing its latest advancements in open artificial intelligence hardware at the Open Compute Project (OCP) Global Summit 2024. These innovations are aimed at fostering industry collaboration and accelerating the development of next-generation AI infrastructure.
Among the key announcements is Catalina, a new high-powered rack solution designed for demanding AI workloads. Built on the NVIDIA Blackwell platform, Catalina emphasizes modularity and flexibility and supports the latest NVIDIA GB200 Grace Blackwell Superchip. It features the Open Rack v3 (Orv3) high-power rack, capable of supporting up to 140kW, and a liquid-cooled design incorporating a compute tray, switch tray, Wedge 400 fabric switch, management switch, battery backup, and rack management controller. Meta intends for Catalina to be customizable to suit diverse AI workload requirements while aligning with industry standards.
The company also revealed an expansion of its Grand Teton AI platform to support AMD Instinct MI300X accelerators. Like its predecessors, this updated version of Grand Teton maintains a monolithic system design that integrates power, control, compute, and fabric interfaces for simplified deployment and scalability in large AI inference workloads. The platform offers increased compute capacity, expanded memory for larger models, and enhanced network bandwidth, and it accommodates a range of accelerator designs beyond the newly added AMD option.
In networking, Meta introduced its Disaggregated Scheduled Fabric (DSF) for next-generation AI clusters. This open and vendor-agnostic networking solution utilizes the OCP-SAI standard and Meta’s FBOSS network operating system. DSF aims to overcome limitations in scale, component supply, and power density by disaggregating the network fabric. It supports an Ethernet-based RoCE interface and is compatible with various GPUs and NICs from vendors including NVIDIA, Broadcom, and AMD. Alongside DSF, Meta has also developed new 51T fabric switches based on Broadcom and Cisco ASICs and unveiled FBNIC, a new NIC module incorporating Meta’s first in-house designed network ASIC.
Furthermore, Meta highlighted its ongoing partnership with Microsoft within the OCP, particularly their joint work on Mount Diablo, a disaggregated power rack featuring a scalable 400 VDC unit. The design improves efficiency and scalability, allowing a higher density of AI accelerators per rack.
Meta underscored its commitment to open source AI, emphasizing that open hardware solutions are crucial for creating high-performance, cost-effective, and adaptable infrastructure necessary for continued progress in the field. The company encouraged participation within the OCP community to collaboratively address the infrastructure demands of AI and unlock its full potential.