
Applications of Matrix Optical Switches

Uploaded: 2023-10-16

Jupiter evolving transforming google's datecenter network via optical circuit switches and software-defined networking(1).pdf

Lightwave Fabrics At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems  - Page 17 to 33(1).pdf

ABSTRACT

   We present a decade of evolution and production experience with Jupiter datacenter network fabrics. In this period Jupiter has delivered 5x higher speed and capacity, a 30% reduction in capex, a 41% reduction in power, and incremental deployment and technology refresh, all while serving live production traffic. A key enabler for these improvements is evolving Jupiter from a Clos to a direct-connect topology among the machine aggregation blocks. Critical architectural changes for this include: a datacenter interconnection layer employing Micro-ElectroMechanical Systems (MEMS) based Optical Circuit Switches (OCSes) to enable dynamic topology reconfiguration, centralized Software-Defined Networking (SDN) control for traffic engineering, and automated network operations for incremental capacity delivery and topology engineering. We show that the combination of traffic and topology engineering on direct-connect fabrics achieves throughput similar to that of Clos fabrics for our production traffic patterns. We also optimize for path lengths: 60% of the traffic takes the direct path from source to destination aggregation blocks, while the remainder transits one additional block, achieving an average block-level path length of 1.4 in our fleet today. OCS also achieves 3x faster fabric reconfiguration compared to pre-evolution Clos fabrics that used a patch-panel-based interconnect.
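
   The average block-level path length quoted in the abstract follows from a traffic-weighted mean. The sketch below is only an illustration of that arithmetic: it assumes a direct path counts as 1 block-level hop and a path through one transit block counts as 2, and the function name `average_path_length` is hypothetical rather than taken from the paper.

```python
def average_path_length(direct_fraction: float,
                        direct_hops: int = 1,
                        transit_hops: int = 2) -> float:
    """Traffic-weighted mean number of block-level hops.

    Assumption: a direct path is 1 hop and a path through one
    transit aggregation block is 2 hops.
    """
    if not 0.0 <= direct_fraction <= 1.0:
        raise ValueError("direct_fraction must be between 0 and 1")
    transit_fraction = 1.0 - direct_fraction
    return direct_fraction * direct_hops + transit_fraction * transit_hops


# 60% of traffic goes direct, the remaining 40% transits one extra block.
avg = average_path_length(0.60)
print(f"average block-level path length: {avg:.1f}")
```

   With 60% direct traffic the weighted mean is 0.6 x 1 + 0.4 x 2 = 1.4, matching the figure reported for the fleet.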

   

KEYWORDS

Datacenter network, Software-defined networking, Traffic engineering, Topology engineering, Optical circuit switches. 


ACM Reference Format: 

Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, Amin Vahdat (Google, sigcomm-jupiter-evolving@google.com). 2022. Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined Networking. In Proceedings of ACM Conference (SIGCOMM’22). ACM, New York, NY, USA, 20 pages.

https://doi.org/10.1145/3544216.3544265


INTRODUCTION

   Software-Defined Networking and Clos topologies [1, 2, 14, 24, 33] built with merchant silicon have enabled cost-effective, reliable building-scale datacenter networks as the basis for Cloud infrastructure. A range of networked services, machine learning workloads, and storage infrastructure leverage uniform, high-bandwidth connectivity among tens of thousands of servers to great effect.

   While there has been tremendous progress, managing the heterogeneity and incremental evolution of a building-scale network has received comparatively little attention. Cloud infrastructure grows incrementally, often one rack or even one server at a time. Hence, filling an initially empty building takes months to years. Once initially full, the infrastructure evolves incrementally, again often one rack at a time with the latest generation of server hardware. Typically there is no advance blueprint for the types of servers, storage, accelerators, or services that will move in or out over the lifetime of the network. The realities of exponential growth and changing business requirements mean that the best-laid plans quickly become outdated and inefficient, making incremental and adaptive evolution a necessity.

   Incremental refresh of compute and storage infrastructure is relatively straightforward: drain perhaps one rack’s worth of capacity among hundreds or thousands in a datacenter and replace it with a newer generation of hardware. Incremental refresh of the network infrastructure is more challenging, as Clos fabrics require pre-building at least the spine layer for the entire network. Doing so unfortunately restricts the available datacenter bandwidth to the speed of the network technology available at the time of spine deployment.

   Consider a generic 3-tier Clos network comprising machine racks with top-of-the-rack switches (ToRs), aggregation blocks connecting the racks, and spine blocks connecting the aggregation blocks (Fig. 1). A traditional approach to Clos requires pre-building the spine at maximum scale (e.g., 64 aggregation blocks with Jupiter [33]) using the technology of the day. With 40Gbps technology, each spine would support 20Tbps burst bandwidth. As the next generation of 100Gbps becomes available, the newer aggregation blocks can support 51.2Tbps of burst bandwidth; however, these blocks would be limited to the 40Gbps link speed of the pre-existing spine blocks, reducing the capacity to 20Tbps per aggregation block. Ultimately, individual server and storage capacity would be derated because of insufficient datacenter network bandwidth. Increasing compute power without a corresponding increase in network bandwidth leads to system imbalance and stranding of expensive server capacity. Unfortunately, the nature of Clos topologies is such that incremental refresh of the spine results in only incremental improvement in the capacity of new-generation aggregation blocks. Refreshing the entire building-scale spine is also undesirable, as it would be expensive, time consuming, and operationally disruptive given the need for fabric-wide rewiring.
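
   The derating in this example can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the count of 512 uplinks per aggregation block is an assumption inferred from 51.2Tbps / 100Gbps (it is not stated in the text above), and the function names are hypothetical. It models each uplink as running at the slower of the block's and the spine's link speed.

```python
def burst_bandwidth_tbps(uplinks: int, link_gbps: float) -> float:
    """Aggregate burst bandwidth of a block, in Tbps."""
    return uplinks * link_gbps / 1000.0


def clos_block_capacity_tbps(uplinks: int,
                             block_gbps: float,
                             spine_gbps: float) -> float:
    """Capacity of an aggregation block attached to a pre-built spine.

    Each uplink runs at the slower of the two ends, so an older
    spine caps the speed of a newer block.
    """
    effective_gbps = min(block_gbps, spine_gbps)
    return burst_bandwidth_tbps(uplinks, effective_gbps)


# Assumption: 512 uplinks per block, inferred from 51.2 Tbps / 100 Gbps.
UPLINKS = 512

native = burst_bandwidth_tbps(UPLINKS, 100)            # 100G block, unconstrained
derated = clos_block_capacity_tbps(UPLINKS, 100, 40)   # same block behind a 40G spine
stranded_fraction = 1.0 - derated / native

print(f"native:   {native:.2f} Tbps")
print(f"derated:  {derated:.2f} Tbps")
print(f"stranded: {stranded_fraction:.0%}")
```

   Under these assumptions the 100Gbps block drops from 51.2Tbps to about 20Tbps, so roughly 60% of its potential burst bandwidth is stranded by the 40Gbps spine.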

   We present a new end-to-end design that incorporates Optical Circuit Switches (OCSes) [31] to move Jupiter from a Clos to a block-level direct-connect topology that eliminates the spine switching layer and its associated challenges altogether, and enables Jupiter to incrementally incorporate 40Gbps, 100Gbps, 200Gbps, and beyond network speeds. The direct-connect architecture is coupled with network management, traffic engineering, and topology engineering techniques that allow Jupiter to cope with traffic uncertainty and substantial fabric heterogeneity, and to evolve without requiring any downtime or service drains. Along with 5x higher speed, capacity, and additional flexibility relative to the static Clos fabrics, these changes have enabled an architectural and incremental 30% reduction in cost and a 41% reduction in power. This work does not raise any ethical issues.