Handling Congestion and Routing Failures in Data Center Networking
Cox, Alan L.
Doctor of Philosophy
Today's data center networks are made of highly reliable components. Nonetheless, given the current scale of data center networks and the bursty traffic patterns of data center applications, at any given point in time, it is likely that the network is experiencing either a routing failure or a congestion failure. This thesis introduces new solutions to each of these problems individually and the first combined solutions to these problems for data center networks. To solve routing failures, which can lead to both packet loss and a loss of connectivity, this thesis proposes a new approach to local fast failover, which allows for traffic to be quickly rerouted. Because forwarding table state limits both the fault tolerance and the largest network size that is implementable given local fast failover, this thesis introduces both a new forwarding table compression algorithm and Plinko, a compressible forwarding model. Combined, these contributions enable forwarding tables that contain routes for all pairs of hosts that can reroute traffic even given multiple arbitrary link failures on topologies with tens of thousands of hosts. To solve congestion failures, this thesis presents TCP-Bolt, which uses lossless Ethernet to prevent packets from ever being dropped. Unlike prior work, this thesis demonstrates that enabling lossless Ethernet does not reduce aggregate forwarding throughput in data center networks. Further, this thesis also demonstrates that TCP-Bolt can significantly reduce flow completion times for medium sized flows by allowing for TCP slow-start to be eliminated. Unfortunately, using lossless Ethernet to solve congestion failures introduces a new failure mode, deadlock, which can render the entire network unusable. No existing fault tolerant forwarding models are deadlock-free, so this thesis introduces both deadlock-free Plinko and deadlock-free edge disjoint spanning tree (DF-EDST) resilience, the first deadlock-free fault tolerant forwarding models for data center networks. This thesis shows that deadlock-free Plinko does not impact forwarding throughput, although the number of virtual channels required by deadlock-free Plinko increases as either topology size or fault tolerance increases. On the other hand, this thesis demonstrates that DF-EDST provides deadlock-free local fast failover without needing virtual channels. This thesis shows that, with DF-EDST resilience, less than one in a million of the flows in data center networks with thousands of hosts are expected to fail even given tens of failures. Further, this thesis shows that doing so incurs only a small impact on the maximal achievable aggregate throughput of the network, which is acceptable given the overall decrease in flow completion times achieved by enabling lossless forwarding.