Tuesday, February 10, 2015

40G MORE is NOT always BETTER - Why 40G can potentially delay your packets?

40G might actually delay your packets and here is the reason why. 
Before we get too deep into the details, let's define some key terms used in latency measurement. I am also going to leave out the minor benefit gained from the "inter-frame gap" of 40G packets. If you are interested in further information regarding latency testing/benchmarking standards, please see RFC 2544.

Switching modes

Cut-Through Switching

Technically, a frame can be switched after the first 6 bytes have been received, since these contain the destination MAC. However, most switches (including Arista/Nexus) wait for the first 54 bytes (62 bytes if you count the 7-byte preamble and the 1-byte start-of-frame delimiter) before making the forwarding decision and transmitting the packet, since collisions/errors are usually detected within these first bytes. I've even heard/read that Cisco/Arista wait for the first 64 bytes (+8 if you count the preamble and SFD) before switching the packet. This somewhat resembles fragment-free switching.
FIFO (first bit in to first bit out) is the best way to measure the latency of a cut-through switch, since packets are switched almost as fast as they arrive. Keep in mind that LIFO (last bit in to first bit out) measurements will produce negative latency results for a cut-through switch, because the first bit leaves before the last bit has arrived. Perhaps we should measure LIFO latency on switches so we can race towards zero at a much faster rate.
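
To put the byte counts above into time, here is a rough sketch of how long a cut-through switch would wait at 10G and 40G before it could forward. The labels and the helper name serialization_ns are mine, and the only assumption is that receiving N bytes takes N * 8 bits divided by the line rate (1 Gbps = 1 bit per nanosecond).

```python
# Bytes a cut-through switch might wait for before forwarding (taken from the text above).
WAIT_POINTS = {
    "destination MAC only": 6,
    "first 54 bytes": 54,
    "54 bytes + preamble/SFD": 62,
    "first 64 bytes": 64,
    "64 bytes + preamble/SFD": 72,
}


def serialization_ns(num_bytes, speed_gbps):
    """Nanoseconds needed to receive num_bytes at speed_gbps (1 Gbps = 1 bit/ns)."""
    return num_bytes * 8 / speed_gbps


# Wait time per decision point, for 10G and 40G ingress ports.
wait_table = {
    label: {speed: serialization_ns(n, speed) for speed in (10, 40)}
    for label, n in WAIT_POINTS.items()
}

for label, by_speed in wait_table.items():
    print(f"{label:26s} 10G: {by_speed[10]:5.1f} ns   40G: {by_speed[40]:5.1f} ns")
```

Whichever decision point a vendor actually uses, the wait is tens of nanoseconds at most, which is why cut-through latency is dominated by the switch itself rather than by the frame size.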

Store-and-Forward Switching

This is when the entire packet is stored in a buffer before it is transmitted out.
LIFO would be the best way to measure the latency of store-and-forward switching, because the entire packet has to be stored before it can be switched. Note that when comparing the latency numbers of two different switching methods, we need to stick with the same measurement method so we are comparing apples to apples.
Otherwise, comparing LIFO (for store-and-forward) against FIFO (for cut-through) will show that the store-and-forward latencies are better, misleading one to believe that "Store-and-Forward" is actually faster than cut-through.
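
A small sketch of that apples-to-oranges trap, using illustrative figures (a 1250-byte frame, 10G ingress, 250 ns of switch latency, and a 54-byte cut-through wait). It assumes LIFO is simply FIFO minus the time the frame takes to arrive on the ingress port; the variable and function names are hypothetical.

```python
FRAME_BYTES = 1250
INGRESS_GBPS = 10
SWITCH_NS = 250.0   # assumed fixed switch latency
HEADER_BYTES = 54   # bytes a cut-through switch waits for (assumption from the text)


def serialization_ns(num_bytes, speed_gbps):
    """Nanoseconds needed to receive num_bytes at speed_gbps (1 Gbps = 1 bit/ns)."""
    return num_bytes * 8 / speed_gbps


# FIFO: first bit in -> first bit out.
fifo = {
    "cut-through": serialization_ns(HEADER_BYTES, INGRESS_GBPS) + SWITCH_NS,
    "store-and-forward": serialization_ns(FRAME_BYTES, INGRESS_GBPS) + SWITCH_NS,
}

# LIFO: last bit in -> first bit out, i.e. the clock starts one full
# frame-arrival time later than it does for FIFO.
lifo = {
    mode: ns - serialization_ns(FRAME_BYTES, INGRESS_GBPS)
    for mode, ns in fifo.items()
}

for mode in fifo:
    print(f"{mode:18s} FIFO {fifo[mode]:8.1f} ns   LIFO {lifo[mode]:8.1f} ns")
```

With these inputs the store-and-forward LIFO figure comes out lower than the cut-through FIFO figure, and the cut-through LIFO figure goes negative, even though cut-through is clearly faster when both are measured the same way.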

Why even discuss cut-through vs. store-and-forward? Aren't most switches cut-through these days?

This is true for the most part, except when there is a speed conversion (not a media conversion). When going from SLOW to FAST, the switch operates in store-and-forward mode to avoid bit gaps: the faster egress port would otherwise run out of bits to send partway through the frame. For example, when switching from ingress 10G to egress 40G, the packets are actually stored first and then switched. However, switching from ingress 40G to egress 10G is cut-through; we just lose out on the extra 30G.
So if we happen to get a 40G cross connect to another switch, we would actually be delaying the outbound traffic, because all of the server/switch uplinks are 10G; this would be ingress 10G and egress 40G. For example, let's measure the FIFO latency of a 1250-byte packet destined outbound @ 40G vs. 10G.
10G to 40G (store-and-forward):
Serialization delay of 1250 bytes @ 10G ~ 1000 ns (1 us)
Switch latency ~ 250 ns
FIFO latency of 1250 bytes, 10G to 40G ~ 1250 ns

10G to 10G (cut-through):
Serialization delay of the first 54 bytes @ 10G ~ 43.2 ns
Switch latency ~ 250 ns
FIFO latency of 1250 bytes, 10G to 10G ~ 293.2 ns

SRC Speed   DST Speed   Latency     Switching mode
10G         10G         293.2 ns    Cut-through
10G         40G         1250 ns     Store-and-Forward
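
The outbound arithmetic above can be sketched in a few lines. This is only a model of the reasoning in this post, not of any particular switch: it assumes a fixed 250 ns switch latency, a 54-byte cut-through wait, and the slow-to-fast rule described earlier. The function names are hypothetical.

```python
HEADER_BYTES = 54   # bytes inspected before a cut-through decision (from the text)
SWITCH_NS = 250.0   # assumed fixed switch latency (from the text)


def serialization_ns(num_bytes, speed_gbps):
    """Nanoseconds needed to receive num_bytes at speed_gbps (1 Gbps = 1 bit/ns)."""
    return num_bytes * 8 / speed_gbps


def switching_mode(ingress_gbps, egress_gbps):
    """Slow -> fast forces store-and-forward; everything else stays cut-through."""
    return "store-and-forward" if ingress_gbps < egress_gbps else "cut-through"


def fifo_latency_ns(frame_bytes, ingress_gbps, egress_gbps):
    """Return (mode, FIFO latency in ns) for one frame under the model above."""
    mode = switching_mode(ingress_gbps, egress_gbps)
    # Store-and-forward waits for the whole frame, cut-through only for the header.
    wait_bytes = frame_bytes if mode == "store-and-forward" else HEADER_BYTES
    return mode, serialization_ns(wait_bytes, ingress_gbps) + SWITCH_NS


for egress in (10, 40):
    mode, ns = fifo_latency_ns(1250, 10, egress)
    print(f"10G -> {egress}G: {ns:.1f} ns ({mode})")
```

Changing the frame size or the assumed switch latency is a quick way to see how the two outbound cases move relative to each other.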

However, traffic coming inbound to RTR1 will be slightly faster; this would be ingress 40G and egress 10G, which is cut-through. For example, let's measure the FIFO latency of a 1250-byte packet sourced from RTR2 to RTR1 @ 40G vs. 10G.

40G to 40G (cut-through):
Serialization delay of 54 bytes @ 40G = 10.8 ns
Switch latency ~ 250 ns
FIFO latency of 1250 bytes, 40G to 40G ~ 260.8 ns

40G to 10G (cut-through):
Serialization delay of 54 bytes @ 10G ~ 43.2 ns
Switch latency ~ 250 ns
FIFO latency of 1250 bytes, 40G to 10G ~ 293.2 ns

SRC Speed   DST Speed   Latency     Switching mode
40G         10G         293.2 ns    Cut-through
40G         40G         260.8 ns    Cut-through


This means that unless you are 40G end-to-end, getting a 40G cross connect would really end up delaying the packets outbound from RTR1.
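
To get a feel for how that outbound penalty scales, here is a hedged sketch that sweeps a few frame sizes. It assumes the same model as the numbers above (10G ingress, a 54-byte cut-through wait, and an identical fixed switch latency in both modes, so it cancels out); the frame sizes and the function name are illustrative choices of mine.

```python
HEADER_BYTES = 54  # bytes a cut-through switch waits for (assumption from the text)


def extra_delay_ns(frame_bytes, ingress_gbps=10):
    """Added FIFO delay of store-and-forward over cut-through for one frame."""
    # Store-and-forward waits for the whole frame; cut-through only for the
    # header, so the difference is the time to receive the remaining bytes.
    return max(frame_bytes - HEADER_BYTES, 0) * 8 / ingress_gbps


for size in (64, 512, 1250, 1500, 9000):
    print(f"{size:5d} bytes: +{extra_delay_ns(size):.1f} ns")
```

Under these assumptions the penalty grows linearly with frame size, so small frames barely notice the 10G to 40G hop while large ones pay most of their own serialization time.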
