TCP/IP Illustrated, Vol. 1

1: Introduction

  • First edition, 1994

  • During the 1990s we have come to realize that this new, bigger island consisting of a single network doesn’t make sense either. People are combining multiple networks together into an internetwork, or an internet. An internet is a collection of networks that all use the same protocol suite.

  • The application and transport layers are end-to-end in that they’re untouched by intermediate systems. The network layer and below are the opposite: they are hop-by-hop, and every router processes (and may rewrite) headers at those layers.

  • Bridges connect networks at the link layer, routers connect networks at the network layer

  • IP (v4) address classes:
    • A: 0.0.0.0 -> 127.255.255.255
    • B: 128.0.0.0 -> 191.255.255.255
    • C: 192.0.0.0 -> 223.255.255.255
    • D: 224.0.0.0 -> 239.255.255.255
    • E: 240.0.0.0 -> 255.255.255.255
  • Three types of IP addrs: unicast, broadcast, multicast (anycast?)

  • Packet encapsulation

  • Headers typically include a field denoting the (higher-level) protocol that their payload carries

  • TCP servers are typically concurrent, UDP servers are typically iterative, because it rarely makes sense to use concurrent “connections” for a connectionless protocol

  • Well known port numbers are listed in /etc/services

2: Link Layer

  • Hardware (MAC) addresses are typically 48 bits long
  • The ARP/RARP protocols map between hardware and IP addresses
  • The loopback interface allows a client and server on the same host to communicate with each other using TCP/IP. Most implementations don’t short-circuit the TCP layer when going over the loopback interface: IP packets are prepared and sent out, but no lower layers are involved.
    • Loopback has the range 127.0.0.0/8 reserved for it, with localhost assigned to 127.0.0.1 by convention
    • Datagrams that are broadcast or multicast are both sent out over Ethernet (etc.) and copied to the loopback interface
  • The network layer (IP) fragments data into multiple packets because lower levels impose a hard limit (MTU) on frame size
    • Different networks can have different MTUs, so fragmentation can occur at any hop, not just at the source (whenever a router forwards a packet onto a link whose MTU is smaller than the packet, it may have to fragment it)
    • Fragments are reassembled only at the final destination, never by intermediate routers. There are no ordering guarantees between fragments; the receiver uses the identification and fragment offset fields in the IP header to put them back together

3: IP: Internet Protocol

  • Connectionless, unreliable datagram service

  • No ordering guarantees

  • IP header: [figure omitted]

  • Big endian ordering regardless of the endianness of the machines involved (network byte order)

  • The TOS field allowed specifying one of these optimizations: minimize delay, maximize throughput, maximize reliability, minimize cost

    • But is now (as of 1998) deprecated; RFC 2474 redefined the field as the Differentiated Services (DS) field
  • TTL sets an upper limit on the number of hops the packet can take

    • When this hits zero the packet is discarded and the sender is sent an ICMP “time exceeded” message
    • Which is how mtr/traceroute works: probes are sent with TTL 1, 2, 3, … so that each router along the path reports itself in turn
  • Header checksum only, payloads must have their own checksum
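The header checksum is the 16-bit one’s-complement sum used throughout the TCP/IP suite. The sketch below is a minimal version of that algorithm (as described in RFC 1071); the header bytes in main are sample values chosen for illustration, not taken from the book.

```go
package main

import "fmt"

// checksum computes the 16-bit one's-complement Internet checksum over b.
func checksum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(b[i])<<8 | uint32(b[i+1]) // add each big-endian 16-bit word
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8 // pad an odd trailing byte with zero
	}
	for sum>>16 != 0 {
		sum = sum&0xffff + sum>>16 // fold carries back into the low 16 bits
	}
	return ^uint16(sum)
}

func main() {
	// Sample 20-byte IPv4 header with the checksum field (bytes 10-11) zeroed.
	hdr := []byte{
		0x45, 0x00, 0x00, 0x73, 0x00, 0x00, 0x40, 0x00,
		0x40, 0x11, 0x00, 0x00, 0xc0, 0xa8, 0x00, 0x01,
		0xc0, 0xa8, 0x00, 0xc7,
	}
	sum := checksum(hdr)
	hdr[10], hdr[11] = byte(sum>>8), byte(sum)
	// A receiver recomputes the checksum over the whole header; zero means OK.
	fmt.Printf("checksum=%#x verify=%#x\n", sum, checksum(hdr))
}
```

Because every router decrements the TTL, each hop has to recompute this value, which is one reason it covers only the header.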

  • Routing

    • When a node receives an IP packet whose destination address doesn’t match one of its own, it can be configured to route the packet onwards using a routing table
    • Entries in the routing table map a given destination address (for a single node [/32] or a network) to the IP address of a “next-hop router”
    • Packets are forwarded to the next hop router, which requires rewriting all the headers lower down the stack than IP (a different MAC address for example)
    • Routers typically have multiple NICs and forward packets from one to another
    • The ability to specify a route to a network, and not have to specify a route to every host, is another fundamental feature of IP routing. Doing this allows the routers on the Internet, for example, to have a routing table with thousands of entries, instead of a routing table with more than one million entries.
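As a rough illustration of the lookup described above, the sketch below picks the most specific matching entry (host routes beat network routes, which beat the default). The table contents and the names route, exampleTable and lookup are invented for this example; the longest-prefix rule is the usual convention rather than something spelled out in these notes.

```go
package main

import (
	"fmt"
	"net/netip"
)

type route struct {
	dst     netip.Prefix // destination host (/32) or network
	nextHop string       // next-hop router; "" means directly connected
}

// exampleTable is a made-up routing table for a small network.
var exampleTable = []route{
	{netip.MustParsePrefix("0.0.0.0/0"), "192.168.1.1"},     // default route
	{netip.MustParsePrefix("192.168.1.0/24"), ""},           // local network
	{netip.MustParsePrefix("10.0.0.5/32"), "192.168.1.254"}, // host route
}

// lookup returns the matching route with the longest prefix, if any.
func lookup(table []route, dst string) (route, bool) {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return route{}, false
	}
	var best route
	found := false
	for _, r := range table {
		if r.dst.Contains(addr) && (!found || r.dst.Bits() > best.dst.Bits()) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	for _, d := range []string{"192.168.1.16", "10.0.0.5", "8.8.8.8"} {
		if r, ok := lookup(exampleTable, d); ok {
			fmt.Printf("%-13s matches %-15s next hop %q\n", d, r.dst, r.nextHop)
		}
	}
}
```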

  • Subnets

    • Reserve portions of the IP address for sub-networks that are transparent externally
    • As opposed to having each of these sub-networks advertised to the internet individually
    • Subnet masks determine (for a given host) how many bits are used for the network/subnet ID and how many for the host ID
    • The book divides an IP address up into “(network ID, subnet ID, host ID)”, but this classful view is outdated (CIDR replaced address classes in 1993)
    • Subnets are typically /24, but other prefix lengths (longer or shorter) are possible
      • The example in the book splits 140.252.13.0/24 into two subnets that share the 140.252.13 prefix:
        • 140.252.13.32/27
        • 140.252.13.64/27
      • And these are all the possible /27 subnets for this network: 140.252.13.0, .32, .64, .96, .128, .160, .192 and .224
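The subnet arithmetic above can be checked mechanically. This is a small sketch (IPv4 only) that enumerates the subnets of a network for a given prefix length and shows which subnet a host falls into; the helper name subnets and the host address 140.252.13.33 are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/netip"
)

// subnets splits an IPv4 network into consecutive subnets of the given
// (longer) prefix length. It assumes bits is between the parent's prefix
// length and 32.
func subnets(network string, bits int) []netip.Prefix {
	parent := netip.MustParsePrefix(network).Masked()
	a := parent.Addr().As4()
	base := uint32(a[0])<<24 | uint32(a[1])<<16 | uint32(a[2])<<8 | uint32(a[3])
	size := uint32(1) << (32 - bits)     // addresses per subnet
	count := 1 << (bits - parent.Bits()) // number of subnets
	out := make([]netip.Prefix, 0, count)
	for i := 0; i < count; i++ {
		n := base + uint32(i)*size
		addr := netip.AddrFrom4([4]byte{byte(n >> 24), byte(n >> 16), byte(n >> 8), byte(n)})
		out = append(out, netip.PrefixFrom(addr, bits))
	}
	return out
}

func main() {
	for _, p := range subnets("140.252.13.0/24", 27) {
		fmt.Println(p)
	}
	// Applying the mask to a host address yields its subnet.
	host := netip.MustParsePrefix("140.252.13.33/27")
	fmt.Println(host.Addr(), "is in", host.Masked())
}
```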

4: ARP: Address Resolution Protocol

  • Provides a mapping between hardware addresses and IP addresses
  • ARP uses a broadcast mechanism to query for the hardware address of a given IP address
  • Broadcasts use a special hardware address: all ones
  • Mappings are cached on each host (arp -e)
❯ sudo tshark -i enp0s10 -f "arp"
    1 0.000000000 e2:9b:eb:e0:63:1a → Broadcast    ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
    2 1.027814907 e2:9b:eb:e0:63:1a → Broadcast    ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
    3 2.051482603 e2:9b:eb:e0:63:1a → Broadcast    ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
    4 3.075296457 e2:9b:eb:e0:63:1a → Broadcast    ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
    5 4.099433088 e2:9b:eb:e0:63:1a → Broadcast    ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
    6 5.124312772 e2:9b:eb:e0:63:1a → Broadcast    ARP 42 Who has 192.168.1.16? Tell 192.168.1.26
    7 11.909737922 e2:9b:eb:e0:63:1a → Broadcast    ARP 42 Who has 192.168.1.5? Tell 192.168.1.26
    8 11.916431821 Netgear_ff:e1:2d → e2:9b:eb:e0:63:1a ARP 60 192.168.1.5 is at 08:36:c9:ff:e1:2d
    9 17.024220603 Netgear_ff:e1:2d → e2:9b:eb:e0:63:1a ARP 60 Who has 192.168.1.26? Tell 192.168.1.5
   10 17.024277268 e2:9b:eb:e0:63:1a → Netgear_ff:e1:2d ARP 42 192.168.1.26 is at e2:9b:eb:e0:63:1a
   11 33.715456792 Netgear_ff:e1:2d → Broadcast    ARP 60 Who has 192.168.1.1? Tell 192.168.1.5
  • A node can send an ARP request for its own IP to see if any other node on the network is using that IP

5: RARP: Reverse Address Resolution Protocol

  • Get the IP address for a given hardware address
  • Usually used by a host to figure out its own IP when booting up
  • Supplanted by bootp and later DHCP

6: ICMP: Internet Control Message Protocol

  • Communicates error & query messages
  • ICMP error payloads contain the header of the IP packet that generated the error, as well as the first 8 bytes of its payload
  • ICMP can be used for timestamp queries (predating NTP?)

9: IP Routing

  • Routing tables map host/network IDs to gateways (and the interfaces those gateways can be reached on), either specifically or via defaults
❯ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         192.168.1.1     0.0.0.0         UG    0      0        0 enp0s10
10.0.0.0        0.0.0.0         255.255.255.0   U     0      0        0 tun_tcp
192.168.1.0     0.0.0.0         255.255.255.0   U     0      0        0 enp0s10
  • Flags: U->UP, G->Gateway (if unset, the destination is directly connected), H->Host (destination is not a network)

  • The book has 127.0.0.0/8 in its netstat -r output, but this doesn’t show up in the main routing table on Linux

    • Linux uses multiple routing tables:

      ❯ ip rule
      0:	from all lookup local
      32766:	from all lookup main
      32767:	from all lookup default
      
    • And the local table looks like:

    ❯ ip route list table local
    broadcast 10.0.0.0 dev tun_tcp proto kernel scope link src 10.0.0.1
    local 10.0.0.1 dev tun_tcp proto kernel scope host src 10.0.0.1
    broadcast 10.0.0.255 dev tun_tcp proto kernel scope link src 10.0.0.1
    broadcast 127.0.0.0 dev lo proto kernel scope link src 127.0.0.1
    local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
    local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
    broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1
    broadcast 192.168.1.0 dev enp0s10 proto kernel scope link src 192.168.1.26
    local 192.168.1.26 dev enp0s10 proto kernel scope host src 192.168.1.26
    
  • When a route can’t be found on the same machine that the datagram originated from, a routing error is returned directly to the calling application

  • ICMP “host unreachable” error is sent for routing errors on intermediate machines/routers

  • Hosts typically ignore rather than forward packets whose final destination is a different host. Linux can be configured to act as a router with the net.ipv4.ip_forward sysctl

  • ICMP redirect

    • Possibly a rudimentary form of dynamic routing
    • When a router receives a packet and the routing decision it makes sends it back out on the same interface
    • That’s a clue that the router that sent the packet can skip the current router entirely
    • So say A sends a packet to B and B forwards it to C, but B both receives and forwards the packet on the same interface
    • In which case it’s likely that A and C are directly connected and B is an unnecessary hop
    • Here B would send an ICMP redirect message to A telling it about C
  • ICMP can also be used to discover local routers

    • Either by broadcasting a router solicitation message
    • Or waiting for a periodic broadcast from the router (router advertisement)

10: Dynamic Routing Protocols

  • Dynamic routing: routers broadcast routes to adjacent routers using a routing protocol
  • The Internet is organized into a collection of autonomous systems (ASs), each of which is normally administered by a single entity.
  • Dynamic routing within an AS uses an “interior gateway protocol” like RIP or OSPF
  • Routing between routers in different ASs uses an “exterior gateway protocol” like EGP or BGP
  • RIP: routing information protocol
    • Layered over UDP, allows sending up to 25 routes in a single message
    • Each advertised route carries a hop count (metric) that’s incremented at every propagation
    • Each router chooses whether or not to apply a route based on the hop count
    • RIP v1 has no notion of subnetting, only network ID or host ID
    • RIP v2 adds a subnet mask for each route, plus a route tag field that can carry an AS number for routes learned from EGP/BGP
  • OSPF: open shortest path first
    • Each router tests the state of its links to all adjacent nodes
    • And sends this info for each node to all other adjacent nodes
    • Stabilizes faster than RIP after a partition or a device going down
    • “State of a link” is a cost model based on any dimension, like throughput, RTT, reliability, etc.
    • Cost-based load balancing when multiple valid routes exist
  • BGP: border gateway protocol
    • Two systems running BGP establish a TCP connection and exchange their entire routing tables, after which the connection stays open for incremental updates
    • Once this happens between many ASes each one has built up a graph of AS connectivity, not just the very next hop
    • Three types of ASes:
      • Stub: connected to only one other AS, and only carries traffic destined for nodes in the AS
      • Multihomed: connected to more than one other AS, but only carries traffic destined for nodes in the AS
      • Transit: connected to more than one AS, and carries both local and transit traffic
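To make the RIP hop-count rule from earlier in this chapter concrete, here is a minimal distance-vector update sketch. The types and router names are hypothetical; the metric of 16 meaning “unreachable” is RIP’s convention, and real implementations also handle timers, split horizon and so on, which are left out here.

```go
package main

import "fmt"

const infinity = 16 // RIP treats a metric of 16 as unreachable

type entry struct {
	nextHop string
	metric  int
}

// applyUpdate merges the routes advertised by neighbor into table and
// reports whether anything changed.
func applyUpdate(table map[string]entry, neighbor string, adv map[string]int) bool {
	changed := false
	for dst, m := range adv {
		m++ // one extra hop to go through the neighbor
		if m > infinity {
			m = infinity
		}
		cur, ok := table[dst]
		switch {
		case !ok && m < infinity: // new reachable destination
		case ok && m < cur.metric: // shorter path
		case ok && cur.nextHop == neighbor && m != cur.metric: // news from current next hop
		default:
			continue
		}
		table[dst] = entry{neighbor, m}
		changed = true
	}
	return changed
}

func main() {
	table := map[string]entry{"10.1.0.0": {"routerB", 3}}
	applyUpdate(table, "routerC", map[string]int{"10.1.0.0": 1, "10.2.0.0": 4})
	fmt.Println(table)
}
```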

11: UDP: User Datagram Protocol

  • Because IP headers have a protocol field, UDP and TCP port numbers each occupy entirely different namespaces, even on Linux.
  • Checksum covers payload, unlike IP, and includes a pseudo-header with fields from the IP header
  • UDP + IP fragmentation could be problematic. If one of the fragmented packets is lost, the receiver can’t reassemble the original packet unless the host retransmits the entire thing
  • Path MTU: the smallest MTU in the entire routed path to the target
    • If the IP header says “don’t fragment” but an intermediate router needs to fragment because the packet is larger than the MTU of the outgoing link, it drops the packet and sends an ICMP “fragmentation required” error
    • Can use this ICMP error (like traceroute) to determine the path MTU by choosing a large value and decrementing until the error no longer shows up

12: Broadcasting & Multicasting

  • Doesn’t make sense for TCP, only connectionless protocols
  • NICs typically see every ethernet frame but only receive ones that have the right address (host MAC or the broadcast address)
  • NICs can be placed in promiscuous mode where they receive all frames

Broadcast

  • Broadcast IPs
    • Limited broadcast IP: 255.255.255.255
      • Never forwarded by a router, only local
    • Net-directed broadcast / all-subnets-directed broadcast: <net_id>.255.255.255
    • Subnet-directed broadcast: subnet ID followed by all ones
  • This works with things like ping for example:
    ❯ ping -b 255.255.255.255
    WARNING: pinging broadcast address
    PING 255.255.255.255 (255.255.255.255) 56(84) bytes of data.
    64 bytes from 192.168.1.3: icmp_seq=1 ttl=64 time=1.11 ms
    64 bytes from 192.168.1.15: icmp_seq=1 ttl=64 time=15.4 ms
    64 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=299 ms
    64 bytes from 192.168.1.18: icmp_seq=1 ttl=64 time=299 ms
    64 bytes from 192.168.1.8: icmp_seq=1 ttl=64 time=299 ms
    64 bytes from 192.168.1.3: icmp_seq=2 ttl=64 time=1.35 ms
    64 bytes from 192.168.1.15: icmp_seq=2 ttl=64 time=9.49 ms
    64 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=191 ms
    64 bytes from 192.168.1.18: icmp_seq=2 ttl=64 time=194 ms
    64 bytes from 192.168.1.8: icmp_seq=2 ttl=64 time=194 ms
    64 bytes from 192.168.1.2: icmp_seq=2 ttl=64 time=522 ms
    64 bytes from 192.168.1.3: icmp_seq=3 ttl=64 time=1.05 ms
    64 bytes from 192.168.1.15: icmp_seq=3 ttl=64 time=21.6 ms
    64 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=112 ms
    64 bytes from 192.168.1.18: icmp_seq=3 ttl=64 time=114 ms
    64 bytes from 192.168.1.8: icmp_seq=3 ttl=64 time=114 ms
    64 bytes from 192.168.1.2: icmp_seq=3 ttl=64 time=223 ms
    ^C
    --- 255.255.255.255 ping statistics ---
    3 packets transmitted, 3 received, +14 duplicates, 0% packet loss, time 2008ms
    rtt min/avg/max/mdev = 1.054/153.604/522.394/141.550 ms
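The subnet-directed broadcast address listed above (subnet ID followed by all ones) can be derived from a prefix. A minimal IPv4-only sketch, with directedBroadcast as a made-up helper name:

```go
package main

import (
	"fmt"
	"net/netip"
)

// directedBroadcast returns the network bits of an IPv4 prefix followed by
// all ones in the host bits.
func directedBroadcast(cidr string) netip.Addr {
	p := netip.MustParsePrefix(cidr).Masked()
	a := p.Addr().As4()
	hostBits := 32 - p.Bits()
	// Set host bits to one, starting from the last byte.
	for i := 3; i >= 0 && hostBits > 0; i-- {
		n := hostBits
		if n > 8 {
			n = 8
		}
		a[i] |= byte(0xff >> (8 - n))
		hostBits -= n
	}
	return netip.AddrFrom4(a)
}

func main() {
	for _, c := range []string{"192.168.1.26/24", "140.252.13.32/27", "10.0.0.0/8"} {
		fmt.Printf("%-18s -> %s\n", c, directedBroadcast(c))
	}
}
```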
    

Multicast

  • Hosts can elect to be a part of host groups (via IGMP).
    • IP packets (and associated Ethernet frames) can be targeted to a specific group.
    • NICs will receive all packets targeted to their group(s)
  • A multicast group address is the combination of the high-order 4 bits of 1110 and the multicast group ID. These are normally written as dotted-decimal numbers and are in the range 224.0.0.0 through 239.255.255.255.
  • Some multicast addresses are well-known, such as:

    For example, 224.0.0.1 means “all systems on this subnet,” and 224.0.0.2 means “all routers on this subnet.” The multicast address 224.0.1.1 is for NTP, the Network Time Protocol, 224.0.0.9 is for RIP-2 (Section 10.5), and 224.0.1.2 is for SGI’s (Silicon Graphics) dogfight application.

13: IGMP: Internet Group Management Protocol

  • Hosts send IGMP messages saying they’re either joining or leaving a given multicast group
  • Multicast routers store this mapping and refresh it periodically by sending IGMP queries
  • Routers only need to know whether a given group has at least one active host
  • When the router receives an IP packet destined for the group, it multicasts it into the local network if a receiver exists for it

14: DNS

  • Hierarchical namespace

  • Fully qualified domain names (FQDNs) end with a period

  • A zone is a subtree that’s managed separately

  • Each domain at the second level and below can have authoritative name servers that cover that zone

  • There’s a set of root name servers with known IPs that know the authoritative name servers for the top-level domains (TLDs)

  • A DNS packet contains a variable number of: question, answer, authority, and additional fields

  • Reverse DNS Queries

    • To look up the domain for an IP, use the pseudo domain in-addr.arpa
    • Specifically for IP A.B.C.D, look for a PTR record on the domain D.C.B.A.in-addr.arpa., which should give you the domain(s) that point to A.B.C.D
      ❯ dig news.ycombinator.com
      news.ycombinator.com.	1	IN	A	50.112.136.166
      
      ❯ dig -t PTR 166.136.112.50.in-addr.arpa.
      166.136.112.50.in-addr.arpa. 191 IN	PTR	ec2-50-112-136-166.us-west-2.compute.amazonaws.com.
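Building the reversed name is simple string manipulation. A small sketch (IPv4 only; reverseName is a hypothetical helper):

```go
package main

import (
	"fmt"
	"net/netip"
)

// reverseName builds the in-addr.arpa name that holds the PTR record for an
// IPv4 address: the four octets in reverse order, then the pseudo domain.
func reverseName(ip string) (string, error) {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return "", err
	}
	if !addr.Is4() {
		return "", fmt.Errorf("%s is not an IPv4 address", ip)
	}
	a := addr.As4()
	return fmt.Sprintf("%d.%d.%d.%d.in-addr.arpa.", a[3], a[2], a[1], a[0]), nil
}

func main() {
	name, err := reverseName("50.112.136.166")
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // 166.136.112.50.in-addr.arpa.
}
```

In practice Go’s net.LookupAddr performs the whole PTR lookup, so this helper is only useful for seeing what name gets queried.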
      
  • Record types

    • A: defines an IP address
    • PTR: for reverse (IP->name) queries
    • CNAME: “canonical name”, for aliasing
    • MX: mail exchange
    • NS: “name server”; specify the authoritative name server for a domain
    • SOA: “start of authority”, used to designate the primary name server and administrator responsible for a zone; the presence of this record indicates the root of a zone
  • The DNS protocol includes a “truncated” flag for large responses; if this happens the client is to redo the query using TCP

  • DNS lookup flow (for example: news.ycombinator.com)

    • Stub resolver (on a client machine) forwards query to recursive resolver

    • Recursive resolver looks up (or knows) the root nameservers, and then:

    • Sends query com. to a root nameserver and receives authoritative nameservers for com.

    • Sends query ycombinator.com. to a com. nameserver and receives authoritative nameservers for ycombinator.com.

    • Sends query news.ycombinator.com. to a ycombinator.com. nameserver and receives an A record

    • A basic resolver is actually pretty simple to write:

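      	// Assumes dns is github.com/miekg/dns and that name holds the fully
      	// qualified name to resolve; both are set up outside this excerpt.
      	nameserver := "198.41.0.4" // a.root-servers.net, one of the 13 root nameservers
      	var answer []dns.RR
      	c := new(dns.Client)
      
      	for {
      		fmt.Printf("Querying for %s against nameserver %s\n", name, nameserver)
      
      		m := new(dns.Msg)
      		m.SetQuestion(name, dns.TypeA)
      
      		in, _, err := c.Exchange(m, fmt.Sprintf("%s:53", nameserver))
      		if err != nil {
      			panic(err)
      		}
      
      		// A non-empty answer section means we are done.
      		if len(in.Answer) > 0 {
      			answer = in.Answer
      			break
      		}
      
      		// Otherwise this is a referral: follow the first IPv4 glue record in
      		// the additional section. Other record types (AAAA, OPT) are skipped
      		// instead of being blindly type-asserted, which would panic.
      		next := ""
      		for _, extra := range in.Extra {
      			if a, ok := extra.(*dns.A); ok {
      				next = a.A.String()
      				break
      			}
      		}
      		if next == "" {
      			panic("EMPTY RESPONSE")
      		}
      		nameserver = next
      	}
      
      	fmt.Println(answer)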
      
    • With the caveat that it doesn’t work when an intermediate step returns an NS record in the authority section without the IPs of that NS record in the additional section (which would normally require an extra out-of-band lookup)

    • But it works fine when all intermediaries return A records for nameservers in the additional section:

      ❯ ./dns news.ycombinator.com.
      Querying for com. against nameserver 198.41.0.4
      Querying for ycombinator.com. against nameserver 192.12.94.30
      Querying for news.ycombinator.com. against nameserver 205.251.192.225
      [news.ycombinator.com.  1       IN      A       50.112.136.166]
      

17: TCP: Transmission Control Protocol

  • Connection-oriented protocol to provide a reliable byte-stream over an unreliable medium
  • No markers: writes don’t necessarily correspond with reads. Blocks that are written 4kB at a time may be read 2kB at a time (or 50kB at a time)
  • Each byte in the byte stream has a “sequence number”
    • Sequence numbers wrap after 2^32 - 1 and are selected randomly (to start with)
    • The SYN and FIN messages during connection setup/teardown each consume a sequence number
  • Connections are identified by (source port, source IP, dest port, dest IP)
  • In a TCP header:
    • Sequence number: identifies the first byte in the current packet’s payload -OR- identifies the initial sequence number if the SYN flag is set
    • Acknowledgment number: the next sequence number the sender of this packet expects to receive, only meaningful if the ACK flag is set
    • Window size: number of bytes the sender of this packet has available to store incoming data (flow control)
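Because sequence numbers wrap, “before” and “after” have to be decided with modular arithmetic. The sketch below shows one common way to do that, plus how SYN and FIN each consume a sequence number; the function names are invented for illustration.

```go
package main

import "fmt"

// seqLess reports whether sequence number a comes before b, treating the
// 32-bit sequence space as circular.
func seqLess(a, b uint32) bool {
	return int32(a-b) < 0
}

// nextSeq returns the sequence number that follows a segment starting at
// seq with payloadLen bytes of data. SYN and FIN each consume one number.
func nextSeq(seq uint32, payloadLen int, syn, fin bool) uint32 {
	n := seq + uint32(payloadLen) // uint32 arithmetic wraps past 2^32 - 1
	if syn {
		n++
	}
	if fin {
		n++
	}
	return n
}

func main() {
	isn := uint32(0xfffffff0) // an initial sequence number close to the wrap point
	afterSyn := nextSeq(isn, 0, true, false)
	afterData := nextSeq(afterSyn, 100, false, false)
	fmt.Println(afterSyn, afterData, seqLess(isn, afterData))
}
```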

18: TCP Connection Establishment and Termination

  • Establishing a connection
    • Three-way handshake
      1. Client sends a SYN specifying the start of the client’s sequence space
      2. Server sends a SYN specifying the start of the server’s sequence space
      3. Server sends an ACK to acknowledge receipt of the sequence number sent in 1.
      4. Client sends an ACK to acknowledge receipt of the sequence number sent in 2.
    • Steps 2 & 3 are typically combined, so it’s SYN,SYN+ACK,ACK
  • Closing a connection
    • One side can close the connection and continue receiving data from the other
      • The closer must send ACKs as normal without sending any data
      • This could be a reasonable method to signal EOF
    • Four-step termination
      1. One side sends a FIN
      2. Other side sends an ACK to acknowledge the FIN
      3. (Later…) Other side sends a FIN
      4. The initial closer sends an ACK to acknowledge the FIN from 3.
  • Maximum Segment Size (MSS)
    • This is an “option” that can be sent with a SYN to announce the largest segment the sender of that SYN is willing to receive
    • Used to avoid fragmentation
  • Connection states
    • All states + transitions: [figure omitted]
    • States during connection/termination: [figure omitted]
    • Connections must stay in TIME_WAIT for 2x the “maximum segment lifetime” to avoid mixing up segments between connections.
      • All segments received against a TIME_WAIT connection are discarded.
      • Can set the SO_REUSEADDR socket option to allow binding despite conflicts with TIME_WAIT sockets, which is required to restart servers without waiting for MSL expiry
    • A connection can be stuck in the half-closed FIN_WAIT_2 state forever, so most implementations use a timeout here
  • Quiet time
    • Wait for MSL seconds after a crash to avoid sending stale segments past the MSL
  • Reset/RST
    • Send a reset when a packet arrives for a nonexistent connection
    • Or when you want to abort the connection
  • TCP allows for simultaneous opens, where two machines connect to each other at well-known ports at the same time, leading to one single connection, not two
    • This is hard to artificially replicate - can only be triggered if both SYNs are in flight simultaneously
  • Also simultaneous closes, where both sides independently send FINs at the same time
    • When a FIN is received when in FIN_WAIT_1 but the accompanying ACK doesn’t cover the FIN that was just sent out
    • Then it must be a simultaneous close
  • When a server’s accept queue is full, it typically (this is true as of Linux 5.18.0) drops incoming SYNs without sending back a RST, encouraging retransmission

19: TCP Interactive Data Flow

  • Delayed acknowledgements
    • Wait for a bit (up to 200ms) to allow for new data to piggyback on the ACK
    • Disable this on Linux with TCP_QUICKACK
  • Nagle’s Algorithm
    • Coalesce many small payloads into fewer, larger payloads
    • Only allow a single un-ACKed small segment to be in flight at a given time
    • In the meantime, small segments are buffered until an ACK comes back
    • When the buffered data becomes larger than the segment size, it isn’t considered “small” anymore, and can be sent without waiting for an ACK
    • Disable this on Linux with TCP_NODELAY
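The send decision in Nagle’s algorithm, as summarized above, can be modelled in a few lines. This is a deliberately simplified sketch: it tracks only whether a small segment is outstanding and ignores everything else a real TCP stack does. The nagleSender type is invented for the example.

```go
package main

import "fmt"

// nagleSender models only the send decision of Nagle's algorithm.
type nagleSender struct {
	mss           int  // maximum segment size
	buffered      int  // bytes written by the application but not yet sent
	smallInFlight bool // is an un-ACKed small segment outstanding?
}

// write queues n bytes and returns the sizes of segments that may go out now.
func (s *nagleSender) write(n int) []int {
	s.buffered += n
	return s.flush()
}

// ack records that the outstanding small segment was acknowledged.
func (s *nagleSender) ack() []int {
	s.smallInFlight = false
	return s.flush()
}

func (s *nagleSender) flush() []int {
	var sent []int
	for s.buffered >= s.mss { // full-sized segments never wait
		sent = append(sent, s.mss)
		s.buffered -= s.mss
	}
	if s.buffered > 0 && !s.smallInFlight { // at most one small segment in flight
		sent = append(sent, s.buffered)
		s.buffered = 0
		s.smallInFlight = true
	}
	return sent
}

func main() {
	s := &nagleSender{mss: 1460}
	fmt.Println(s.write(1)) // first small segment goes out immediately
	fmt.Println(s.write(1)) // held back: a small segment is in flight
	fmt.Println(s.write(1)) // still held back
	fmt.Println(s.ack())    // ACK arrives: the two buffered bytes go out together
}
```

On real sockets in Go, (*net.TCPConn).SetNoDelay toggles TCP_NODELAY; Go enables it (Nagle off) by default.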

20: TCP Bulk Data Flow

  • If an ACK is sent with a small (or zero) window size, TCP may send a second ACK once the window grows larger (if no data is received in that interval). This is called a window update

  • Receipt of out-of-order segments must trigger duplicate ACKs

  • Sliding window

    • From the RFC (RFC 793):
        Send Sequence Space
      
                     1         2          3          4
                ----------|----------|----------|----------
                       SND.UNA    SND.NXT    SND.UNA
                                            +SND.WND
      
          1 - old sequence numbers which have been acknowledged
          2 - sequence numbers of unacknowledged data
          3 - sequence numbers allowed for new data transmission
          4 - future sequence numbers which are not yet allowed
      
                            Send Sequence Space
      
                                 Figure 4.
      
  • The PSH flag is used to tell a receiver to immediately flush the read buffer to the application and not wait around for more data. Sounds like this was already semi-deprecated in 1992, although nc still sends it.

  • Slow start

    • Senders transmitting enough data to fill the receiver’s window may overwhelm intermediary routers/etc.
    • Senders maintain a congestion window, which starts at one segment in length
    • Every ACK increases the size of this window by one segment
    • The sender doesn’t ever transmit past the congestion window

21-23: TCP Timeout and Retransmission + Other Timers

  • Four timers
    • Retransmission: senders set a timer and retransmit if not ACKed when the timer fires
    • Persist: keep window updates flowing
    • Keepalive: detect disconnects/crashes on an idle connection
    • 2MSL: move a connection from TIME_WAIT to CLOSED

Retransmission Timer

  • Retransmission

    • Retransmission timeouts are based on the measured RTT of the network: the timeout is derived from a smoothed RTT estimate using a constant factor (either 2 or 4 according to this book), and it is doubled on each successive retransmission of the same segment (exponential backoff)
    • The RTT is measured by recording the time between sending a segment and having it ACKed. This doesn’t work for a retransmission though, because you can’t tell if the ACK was for the original segment or the retransmitted one. Ambiguous RTT values are not used in the timeout calculation (Karn’s algorithm).
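A sketch of how these pieces might fit together, using Jacobson’s estimator (smoothed RTT A, mean deviation D, RTO = A + 4D, gains of 1/8 and 1/4) with doubling on each timeout. The starting estimates in main are arbitrary, and real stacks also clamp the RTO to minimum and maximum values, which is omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// rttEstimator keeps a smoothed RTT (a) and mean deviation (d), in seconds.
type rttEstimator struct {
	a, d    float64
	backoff int // consecutive retransmissions of the current segment
}

// sample feeds in a measured RTT m. Following Karn's algorithm, callers must
// not call it for segments that were retransmitted.
func (e *rttEstimator) sample(m float64) {
	err := m - e.a
	e.a += err / 8                   // gain 1/8 on the average
	e.d += (math.Abs(err) - e.d) / 4 // gain 1/4 on the deviation
	e.backoff = 0                    // a valid sample ends the backoff
}

// timeout records that the retransmission timer fired.
func (e *rttEstimator) timeout() { e.backoff++ }

// rto returns the current timeout: (A + 4D), doubled per retransmission.
func (e *rttEstimator) rto() float64 {
	return (e.a + 4*e.d) * math.Pow(2, float64(e.backoff))
}

func main() {
	e := &rttEstimator{a: 0.1, d: 0.05} // hypothetical starting estimates
	e.sample(0.2)
	fmt.Printf("RTO after sample:      %.4fs\n", e.rto())
	e.timeout()
	fmt.Printf("RTO after one timeout: %.4fs\n", e.rto())
}
```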
  • Congestion avoidance

    • Slow start isn’t sufficient, because at some point you’re going to hit the limit of an intervening router/etc. anyway
    • Keep track of an ssthresh variable: this is the congestion window size (in segments) at which slow start stops and a congestion avoidance algorithm takes over
    • A new connection starts off with a congestion window of 1 segment. This repeatedly doubles as ACKs are received (this is slow start), until the congestion window’s size passes ssthresh.
    • At this point the size of the congestion window is now controlled by a congestion avoidance algorithm, which is more conservative than slow start.
    • On a plot of congestion window size over time, the curve flattens once the connection switches from slow start to a congestion avoidance algorithm [figure omitted]
  • Fast retransmit

    • Receivers send a duplicate ACK when receiving a segment that’s out of order (like ack 6657 in the original figure, omitted here)

    • When a sender receives a duplicate ACK, this may mean one of:

      • The receiver received segments out of order, but all segments are reliably delivered
      • The receiver received segments out of order because one segment was irrevocably dropped
    • In the first case we don’t really want to retransmit, but in the second we do. To disambiguate, wait for three duplicate ACKs, which is a strong signal that we’re seeing the second scenario. In this case, perform a retransmit immediately.

    • After performing a fast retransmit, apply congestion control because we’re assuming that a segment was dropped due to congestion. Here are two common schemes (from Wikipedia):

      • Tahoe: if three duplicate ACKs are received (i.e. four ACKs acknowledging the same packet, which are not piggybacked on data and do not change the receiver’s advertised window), Tahoe performs a fast retransmit, sets the slow start threshold to half of the current congestion window, reduces the congestion window to 1 MSS, and resets to slow start state.

      • Reno: if three duplicate ACKs are received, Reno will perform a fast retransmit and skip the slow start phase by instead halving the congestion window (instead of setting it to 1 MSS like Tahoe), setting the ssthresh equal to the new congestion window, and enter a phase called fast recovery.
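The window dynamics described in this section can be simulated in a few lines. This sketch counts in whole segments rather than bytes, uses the standard “one segment per round trip” growth for congestion avoidance, and leaves out the details of fast recovery; the conn type and its method names are made up for illustration.

```go
package main

import "fmt"

// conn tracks congestion state in units of segments for simplicity.
type conn struct {
	cwnd, ssthresh float64
}

// onAck grows the window: one segment per ACK during slow start, and about
// one segment per round trip (1/cwnd per ACK) during congestion avoidance.
func (c *conn) onAck() {
	if c.cwnd < c.ssthresh {
		c.cwnd++
	} else {
		c.cwnd += 1 / c.cwnd
	}
}

// onTripleDupAck applies the Tahoe or Reno reaction to three duplicate ACKs.
func (c *conn) onTripleDupAck(reno bool) {
	c.ssthresh = c.cwnd / 2
	if c.ssthresh < 2 {
		c.ssthresh = 2 // keep the threshold at two segments or more
	}
	if reno {
		c.cwnd = c.ssthresh // Reno: halve the window and skip slow start
	} else {
		c.cwnd = 1 // Tahoe: back to slow start
	}
}

func main() {
	c := &conn{cwnd: 1, ssthresh: 8}
	for rtt := 1; rtt <= 6; rtt++ {
		for acks := int(c.cwnd); acks > 0; acks-- { // one ACK per segment in flight
			c.onAck()
		}
		fmt.Printf("after RTT %d: cwnd=%.2f\n", rtt, c.cwnd)
	}
	c.onTripleDupAck(true)
	fmt.Printf("after loss (Reno): cwnd=%.2f ssthresh=%.2f\n", c.cwnd, c.ssthresh)
}
```

The output doubles each round trip until cwnd reaches ssthresh, then grows by roughly one segment per round trip, matching the flattening curve mentioned above.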

  • Some of this data (RTT, congestion window size, ssthresh) are saved against the route (ip route) in the routing table for future connections

  • Use ss to check these metrics for a connection:

    ❯ ss -ti 'sport == 4001 || dport == 4001'
    State                  Recv-Q                  Send-Q                                   Local Address:Port                                      Peer Address:Port                   Process
    ESTAB                  0                       0                                            127.0.0.1:50294                                        127.0.0.1:4001
       cubic wscale:7,7 rto:204 rtt:0.059/0.027 mss:32768 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:12 bytes_acked:13 segs_out:4 segs_in:3 data_segs_out:2 send 44.4Gbps lastsnd:27768 lastrcv:71148 lastack:27768 pacing_rate 87.9Gbps delivery_rate 7.71Gbps delivered:3 app_limited rcv_space:65495 rcv_ssthresh:65495 minrtt:0.034 snd_wnd:65536
    ESTAB                  0                       0                                            127.0.0.1:4001                                         127.0.0.1:50294
       cubic wscale:7,7 rto:200 rtt:0.057/0.028 ato:40 mss:32768 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_received:12 segs_out:2 segs_in:4 data_segs_in:2 send 46Gbps lastsnd:71148 lastrcv:27768 lastack:69836 pacing_rate 92Gbps delivered:1 rcv_space:65483 rcv_ssthresh:65483 minrtt:0.057 snd_wnd:65536
    
    
  • Retransmits don’t have to resend the same exact packet that was sent the first time. More data can be stuffed in there if necessary.

Persist Timer

  • If the receiver advertises a zero window (its buffer is full) and the ACK that later reopens the window is lost, the connection can deadlock: the sender waits for a window update while the receiver waits for data
  • To avoid this, senders use a persist timer to periodically check whether the receiver is now able to accept data
  • Silly window syndrome: receivers advertise small windows (and senders then transmit tiny segments) instead of waiting until a larger window can be advertised, which wastes bandwidth on per-segment overhead

Keepalive Timer

  • TCP connections are kept alive simply by each peer holding connection state; no active measures (like polling) are necessary to maintain a connection
  • This only applies when both hosts have not crashed though. It’s possible for one host to crash, and for the other to think the connection is still up when it isn’t
  • TCP implementations (but not the RFC) include a keepalive timer to periodically send packets on idle connections to make sure both hosts are up

24: TCP Futures and Performance

Path MTU discovery

  • TCP starts off with the MTU of the outgoing interface or the MSS announced by the other side, whichever is smaller, and sets the don’t fragment (DF) bit on the IP packet
  • If an intermediate router is unable to transmit this packet because of a smaller MTU, it generates an ICMP message. TCP sees this and retransmits with a smaller segment size.
  • Routes can change dynamically, so after a while TCP gradually increases this value back up to the original (after 10 minutes by default according to RFC 1191)

Long Fat Pipes

  • Nomenclature
    • Define the capacity of a connection to be bandwidth * RTT, which is a measure for the max amount of data that can be in flight on that connection in a given instant
    • Also called the “bandwidth-delay product”, or simply the size of the pipe
    • Networks with a high capacity are “long fat” networks, and connections on these networks are “long fat pipes”
  • Long fat pipes are bad for latency but can be great for throughput
  • TCP can’t (by default) optimize for throughput because the window size maxes out at 64kB (16-bit field)
    • There’s a “window scale” option that both sides have to use during their SYNs
    • Set the “window scale” to a value between 0 and 14
    • The “window” field is then interpreted as window * (2 ^ scaling_factor)
    • Max window size is now 1GB
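A quick calculation shows why the 16-bit window is limiting and how the scale factor helps. The link parameters in main (1 Gbit/s, 60 ms RTT) are illustrative values only, and both function names are invented for this sketch.

```go
package main

import "fmt"

// bdpBytes returns the bandwidth-delay product in bytes.
func bdpBytes(bitsPerSec, rttSec float64) float64 {
	return bitsPerSec / 8 * rttSec
}

// scaleFor returns the smallest window-scale shift (0..14) whose maximum
// window, 65535 << shift, covers want bytes. ok is false if even a shift of
// 14 is not enough.
func scaleFor(want float64) (shift uint, ok bool) {
	for shift = 0; shift <= 14; shift++ {
		if float64(uint64(65535)<<shift) >= want {
			return shift, true
		}
	}
	return 14, false
}

func main() {
	bdp := bdpBytes(1e9, 0.060)
	shift, ok := scaleFor(bdp)
	fmt.Printf("BDP=%.0f bytes, window scale=%d (fits: %v)\n", bdp, shift, ok)
	fmt.Println("max scaled window:", 65535<<14, "bytes") // roughly 1GB
}
```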

Timestamp

  • Insert a monotonic counter into every segment, the receiver returns this value unchanged with an ACK, and the sender can determine how many times the counter ticked in between
  • This is a better means of measuring RTT than having to maintain local state against every transmitted segment, but also:
    • Wrapped & duplicated sequence numbers are measured individually
    • Retransmitted segments are measured individually
  • The timestamp can also be used to guard against wrapped sequence numbers by identifying segments with plausible sequence numbers but stale timestamps relative to other timestamps being received in the same area of the sequence number space