TEA: Enabling State-Intensive Network Functions on Programmable Switches

[TOC]

Background

傳統實體NF設備或是Virtualized NF on server支援的traffic rate大約只有數十Gbps, 比P4 switch的3Tbps低了數十倍
但NF實作在P4 Switch上時，雖然好處是可以處理很高的traffic rate，缺點是Switch上的SRAM只有數十MB，當concurrent flows很多時，並不足以儲存那麼大的flow table

Introduction

TEA整套系統透過部屬在ToR Switch上，利用機櫃裡其他server上的DRAM儲存大部分的flow entries，switch上的SRAM只作為cache儲存少部分常用的flow entries，當cache miss時才會去server上的TEA table找對應的entry
TEA table是Key-value table，透過hash function把5-tuple(key)對應到flow entry的memory address
TEA的目標對象是compute-light且state-heavy的NF，例如:firewall, vpn gateway, NAT等等。這些NF需要較大的flow table儲存flow states。
TEA獨特的好處： (放下一段)

比較過去已有解法

過去也有其他人用別的方法應對Switch上SRAM不夠大的問題

RPC Solution 同樣將flow entries放在其他server的DRAM，透過socket-programming的方式access

問題：
- 涉及switch control-plane CPU 及 server CPU，會有OS排程優先順序的問題，導致Latency高而且無法預測
- switch control-plane與data-plane之間PCIe bus bandwidth有限，會形成bottleneck
利用switch上的DRAM 用control-plane或是on-board off-chip DRAM => 同樣會卡在memory bandwidth => not scalable

TEA解法

避免CPU involvement => 不要用socket，透過類似RDMA的方式，直接在data-plane加上RoCE Header，server端的網卡收到後即可access對應的DRAM address，避免了OS排程問題，使得latency較穩定 => 不經過control-plane，避免了control-plane與data-plane之間bandwidth有限的問題

TEA Design Challenges & Solution

DRAM access in data-plane 內容和上一段類似。特別要提的是，受限於ASIC的編程能力，TEA只實作了部分的RDMA protocol。在一些前提與對應的workaround之下，才能說RDMA connection是reliable的
TEA table - Single Round-trip Lookup
- 因為access remote memory的時間成本太高(2 µs), 所以希望在1個RTT內完成
- 最初解法： Cuckoo-hashing, 2個hash function所以要檢查2個bucket, O(1) look-up, 但要做2次look-up才能找到entry
- 最終解法： Bounded Linear Probing(BLP), 1個RTT，但這在別的paper才有詳細描述, 感覺像上面的改良版
- 當有collision發生時，會直接放在switch的SRAM, 約0.1%比例的entry
TEA table - Deferred Packet Processing
- 當access remote entry的這段時間，我們希望switch能接著處理後面的packet, pipeline不要因此卡住，也就是要能做到async. lookup
=> 解法：多花一個RDMA write把正在處理的packet buffer到DRAM上，之後用RDMA read讀回flow entry時，在一併把該packet讀回來處理 => 成本： 1. 占用switch<–>server bandwidth變多 2. 占用switch的egress pipeline的bandwidth變多 3. 占用server上的memory bandwidth
Multiple DRAM Servers
- 要把TEA-Table分成多塊儲存在多台server上時，就得維護entry跟server ID的mapping
- 傳統作法：用Distributed hash scheme, e.g. consistent hashing, 把每條flow entry均勻的拆散到各台server上
  - 問題：mapping table會變很大，無法存在switch (consistent hashing為了均勻所加入的virtual node所導致)
- 最終做法： 直接將TEA-Table拆成N個連續的sub-table，這樣只要存N個range-mapping rule就好
  - 問題：Load Balance不如上面好，故用count-min sketch把最常出現的O(NlgN) flow entries cache在switch SRAM，可有效減少占用switch<–>server bandwidth
High Availability
- 要能快速偵測high link utilzation跟server down
- 定期從control-plane poll資訊會太慢
- 在switch的port-meter上設bandwidth+QoS threshold，太低就從別的port egress

Limitation

超過Server端網卡的RDMA read buffer時，會直接drop packet => 最好能把read request導到別台DRAM server
多server的TEA-Table lookup仍不夠有效率
只能用exact-match flow rule, 不能用longest-prefix match之類的

Evaluation & Conclution

相較傳統的NF on commodity server, TEA-enabled NF 提供 9.6× throughput (7.3–10.9M lookups per second) and 3.1× lower latency (1.8–2.2μs)
Switch SRAM cache在skewed workload下，可減少將近一半的 DRAM server access request，藉此能達到load balance
現實世界的封包流量反而比較傾向skewed workload, 而不是uniform distributed, 這導致cache更加重要
TEA-Enabled NF的latency除了低之外，也很穩定，不會隨packet size有太大起伏，都在2μs左右

Architecture

NEXTInterpreting Deep Learning-Based Networking Systems