1RMA: Re-envisioning Remote Memory Access for Multi-tenant Datacenters
1 Background & Issues
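- Traditional RDMA is a poor fit for multi-tenant datacenters
- Too many connections
- With InfiniBand-style RDMA, the NIC caches the state of every host-to-remote connection
- Connections are per app-pair, so under heavy traffic they consume too much NIC cache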
- Induced Ordering
- Occurs when multiple requests share the same connection
- RDMA requires FIFO processing, so head-of-line blocking can occur and starve high-priority requests
- Security Issue
- Encryption is handled entirely by the NIC, so applications cannot rotate keys quickly
- Hardware Congestion Control
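- Switches must use Priority Flow Control (PFC) to give RDMA a lossless environment
- But PFC suffers from deadlock, poor failure isolation, and head-of-line blocking
- Congestion control is hard-wired into the NIC and is difficult to tune by hand after deployment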
- Error Handling
- When a connection breaks, the client cannot tell whether the RDMA op actually modified remote memory on the server side
2 Introduction
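- 1RMA redraws the division of responsibility between the NIC and software, moving some of the work that the NIC used to do entirely on its own into software
- Hardware focuses on RMA read/write and encryption
- Software handles CC, op pacing, and timeout policy
- Design goals: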
- No Connections
- No connection state to cache
- Every op is treated as independent => per-op retry/fail-recovery
- Small-sized ops, solicitation based
- Hardware solicitation window to prevent incast
- Software Congestion Control
- Software-defined resource allocation
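- Unlike traditional RDMA, which must cache a large amount of state on the NIC in order to support a lossless network
- Priority determines how many resources are allocated to each request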
- First-Class Security
- Applications are allowed to perform key rotation
- Each memory region is protected by a different key
3 1RMA Overview
- Steps 1-2: Get $K_d$ and $RegionID$
- Step 3: A request can be issued to the NIC only when the solicitation window has room
- SW: chunking, SW->NIC issue rate, congestion control (slow)
- HW: admission control via the solicitation window (fast)
- Step 4: Sign the request with $K_d$
- Step 6: Encrypt the response with $K_d$
4 1RMA Design In Depth
4-1 Security
- Derived Key
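- $K_d$: session key, per-process, used to sign (and encrypt)
- A client can issue RDMA requests only after it has obtained $K_d$
- $K_r$: region key, per-memory-block, stored on the remote NIC
- Defends against the following attacks
- Replay attack: the server salts $K_d$ when generating it
- Ciphertext injection: nothing can be decrypted without $K_d$
- Accessing someone else's remote memory: an attacker may guess the RegionID but cannot guess $K_d$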
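The key relationship above can be illustrated with a minimal sketch. It assumes, purely for illustration, that $K_d$ is derived from $K_r$ and a salt with HMAC-SHA256 and that requests are signed the same way; the notes do not specify the actual derivation or cipher, and the names `derive_key`, `sign_request`, and `verify_request` are hypothetical.

```python
import hashlib
import hmac
import os


def derive_key(region_key: bytes, salt: bytes) -> bytes:
    """Derive a session key K_d from the region key K_r (assumed HMAC-SHA256 KDF)."""
    return hmac.new(region_key, salt, hashlib.sha256).digest()


def sign_request(k_d: bytes, region_id: int, offset: int, length: int) -> bytes:
    """Sign the request fields with K_d (the message layout is an assumption)."""
    msg = f"{region_id}:{offset}:{length}".encode()
    return hmac.new(k_d, msg, hashlib.sha256).digest()


def verify_request(region_key: bytes, salt: bytes, region_id: int,
                   offset: int, length: int, tag: bytes) -> bool:
    """Remote side: re-derive K_d from the stored K_r and check the signature."""
    expected = sign_request(derive_key(region_key, salt), region_id, offset, length)
    return hmac.compare_digest(expected, tag)


k_r = os.urandom(32)    # region key, known only to the remote side
salt = os.urandom(16)   # salt chosen when K_d is generated
k_d = derive_key(k_r, salt)

# A client holding K_d produces a valid signature.
tag = sign_request(k_d, region_id=7, offset=0, length=4096)
ok = verify_request(k_r, salt, 7, 0, 4096, tag)

# An attacker who guesses the RegionID but not K_d cannot forge one.
forged_tag = sign_request(os.urandom(32), 7, 0, 4096)
forged = verify_request(k_r, salt, 7, 0, 4096, forged_tag)
```

In this sketch the remote side only needs $K_r$ and the salt to check any client's request, so no per-client state has to be stored.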
4-2 Hardware
- RRT: static table that maps each RegionID and $K_r$ to its memory range
- CST: each entry tracks a single in-flight operation
- Solicitation Window: admission control; limits how many packets in the FIFO may enter the network
- The number of memory regions for RMA scales with tasks, not task-pairs
- So it stays manageable with finite resources
- Timeout: an op that waits too long without entering the window simply times out
- Avoids head-of-line blocking and provides a congestion signal
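The admission-control and timeout behavior described above can be sketched as a small software model. This is an illustration under stated assumptions, not the hardware design: the window is modeled as a byte budget, ops are admitted in FIFO order, and the class name `SolicitationWindow`, its methods, and the numeric values are all hypothetical.

```python
from collections import deque


class SolicitationWindow:
    """Toy model: a byte budget for outstanding solicited data (assumed semantics)."""

    def __init__(self, capacity_bytes, timeout):
        self.capacity = capacity_bytes
        self.timeout = timeout
        self.in_flight = 0
        self.queue = deque()  # entries are (op_id, size_bytes, enqueue_time)

    def submit(self, op_id, size, now):
        self.queue.append((op_id, size, now))

    def poll(self, now):
        """Admit queued ops while they fit; fail ops that waited too long."""
        admitted, timed_out = [], []
        while self.queue:
            op_id, size, t0 = self.queue[0]
            if now - t0 > self.timeout:
                # Waited too long: fail the op instead of blocking those behind it.
                self.queue.popleft()
                timed_out.append(op_id)
                continue
            if self.in_flight + size > self.capacity:
                break  # window is full; wait for completions
            self.queue.popleft()
            self.in_flight += size
            admitted.append(op_id)
        return admitted, timed_out

    def complete(self, size):
        """Called when a response arrives and frees window space."""
        self.in_flight -= size


win = SolicitationWindow(capacity_bytes=8192, timeout=5)
for op in ("a", "b", "c"):
    win.submit(op, 4096, now=0)
first = win.poll(now=0)    # "a" and "b" fit; "c" must wait
win.submit("d", 4096, now=1)
second = win.poll(now=6)   # "c" has waited more than 5 time units and fails
win.complete(4096)         # one response arrives
third = win.poll(now=6)    # "d" now fits
```

A timed-out op doubles as a congestion signal: software can treat it as evidence that the window is oversubscribed and slow its issue rate.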
4-3 Software
- CommandExecutor: chunking, CC, pacing
- CommandPortal: maps application memory <-> NIC registers
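Chunking, one of the CommandExecutor duties listed above, can be sketched as follows. The function name `chunk_transfer` and the 4096-byte maximum op size are assumptions for illustration; the notes only say that ops are small.

```python
def chunk_transfer(offset, length, max_op_size):
    """Split one large transfer into (offset, size) ops no larger than max_op_size."""
    ops = []
    end = offset + length
    while offset < end:
        size = min(max_op_size, end - offset)
        ops.append((offset, size))
        offset += size
    return ops


# A 10000-byte transfer with an assumed 4096-byte maximum op size.
ops = chunk_transfer(0, 10000, 4096)
# ops == [(0, 4096), (4096, 4096), (8192, 1808)]
```

Because each chunk is an independent op, it can be paced, retried, or failed on its own.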
5 Other 1RMA Ops
5-1 RMA Write
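- Asks the remote side to perform an RMA read from the local side
- Con: costs one extra RTT
- Pro:
- Reuses the RMA read mechanism, so nothing has to be redesigned
- The client times out later than the remote
- Avoids the case where, after a disconnect, the client cannot tell whether the write to remote memory succeeded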
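The write-as-requested-read idea above can be modeled with a toy in-process sketch. Everything here (the `Host` class, its methods, string region names) is hypothetical and ignores networking, security, and timeouts; it only shows the direction of data flow.

```python
class Host:
    """Toy host whose memory is a dict of named regions (hypothetical model)."""

    def __init__(self):
        self.regions = {}

    def serve_read(self, region_id):
        # Serve an RMA read issued by a peer.
        return self.regions[region_id]

    def serve_write_request(self, initiator, src_region, dst_region):
        # The remote pulls the payload with an ordinary RMA read,
        # then commits it locally before acknowledging.
        self.regions[dst_region] = initiator.serve_read(src_region)
        return True


def rma_write(initiator, remote, src_region, dst_region):
    # Round trip 1: the write request reaches the remote.
    # Round trip 2 (the extra RTT): the remote reads the initiator's buffer.
    return remote.serve_write_request(initiator, src_region, dst_region)


client, server = Host(), Host()
client.regions["src"] = b"payload"
acked = rma_write(client, server, "src", "dst")
```

Since the remote drives the data movement, giving the client the longer timeout means that by the time the client gives up, the remote has already either committed the write or abandoned it.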
5-2 Rekey
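- Uses RMA write to perform key rotation
- Low cost: install a new region key $K_r$ in 1 RRT
- Problems with the traditional approach
- High transient connection usage: connections that use the new key must be established first
- Bursts of connection failure: authentication fails en masse while the key is being changed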
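A minimal sketch of why rekeying is cheap when there are no connections: replacing $K_r$ in a table entry is a single update, and keys derived from the old $K_r$ stop verifying immediately. The table layout, the HMAC-SHA256 derivation, and the names `rrt`, `derive`, `is_authorized`, and `rekey` are assumptions for illustration, not the actual 1RMA mechanism.

```python
import hashlib
import hmac

# Hypothetical RRT: RegionID -> region key and memory range.
rrt = {7: {"k_r": b"old-region-key", "range": (0x1000, 0x2000)}}


def derive(k_r: bytes, salt: bytes) -> bytes:
    """Assumed derivation of K_d from K_r and a salt."""
    return hmac.new(k_r, salt, hashlib.sha256).digest()


def is_authorized(table, region_id, salt, k_d):
    """Check a presented K_d against the K_r currently installed for the region."""
    entry = table.get(region_id)
    if entry is None:
        return False
    return hmac.compare_digest(derive(entry["k_r"], salt), k_d)


def rekey(table, region_id, new_k_r):
    """Install a new region key; one table update, no connections to rebuild."""
    table[region_id]["k_r"] = new_k_r


salt = b"client-salt"
old_k_d = derive(rrt[7]["k_r"], salt)
before = is_authorized(rrt, 7, salt, old_k_d)

rekey(rrt, 7, b"new-region-key")
after_old = is_authorized(rrt, 7, salt, old_k_d)
after_new = is_authorized(rrt, 7, salt, derive(b"new-region-key", salt))
```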
6 Congestion Control
- SW: does most of the CC computation
- Uses delay as the congestion signal
- 1RMA is connection-free, so CC only needs to run on the requesting side
- Congestion Window
- Remote: one per remote NIC per direction
- Local: a single shared window
- Local CC Algo
- Op issue rate: $\text{rate} = \frac{\text{OpSize} \times \text{CWND}}{\text{RTT}}$
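A worked example of the issue-rate formula follows, with a generic delay-based window update added for context. The units (OpSize in bytes, CWND in ops, RTT in seconds) and the sample values are assumptions, and `update_cwnd` is a plain additive-increase/multiplicative-decrease sketch driven by delay, not the algorithm from the paper.

```python
def issue_rate(op_size_bytes, cwnd_ops, rtt_seconds):
    """Issue rate in bytes per second: OpSize * CWND / RTT."""
    return op_size_bytes * cwnd_ops / rtt_seconds


def pacing_interval(cwnd_ops, rtt_seconds):
    """Time between consecutive ops so that CWND ops are spread over one RTT."""
    return rtt_seconds / cwnd_ops


def update_cwnd(cwnd, measured_delay, target_delay, decrease=0.5):
    """Generic delay-based AIMD (illustrative only, not the 1RMA algorithm)."""
    if measured_delay <= target_delay:
        return cwnd + 1.0                 # additive increase
    return max(1.0, cwnd * decrease)      # multiplicative decrease


# Assumed values: 4096-byte ops, a window of 8 ops, and a 20 microsecond RTT.
rate = issue_rate(4096, 8, 20e-6)   # about 1.64e9 bytes per second
gbps = rate * 8 / 1e9               # about 13.1 Gbit/s
gap = pacing_interval(8, 20e-6)     # about 2.5 microseconds between ops

grown = update_cwnd(8.0, measured_delay=15e-6, target_delay=20e-6)
shrunk = update_cwnd(8.0, measured_delay=30e-6, target_delay=20e-6)
```

With these numbers, halving the window to 4 ops halves the issue rate to roughly 6.6 Gbit/s at the same RTT.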