1RMA: Re-envisioning Remote Memory Access forMulti-tenant Datacenters

1 Background & Issues

  • 傳統的RDMA並不適用於Multi-tenant datacenter
    • 使用過多連線
      • 基於InfiniBand RDMA的特性, NIC會cache住host-remote每一條連線的state
      • 連線是per app-pair, 在大流量環境會占用過多NIC cache
    • Induced Ordering
      • 當多筆request要共用同一條connection時
      • 由於RDMA要求FIFO方式處理,可能發生head-of-line blocking,導致高priority request陷入stravation
    • Security Issue
      • 加密完全交由NIC處理,app不能快速做key rotation
    • Hardware Congestion Control
      • Switch必須用 Priority Flow Control(PFC),提供RDMA lossless環境
        • 但PFC會有deadlock, poor failure isolation, head-of-line blocking
      • 寫死在網卡,deploy後很難手動調整
    • Error Handling
      • 當斷線時,client不知道server端的RDMA op到底有沒有改道remote memory

2 Introduction

  • 1RMA重新設計劃分網卡與軟體間的責任,把原本全部都由NIC完成的工作,拆一些出來給軟體做
    • 硬體專注RMA read/write, encrypt
    • 軟體負責CC, op pacing, timeout policy
  • 設計目標:
    1. No Connections
      • 不用cache connection state
      • 每個op都視為獨立 => per-op retry/fail-recovery
    2. Small-sized ops, solicitation based
      • Hardware solicitation window to prevent TCP incast
    3. Software Congestion Control
      • dynamic
    4. Software-defined resource allocation
      • 不像傳統RDMA為了滿足lossless的網路環境,而必須要在網卡上cache一堆state
      • 透過priority決定要分配給request多少資源
    5. First-Class Security
      • 讓app有權限做key rotation
      • 每個memory region用不同key保護

3 1RMA Overview

  • Step1~2 Get $K_d$, $RegionID$
  • Step3 Solicitation window有空位的時候,才能issue request到NIC
    • SW: chunk, SW->NIC的速率、Congestion Control (Slow)
    • HW: 用Soli. window做Admission Control (Fast)
  • Step4 $K_d$ sign request
  • Step6 $K_d$ encrypt response

4 1RMA Design In Depth

4-1 Security

  • Derived Key
    • $K_d:$ Session Key, per-process, 用來sign跟(encrypt)
      • client拿到$K_d$後才有資格做RDMA request
    • $K_r:$ Region Key, per-memory-block, 存在remote NIC
    • 可防以下攻擊
      1. Replay attack: 因server生$K_d$有加salt
      2. inject ciphertext: 沒$K_d$無法decrypt
      3. Access other’s remote memory: 可猜出RegionID猜不出$K_d$

4-2 Hardware

  • RRT: static table, 存RegionID, Kr對應到的memory range
  • CST: single in-flight operation
  • Solicitaion Window: Admission Control, 限制FIFO中多少packet能進入網路
  • Number of memory regions for RMA based on tasks, not task-pairs
    • manageable in finite resources
  • Timeout: 等太久都沒進入window就直接timeout
    • 避免head-of-line blocking, 提供congestion signal

4-3 Software

  • CommandExecutor: chunking, CC, pacing
  • ComamndPortal: App. memory <—> NIC register mapping

5 Other 1RMA Ops

5-1 RMA Write

  • 要求Remote對local做RMA read
  • Con: 多花一個RTT
  • Pro:
    • 機制可以沿用RMA read的,不用重新設計
    • client會比remote晚timeout
      • 可避免斷線時,client不知道write remote memory到底有沒有成功

5-2 Rekey

  • 用RMA write做key rotation
  • 成本低:Install a new region key 𝐾𝑟 in 1 RRT
  • 傳統方法問題
    • High transient connection usage: 要先建用新key的連線
    • Bursts of connection failure: 換key的時候會瘋狂auth fail

6 Congestion Control

  • SW: 主要做CC運算
    • HW: 偵測delay提供給SW
  • Delay當作CC指標
  • Connection-free 1RMA, 故可以只在request端做CC就好
    • per remote NIC
  • Congestion Window
    • Remote: 每張remote NIC/direction 一個
    • Local: 共用同一個
  • Local CC Algo
  • ops’ issue rate = $\frac{OpSize*CWND}{RTT}$