0 Abstract
- HyperLoop uses RDMA to offload memory operations onto the NIC
- By bypassing the CPU, it accelerates replicated transactions
1 Introduction
- Replicated Transaction
- To ensure availability and durability, the DB replicates data into multiple copies stored in different places
- A replicated transaction updates all of these copies together (in sync) to preserve consistency
- This involves the CPU and locking, so latency is high and unstable
- Existing approaches (e.g., kernel bypass, conventional RDMA) are only suited to transactions within a single storage system
- With multiple replicas, CPU I/O polling is still required
- The CPU takes part in every step of the transaction
- Hence they are a poor fit for multi-tenant storage systems
- HyperLoop's key innovations
- RDMA WAIT
- A built-in but rarely used command
- Only after a specific op is received does WAIT activate the RDMA ops we pre-posted
- Avoids CPU polling
- Remote work request manipulation: proactively updating the contents of work queue entries on other NICs
- Requires modifying the NIC driver
2 Background & Issues
- Overview of replicated transactions
- Single-transaction update flow (atomic); a minimal sketch follows this list:
- Step 1: write the modifications of X and Y to the log (in memory)
- Step 2: lock X and Y
- Step 3: commit, i.e. flush the log to NVM
- Step 4: release the locks
- The user is only notified after the data has been updated on enough replicas
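A minimal, self-contained C sketch of the four steps above, assuming plain in-memory objects and a stub for the NVM flush; none of these names are HyperLoop's API:

```c
#include <pthread.h>

/* Illustrative only: simulates the four steps of one atomic update on a
 * single replica; the names here are hypothetical, not HyperLoop's API. */
struct object { pthread_mutex_t lock; long value; };
struct log_entry { struct object *obj; long new_value; };

/* Stand-in for a durable flush (e.g., persisting the log entries to NVM). */
static void flush_to_nvm(struct log_entry *log, int n) { (void)log; (void)n; }

static void update_xy(struct object *x, long new_x,
                      struct object *y, long new_y)
{
    /* Step 1: write the modifications of X and Y to the (in-memory) log. */
    struct log_entry log[2] = { { x, new_x }, { y, new_y } };

    /* Step 2: lock X and Y. */
    pthread_mutex_lock(&x->lock);
    pthread_mutex_lock(&y->lock);

    /* Step 3: commit -- flush the log to NVM, then apply it to the data. */
    flush_to_nvm(log, 2);
    x->value = new_x;
    y->value = new_y;

    /* Step 4: release the locks. */
    pthread_mutex_unlock(&y->lock);
    pthread_mutex_unlock(&x->lock);
}
```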
- Two-phase commit: a consensus protocol that decides when the replicas should flush
- Chain Replication
- A form of two-phase commit, and HyperLoop's main optimization target (see the sketch after this list)
- Step 1: the replicas along the chain write the log in order
- Step 2: after the last replica has written its log, an ACK is passed back up the chain, meaning every replica is ready to commit
- Step 3: each replica commits once it receives the ACK
- Step 4: when the head receives the ACK, the replicated transaction is done
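A rough sketch of one replica's role in this chain, assuming a generic message-passing helper send_to; these helpers are hypothetical stand-ins, since HyperLoop drives these steps with RDMA operations rather than explicit messages:

```c
#include <stddef.h>

/* Hypothetical helpers -- stand-ins for the RDMA operations HyperLoop uses. */
enum msg_type { MSG_LOG, MSG_ACK };
void send_to(int replica_id, enum msg_type t, const void *data, size_t len);
void append_log(const void *data, size_t len);   /* write to the local log   */
void commit_log(void);                           /* flush the update to NVM  */

struct replica {
    int is_tail;            /* last replica in the chain?       */
    int prev_id, next_id;   /* neighbours; prev_id < 0 at head  */
};

void on_message(struct replica *r, enum msg_type t, const void *data, size_t len)
{
    if (t == MSG_LOG) {
        append_log(data, len);                       /* Step 1: log in chain order   */
        if (r->is_tail) {
            commit_log();                            /* the tail is ready, so commit */
            send_to(r->prev_id, MSG_ACK, NULL, 0);   /* Step 2: tail starts the ACK  */
        } else {
            send_to(r->next_id, MSG_LOG, data, len);
        }
    } else { /* MSG_ACK */
        commit_log();                                /* Step 3: commit on ACK        */
        if (r->prev_id >= 0)
            send_to(r->prev_id, MSG_ACK, NULL, 0);   /* pass the ACK up the chain    */
        /* Step 4: at the head (prev_id < 0) the replicated transaction is done. */
    }
}
```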
- Multi-tenancy Causes High Latency
- Busy CPU
- The traditional approach runs one process per replica
- Each process handles the locking and log work: it pulls logs out of the network stack, then writes them into the storage stack
- But in a multi-tenant setting, one server is carved into replicas for many different tenants, which requires too many processes
- Heavy CPU load and many context switches => high latency
- Shortcomings of existing approaches
- User-level TCP and conventional RDMA only offload the network stack to the NIC
- The critical storage replication and transaction work is still stuck on the CPU
- Core-pinning, used to avoid context switches, is even less suitable for multi-tenant settings
- Because the number of replicas (partitions) far exceeds the number of cores
3 HyperLoop Overview
3-1 Design Goals
- No replica CPU involvement on the critical path
- Provide ACID operations
- End-host only implementation based on commodity hardware
- Only the client and replica end hosts need to be touched; the switches are left unchanged
3-2 HyperLoop Architecture
- Net primitive lib: implements the four group network primitives (via RDMA); see the per-group state sketch below
- Provides the writes/synchronization that replicated transactions need
- Does not involve the replicas' CPUs
- RDMA NIC:
- After receiving a packet from the previous NIC, it executes the corresponding memory ops, then forwards the packet to the next replica
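The notes don't spell out HyperLoop's internal data structures; purely as an assumption, the client-side library presumably keeps something like the following per replication group (region addresses and rkeys it later uses to pre-compute metadata):

```c
#include <stdint.h>
#include <infiniband/verbs.h>

#define HL_MAX_REPLICAS 8   /* arbitrary limit for this sketch */

/* Hypothetical client-side view of one replication group. */
struct hl_replica {
    uint64_t log_addr;    /* remote address of the replica's log region   */
    uint32_t log_rkey;    /* rkey for RDMA access to that region          */
    uint64_t data_addr;   /* remote address of the persistent data region */
    uint32_t data_rkey;
};

struct hl_group {
    struct ibv_qp    *qp_to_head;                  /* client talks to the chain head */
    struct hl_replica replicas[HL_MAX_REPLICAS];   /* in chain order, head first     */
    int               num_replicas;
};
```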
4 HyperLoop Design
4-1 Key Ideas
- Why conventional RDMA needs CPU polling (see the ibverbs polling sketch after this list)
- Client -> Replica#1: sent directly via RDMA, no CPU needed
- Replica#1 -> Replica#2: needs CPU polling
- Because what must be forwarded to #2 depends on what the client sent
- So #1 pre-posts recv requests and periodically polls its own completion queue
- Only once it is sure the client has finished sending can it decide when to forward what to #2 (the "when" and the "what")
- By the same logic, every replica's CPU has to poll, waiting for the previous replica to finish sending before it knows when to send what to the next replica
- HyperLoop therefore lets the replica detect the recv completion itself and automatically forward the corresponding operation to the next replica, avoiding polling
- => RDMA WAIT
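For contrast, a minimal sketch of what replica #1's CPU has to do with plain ibverbs (standard libibverbs calls; receive-buffer management and error handling are simplified):

```c
#include <infiniband/verbs.h>

/* What a replica's CPU does with conventional RDMA: busy-poll its completion
 * queue until the client's data has arrived, then post the forwarding
 * operation towards the next replica. */
void forward_loop(struct ibv_cq *recv_cq, struct ibv_qp *qp_to_next,
                  struct ibv_send_wr *fwd_wr)
{
    struct ibv_wc wc;

    for (;;) {
        /* Busy-polling burns a CPU core per connection/partition. */
        int n = ibv_poll_cq(recv_cq, 1, &wc);
        if (n == 0)
            continue;                 /* nothing has arrived yet           */
        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            break;                    /* (real code would handle errors)   */

        /* Only now does the CPU know *when* and *what* to forward:
         * fwd_wr has to be filled in based on what the client just wrote. */
        struct ibv_send_wr *bad_wr;
        ibv_post_send(qp_to_next, fwd_wr, &bad_wr);
    }
}
```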
- RDMA WAIT (the "when"); a pseudo-API sketch follows this list
- Every replica pre-posts a RECV, a WAIT, and the forwarding Operation
- The Operation is blocked by the WAIT
- Once the WAIT is triggered by the RECV, it activates the blocked Operation
- At this point the Operation's contents are still whatever was fixed at pre-post time (on its own this only supports fixed replication); deciding "what" is the next idea
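A pseudo-API sketch of the pre-posting pattern; the helper names below are hypothetical, and the real WAIT is a built-in (if rarely used) RDMA NIC verb rather than these functions:

```c
/* Hypothetical pseudo-API; illustrative only. */
struct replica_qp;   /* connection to a neighbouring replica */
struct wr;           /* a pre-posted work request            */

struct wr *post_recv(struct replica_qp *from);                 /* expect data from predecessor */
struct wr *post_wait(struct replica_qp *on, struct wr *dep);   /* fires when `dep` completes   */
struct wr *post_forward(struct replica_qp *to);                /* RDMA op towards successor    */

void setup_autoforward(struct replica_qp *from_prev, struct replica_qp *to_next)
{
    /* Pre-post the whole chain up front: RECV -> WAIT -> forwarding op. */
    struct wr *recv = post_recv(from_prev);

    /* WAIT blocks everything posted after it on this queue until the RECV
     * completes, i.e. until the predecessor's data has actually landed. */
    post_wait(to_next, recv);

    /* Parked behind the WAIT; the NIC releases it by itself (the "when"),
     * so the replica's CPU never polls.  Its contents are still the ones
     * fixed at pre-post time -- the "what" is handled by the next idea. */
    post_forward(to_next);
}
```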
- Remote Work Request Manipulation (the "what"); a hypothetical sketch follows this list
- The contents of the forwarded operation are decided by the packet that was received
- The NIC driver must be modified so that the client (or a replica) can change the contents of specific pre-posted work requests in a replica's WQ
- Since the WQ is itself just a region of memory, the client should be able to modify its contents via RDMA
- The modification is driven by the pre-calculated metadata it receives (which includes each replica's memory info)
- Once modified, the WAIT activates these WRs and they are forwarded
- Extra WRs are spent on sending the metadata and on the WAITs, but the impact is small
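A hypothetical sketch of the metadata patching; the real WQE layout is device-specific and is only remotely writable because HyperLoop's modified NIC driver exposes/registers the work-queue memory:

```c
#include <stdint.h>

/* Illustrative only: the real WQE format is device-specific, and the names
 * below are hypothetical. */
struct wqe_patch {
    uint64_t remote_addr;   /* where the data should land on the next replica */
    uint32_t rkey;          /* rkey of that memory region                      */
    uint32_t length;        /* how many bytes to forward                       */
};

/* Hypothetical helper: an ordinary one-sided RDMA WRITE from the client. */
void rdma_write(uint64_t dst_addr, uint32_t dst_rkey, const void *src, uint32_t len);

/* For each replica, overwrite the parked work request in its work queue
 * (which is just registered memory) with pre-computed metadata, so that when
 * the WAIT releases it, it forwards the right bytes to the right place --
 * still without any replica CPU involvement. */
void patch_chain(const struct wqe_patch *patches,
                 const uint64_t *wqe_addrs, const uint32_t *wq_rkeys, int n)
{
    for (int i = 0; i < n; i++)
        rdma_write(wqe_addrs[i], wq_rkeys[i], &patches[i], sizeof patches[i]);
}
```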
- Integration with other RDMA operations to support ACID
4-2 Detailed Primitives Design
- Using the ideas above, four APIs are built that provide the operations replicated transactions need (an illustrative API sketch appears at the end of this section)
- Group write (gWRITE)
- Allows client to write data to memory regions of a group of remote nodes without involving their CPUs
- Used for log writes (replicated transaction log management)
- Group compare-and-swap (gCAS):
- Each replica compares a given value against the value in memory and decides whether to swap in the new value
- Used to implement group locking
- Group memory copy (gMEMCPY)
- Copies data of the given size from src_offset to dest_offset on all nodes: the NICs copy the data from the log region to the persistent data region without involving the CPUs
- Remote log processing
- Group RDMA flush (gFLUSH)
- Supports the durability at the “NIC-level”
- Flushes the data in the NIC cache to NVM (SSD)
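An illustrative shape of the four primitives as described above; the paper's exact signatures may differ, and gqp_t here is just a stand-in handle for a group of replica QPs set up by the HyperLoop library:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch only: not the paper's exact API. */
typedef struct hl_group gqp_t;   /* handle to a replication group */

/* gWRITE: write `len` bytes from the client's buffer into the log region of
 * every node in the group at `remote_off`, without involving replica CPUs.  */
int gwrite(gqp_t *g, const void *local_buf, size_t len, uint64_t remote_off);

/* gCAS: each replica compares the value at `remote_off` with `expect` and
 * swaps in `desired` if it matches -- used to take/release a group lock.    */
int gcas(gqp_t *g, uint64_t remote_off, uint64_t expect, uint64_t desired);

/* gMEMCPY: every NIC copies `len` bytes from `src_off` (log region) to
 * `dst_off` (persistent data region) locally, without the CPU.              */
int gmemcpy(gqp_t *g, uint64_t src_off, uint64_t dst_off, size_t len);

/* gFLUSH: flush data cached in the NIC out to durable memory (NVM),
 * providing durability at the NIC level.                                     */
int gflush(gqp_t *g);
```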
5 Case Study
Skip