0 Abstract
- HyperLoop uses RDMA to offload memory operations to the NIC
    - By keeping the CPU off the critical path, it accelerates replicated transactions
 
1 Introduction
- Replicated Transactions
    - To ensure availability and durability, a DB replicates data into multiple copies stored in different places
    - A replicated transaction updates (synchronizes) all of these copies together to keep them consistent
    - It involves the CPU and locking, so latency is high and unstable
- Existing solutions (e.g., kernel bypass, conventional RDMA) only help replicated transactions within a single storage system
    - With multiple replicas, CPU I/O polling is still required
    - The CPU participates in every step of the transaction
    - Hence they do not suit multi-tenant storage systems
 
- HyperLoop's key innovations
    - RDMA WAIT
        - A built-in but rarely used RDMA command
        - Once a specific op is received, WAIT activates the RDMA ops we pre-posted
        - Avoids CPU polling
    - Remotely updating the contents of work queue entries on another NIC
        - Requires modifying the NIC driver
 
 
 
2 Background & Issues
- Overview of Replicated Transactions
    - Update flow of a single transaction (atomic):
        - Write the modifications of X and Y to the log (in memory)
        - Lock X and Y
        - Commit (flush) the log to NVM
        - Release the locks
    - The user is notified only after the data has been updated on enough replicas
    - Two-phase commit: a consensus protocol that decides when the replicas should flush
    - Chain Replication
        - A form of two-phase commit; HyperLoop's main optimization target (see the sketch after this list)
        - Replicas along the chain write the log in order
        - After the last replica writes its log, it sends an ACK back up the chain, meaning every replica is ready to commit
        - Each replica commits once it receives the ACK
        - Once the head receives the ACK, the replicated transaction is complete
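A minimal C sketch of this per-replica chain logic (the CPU-driven design that HyperLoop later offloads); every helper function here is a hypothetical placeholder for the storage engine, not a real API:

```c
/* Per-replica chain-replication handlers, matching the steps above.
 * All helpers (append_to_log, flush_log_to_nvm, ...) are hypothetical. */
#include <stdbool.h>

struct log_entry;                                    /* opaque txn log record */

void append_to_log(const struct log_entry *e);       /* hypothetical */
void flush_log_to_nvm(void);                         /* hypothetical */
void release_locks(const struct log_entry *e);       /* hypothetical */
void send_to_successor(const struct log_entry *e);   /* hypothetical */
void send_ack_to_predecessor(void);                  /* hypothetical */

/* Phase 1: a log entry arrives from the predecessor in the chain. */
void on_log_from_predecessor(const struct log_entry *e, bool i_am_tail)
{
    append_to_log(e);                   /* every replica logs, in chain order */
    if (i_am_tail)
        send_ack_to_predecessor();      /* tail: all replicas are ready       */
    else
        send_to_successor(e);           /* otherwise keep the chain going     */
}

/* Phase 2: an ACK arrives from the successor, so it is safe to commit. */
void on_ack_from_successor(const struct log_entry *e, bool i_am_head)
{
    flush_log_to_nvm();                 /* commit: flush the log to NVM       */
    release_locks(e);
    if (!i_am_head)
        send_ack_to_predecessor();      /* propagate the ACK up the chain     */
    /* If this is the head, the replicated transaction is complete and the
     * client can be acknowledged. */
}
```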
 
 
 
- Multi-tenancy Causes High Latency
    - Busy CPU
        - The traditional approach dedicates one process per replica
        - Each process pulls the log from the network stack and stores it into the storage stack (and handles locking)
        - In a multi-tenant setting, one server is partitioned into replicas for many tenants, which requires too many processes
        - Heavy CPU load and many context switches => HIGH latency
    - Shortcomings of existing solutions
        - User-level TCP and conventional RDMA only offload the network stack to the NIC
        - The critical parts, storage replication and transaction processing, are still stuck on the CPU
        - Core-pinning, used to avoid context switches, is even less suitable for multi-tenancy
            - Because the number of replicas (partitions) >> the number of cores
 
 
 
 
3 HyperLoop Overview
3-1 Design Goals
- No replica CPU involvement on the critical path
- Provide ACID operations
- End-host-only implementation based on commodity hardware
    - Only the client and replica machines need to change; the switches do not
 
 
3-2 HyperLoop Architecture

- Network primitive library: implements four group-based network primitives (built on RDMA)
    - Provides the write/synchronization operations that replicated transactions need
    - Does not involve the replicas' CPUs
- RDMA NIC:
    - After receiving a packet from the previous NIC, it executes the corresponding memory ops and then forwards the packet to the next replica
 
 
4 HyperLoop Design
4-1 Key Ideas
- Why conventional RDMA still needs CPU polling
    - Client -> Replica #1: data is sent directly via RDMA, no CPU involved
    - Replica #1 -> Replica #2: CPU polling
        - What must be forwarded to #2 depends on what the client sent
        - So #1 pre-posts receive requests and periodically polls its own completion queue
        - Only after confirming the client has finished sending can it decide when to forward what to #2 (when and what)
    - By the same logic, every replica's CPU must poll until the previous replica has finished sending before it knows when and what to send to the next replica (a minimal polling sketch follows this list)
    - HyperLoop instead lets the NIC detect the receive completion itself and automatically forward the corresponding event to the next replica, avoiding polling
        - => RDMA WAIT
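For concreteness, a minimal sketch of the conventional CPU-driven forwarding step in standard libibverbs, assuming the client's message completes a pre-posted RECV and that QPs, memory registration, and the next hop's address/rkey are set up elsewhere:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

void forward_when_received(struct ibv_cq *recv_cq, struct ibv_qp *qp_to_next,
                           void *log_buf, uint32_t log_lkey,
                           uint64_t next_remote_addr, uint32_t next_rkey)
{
    struct ibv_wc wc;
    int n;

    /* CPU busy-polls the completion queue for the pre-posted RECV that the
     * client's message completes -- exactly the work RDMA WAIT removes. */
    do {
        n = ibv_poll_cq(recv_cq, 1, &wc);
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        return;                                  /* error handling omitted */

    /* Now the replica knows "when" and "what": forward the received log
     * entry to the next replica with an RDMA WRITE. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)log_buf,
        .length = wc.byte_len,
        .lkey   = log_lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = next_remote_addr;
    wr.wr.rdma.rkey        = next_rkey;

    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp_to_next, &wr, &bad_wr);
}
```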
 
 
- RDMA WAIT (When)
    - Every replica pre-posts a RECV, a WAIT, and an Operation (a minimal sketch follows this list)
    - The Operation is blocked by the WAIT
    - Once the WAIT is triggered by the RECV, it activates the blocked Operation
    - The Operation's content is not yet determined at this point (it is filled in by the mechanism below)
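A rough sketch of pre-posting the RECV / WAIT / Operation triple. The WAIT work request is only exposed through vendor extensions; the `ibv_exp_*` names and fields below follow the MLNX_OFED experimental verbs as an assumption and may differ across versions:

```c
/* Pre-post RECV / WAIT / Operation so the replica's NIC forwards on its own.
 * qp_from_prev faces the previous hop (its CQ is recv_cq); qp_to_next faces
 * the next hop.  ibv_exp_* names are assumed (MLNX_OFED experimental verbs). */
#include <infiniband/verbs.h>
#include <infiniband/verbs_exp.h>   /* MLNX_OFED experimental verbs */
#include <stdint.h>

void preenqueue_forwarding(struct ibv_qp *qp_from_prev, struct ibv_cq *recv_cq,
                           struct ibv_qp *qp_to_next,
                           struct ibv_sge *log_sge,          /* local log buffer */
                           uint64_t next_addr, uint32_t next_rkey)
{
    /* 1. RECV: gives the previous hop's data somewhere to land and produces
     *    the completion that the WAIT below watches for. */
    struct ibv_recv_wr recv_wr = { .sg_list = log_sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_recv;
    ibv_post_recv(qp_from_prev, &recv_wr, &bad_recv);

    /* 2. WAIT: blocks qp_to_next's send queue until one completion appears
     *    on recv_cq (i.e. the data has arrived). */
    struct ibv_exp_send_wr wait_wr = {0};
    wait_wr.exp_opcode             = IBV_EXP_WR_CQE_WAIT;    /* assumed name */
    wait_wr.task.cqe_wait.cq       = recv_cq;
    wait_wr.task.cqe_wait.cq_count = 1;

    /* 3. Operation: the forwarding RDMA WRITE, queued behind the WAIT.  Its
     *    content can still be patched later by the previous hop (remote work
     *    request manipulation). */
    struct ibv_exp_send_wr write_wr = {0};
    write_wr.exp_opcode          = IBV_EXP_WR_RDMA_WRITE;
    write_wr.sg_list             = log_sge;
    write_wr.num_sge             = 1;
    write_wr.wr.rdma.remote_addr = next_addr;
    write_wr.wr.rdma.rkey        = next_rkey;

    wait_wr.next = &write_wr;
    struct ibv_exp_send_wr *bad_send;
    ibv_exp_post_send(qp_to_next, &wait_wr, &bad_send);
}
```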
 
- Remote Work Request Manipulation (What)
    - The content of the forwarded operation is determined by the packet just received
    - The NIC driver must be modified so that the client (or a replica) can modify specific pre-posted work requests inside a replica's WQ (see the sketch after this list)
        - Since the WQ is also just a region of memory, the client should be able to modify its contents via RDMA
        - The modification is driven by pre-calculated metadata received earlier (including each replica's memory info)
    - Once modified, the WAIT activates these WRs and they are forwarded
    - Extra WRs are spent on sending metadata and on WAIT, but the impact is small
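A conceptual sketch, assuming the modified driver has registered the replica's work-queue region as ordinary RDMA-writable memory and that its address/rkey plus the layout of the patched fields (`struct wqe_update` below) are known from the pre-exchanged metadata; these names are illustrative, not a real driver interface:

```c
/* Remote work request manipulation: because the replica's WQ is just host
 * memory that the (modified) driver exposes as an RDMA-accessible region,
 * the previous hop can patch a pre-posted WQE with an ordinary RDMA WRITE. */
#include <infiniband/verbs.h>
#include <stdint.h>

/* Assumed layout of the fields to patch in one pre-posted WQE. */
struct wqe_update {
    uint64_t remote_addr;   /* where the next hop should write */
    uint32_t rkey;          /* next hop's rkey                 */
    uint32_t length;        /* how many bytes to forward       */
};

void patch_remote_wqe(struct ibv_qp *qp_to_replica,
                      struct wqe_update *upd, uint32_t upd_lkey,
                      uint64_t replica_wqe_addr, uint32_t replica_wqe_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)upd,
        .length = sizeof(*upd),
        .lkey   = upd_lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,
    };
    /* Target: the slot inside the replica's registered WQ region holding the
     * pre-posted forwarding WQE (address learned from the setup metadata). */
    wr.wr.rdma.remote_addr = replica_wqe_addr;
    wr.wr.rdma.rkey        = replica_wqe_rkey;

    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp_to_replica, &wr, &bad_wr);
}
```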
 
 - Integration with other RDMA operations to support ACID
 
4-2 Detailed Primitives Design
- Using the ideas above, four APIs are built to provide the operations replicated transactions need (a combined usage sketch follows the list of primitives)
- Group write (gWRITE)
    - Allows the client to write data to memory regions of a group of remote nodes without involving their CPUs
    - Used for writing the log (replicated transaction log management)
- Group compare-and-swap (gCAS):
    - Each replica compares the given value and decides whether to swap it
    - Provides a group lock (a single-target CAS sketch follows)
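gCAS generalizes the standard single-target RDMA compare-and-swap to a whole replica group. For reference, a minimal single-target CAS that tries to acquire a remote lock word, using standard libibverbs (the group chaining itself is HyperLoop-specific and not shown):

```c
/* Single-target RDMA compare-and-swap: try to acquire a remote 64-bit lock
 * word by swapping 0 -> 1.  The NIC returns the previous value into old_val;
 * if it was 0, the lock was acquired. */
#include <infiniband/verbs.h>
#include <stdint.h>

int try_remote_lock(struct ibv_qp *qp, uint64_t *old_val, uint32_t old_lkey,
                    uint64_t lock_addr, uint32_t lock_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)old_val,   /* where the NIC writes the old value */
        .length = sizeof(*old_val),
        .lkey   = old_lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_ATOMIC_CMP_AND_SWP,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.atomic.remote_addr = lock_addr;   /* 8-byte aligned lock word */
    wr.wr.atomic.rkey        = lock_rkey;
    wr.wr.atomic.compare_add = 0;           /* expected: unlocked       */
    wr.wr.atomic.swap        = 1;           /* new value: locked        */

    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr); /* completion polled elsewhere */
}
```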
        
- Group memory copy (gMEMCPY)
    - Copies data of a given size from src_offset to dest_offset on all nodes; the NICs copy the data from the log region to the persistent data region without involving the CPUs
    - Used for remote log processing
        
- Group RDMA flush (gFLUSH)
    - Supports durability at the "NIC level"
    - Flushes the data in the NIC cache to NVM (SSD)
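Putting the four primitives together, a client-side replicated transaction could look roughly like the sketch below. Only the primitive names (gWRITE, gCAS, gMEMCPY, gFLUSH) come from the paper; the C types and signatures are hypothetical placeholders, and the step ordering follows the transaction flow noted in Section 2:

```c
/* Hypothetical client-side composition of the four HyperLoop primitives
 * for one replicated transaction (signatures are illustrative only). */
#include <stdint.h>
#include <stddef.h>

typedef struct hl_group hl_group;       /* a chain of replicas (placeholder) */

int gWRITE (hl_group *g, uint64_t log_off, const void *buf, size_t len);
int gCAS   (hl_group *g, uint64_t lock_off, uint64_t expect, uint64_t newv);
int gMEMCPY(hl_group *g, uint64_t src_off, uint64_t dst_off, size_t len);
int gFLUSH (hl_group *g);

int replicated_update(hl_group *g, const void *log_rec, size_t len,
                      uint64_t log_off, uint64_t data_off, uint64_t lock_off)
{
    /* 1. Replicate the log record into every replica's log region. */
    gWRITE(g, log_off, log_rec, len);

    /* 2. Acquire the group lock on every replica (0 -> 1); assume 0 means
     *    all replicas swapped successfully. */
    if (gCAS(g, lock_off, 0, 1) != 0)
        return -1;

    /* 3. Commit: make the log durable on every replica's NVM. */
    gFLUSH(g);

    /* 4. Apply: NICs copy the log into the persistent data region. */
    gMEMCPY(g, log_off, data_off, len);

    /* 5. Release the group lock (1 -> 0). */
    gCAS(g, lock_off, 1, 0);
    return 0;
}
```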
 
 
5 Case Study
Skip