Interpreting Deep Learning-Based Networking Systems

1 Background & Issues

DNN models涉及數百萬個以上的參數與neurons，當運用在網路系統上時，對網管人員來說像是黑盒子，會遇到以下問題
- DNN架構難以理解，造成結果不如預期時很難debug
- 非常難針對網路狀態做手動的即時調整policy
- 太耗運算資源，deploy時要考慮硬體能力，decision latency也很高
因此我們需要一套framework可以interpret網路系統中DNN model，方便網管人員動態調整網路traffic

2 Motivation (現有解法的不足/設計上的挑戰)

現在也是有幾套針對DL-based network system的interpretion tool，但它們在幾個面向上仍有所不足

不夠General，無法通用於所有Network systems
- 現有的詮釋工具只能適用於某一種DL模型
  - 像是LEMMA這套工具，只能用於詮釋RNN-based network system
- 但網路系統可能是FNN, GNN, RNN 等等
詮釋重點放在理解DNN模型的內部機制
- e.g. 找出內部neurons會被哪些input features激發
- 但就網管人員的角度，詮釋重點是得知input和output之間的關係
Non-standard state and action spaces
- 現有較成熟的DNN interpretion tools，主要針對圖片或句子這種well-structured vector input
- 網路系統的DL model，input可能是topology，output可能是path，encode方式仍不夠成熟，並不適合將上述工具搬來用

3 Introduction

Metis: 一套framework，可以將各種DL-based networking system詮釋成人類可理解的control policy
不受限於特定DL模型:
- 將Network system分成兩類，Local System和Global System
Local System
- 只在一台機器上收集資訊並下決定
- 透過teacher-student training(知識蒸餾)，將DNN模型轉化成decision tree(rule-based)
  - 設計特殊剪枝方法，將decision tree的leaf數壓在人類能理解的範圍:
    - Resample: DNN policy的每一組data(input, output)重要性並不同，這邊採用RL的advange function決定每組data的sample機率
    - 也能依據網管人員的需求，動態決定剪枝程度
Global System
- 在整個網路拓樸上收集資訊並下決定
- 將整個問題formulate成一張hypergraph
  - 因為網路問題通常有graph input
  - 或是建立2 variables mapping (e.g. resource allocation)
- Metis能將原先的網路問題，轉化成計算hypergraph上重要的component的最佳化問題
  - e.g. 計算SDN上的最佳路徑
    - Vertice: Physical Link
    - Hyperedge: Path(src-dst pair)
    - 新問題: 哪些vertice-edge connection，對整個網路拓樸的最佳routing解是有重要影響的

4 Decision Tree Interpretation

用Decision Tree的原因
1. Local Network System通常用rule-based policies做決定，這很像decision tree做決定的邏輯
  - e.g. switch的flow scheduling通常是以一些forwarding rules(SJF)決定順序
2. 可以表示很複雜的policy，雖然leaf會變很多個
3. Lightweight, 占用運算資源少，latency較低 => 可以deploy到SmartNIC、switch data plane
Conversion Methodlogy
1. Traces collection
  - 不能把DNN的所有(state, action)都直接拿去train decision tree
    - 太花時間
    - 不符合target policy的狀態分布
    - 每組(state, action)對於要優化的target policy有不同程度的影響
  - ? Metis follows the trajectories所有(state, action) generated by the teacher DNNs
  - ? Let DNN policy take over the control on the deviated trajectory, and re-collect (state, action) pair to refine the conversion training
2. Resampling
  - decision tree的優化目標是single action的accuracy，再用同個方式對待其他action
  - 現存的DL-based Network system通常以RL的方式優化policy為主，而不是優化各action
    - 各組(state, action)的advantage function代表其對優化目標的重要程度
  - 可以看出RL和decision tree的優化目標不同，故不適合將每一組(state, action)都平等對待，通通丟到decision tree裡做training
  - paper認為應該要依據各組(state, action)的重要程度，決定這組(state, action)要被resample起來，丟到decision tree裡train的機率
3. Pruning
  - 利用cost complexity pruning(CCP)，透過一個cost function決定當前node的sub-tree大小，藉此決定要不要保留該Node
  - 對於較大的網路環境，改成使用regression tree
4. Deployment
  - 可以直接deploy轉換後的decision tree，performance diff. < 2%

5 HYPERGRAPH INTERPRETATIONS

Hypergraph:
- Vertex, Hyperedge各表示一種變數
- e.g. SDN routing optimization
  - vertex: Physical Link
  - hyperedge: Path(src-dst pairs)
適用範圍, 有其中一種就OK
1. Graph-structured inputs or outputs
2. Bivariate mapping
  - Resource allocation problem (reource-request)
  - e.g. NF placement, UE-Base station coverage
目標
- 找出hypergraph哪些vertex-hyperedge connections, 會對優化目標有重要影響
- e.g. SDN routing optimization當中，哪些(link, path) connection對整體routing result有重要影響
符號定義
- Incidence matrix I:
  - 表示vertex-hyperedge connections連線狀況, 只有0,1
- Fractional Incidence matric W:
  - 表示vertex-hyperedge connections連線重要程度， 0~1之間小數
Formulation
- 目標
  - 找到一組W，使得loss function最小
- Performance degradation $D(Y_W, Y_I)$
  - $Y_W$: input不變，原先DL model的output
  - $Y_I$: input用W matrix masked，再丟到DL model後的output
  - $D(Y_W, Y_I)$: 計算兩者間的similarity，依據output類型用不同公式評估
    - 連續: MSE
    - 離散: KL-divergence (e.g. sequences of routing decisions)
- Interpretation conciseness $ W $
  - penalty, 避免取太多個critical connections
  - 如果取W=I的話，雖然 $D(Y_W, Y_I)$會很小，但這一項會變很大
- Determinism $H(W)$
  - 希望W matrix的結果是determistic的，$W_{ev}$的值盡量接近0或1
  - 計算$W_{ev}$的entropy，entropy越小表示訊息含量越低，$W_{ev}$也就越determistic(接近0 or 1)
- Hyperparameters $𝜆_1, 𝜆_2$
  - 供網管人員需求動態調整
  - $𝜆_1$調高 => penalty變高 => critical conn.減少
  - $𝜆_2$調高 => $W_{ev}$越趨近0或1 => critical conn.變多

6 EXPERIMENTS

System interpretations
Guide for model design
Enabling debuggability
Lightweight deployment
Ad-hoc adjustments

7 Discussion & Conclution

Architecture

PREVIOUSTEA: Enabling State-Intensive Network Functions on Programmable Switches

NEXTCaching with Delayed Hits