大同 Work Notes: PCI Express: Uncorrectable AER

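Previously, I wrote "A Brief Look at PCI Express: Advanced Error Reporting (AER)" (淺談PCI Express: Advanced Error Reporting(AER)), which introduced the whole Advanced Error Reporting (AER) mechanism. That article noted that AER errors fall into two main groups: Uncorrectable Errors and Correctable Errors.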

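This post describes an AER case I recently hit on a machine: Surprise Down, one of the Uncorrectable Errors. The analysis relies mainly on the LTSSM state transitions captured with a LeCroy PCIe Analyzer, which eventually led to the root cause.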

■ Issue Description & Reproduction

1. Install two M.2 NVMe SSDs in the machine's M.2 slots.

2. Combine the two M.2 NVMe SSDs into a RAID 1 volume.

3. Run fio against the volume; one of the M.2 SSDs then disappears.

The kernel log reports NMIs along with "strange power saving mode" messages:

[ 476.609222] Uhhuh. NMI received for unknown reason 3d on CPU 0.

[ 476.609223] Do you have a strange power saving mode enabled?

[ 476.609223] Dazed and confused, but trying to continue

[ 476.609224] Uhhuh. NMI received for unknown reason 3d on CPU 0.

[ 476.609224] Do you have a strange power saving mode enabled?

[ 476.609225] Dazed and confused, but trying to continue

[ 477.411211] atlantic: link change old 1000 new 10000

The M.2 NVMe SSD downtrains to PCIe Gen2, and the NVMe driver's probe fails, so the driver removes the NVMe device. The log is as follows:

[ 506.723242] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10

[ 506.748410] nvme nvme1: change state from 1 to 3

[ 506.761228] nvme 0000:03:00.0: enabling device (0000 -> 0002)

[ 506.767226] nvme nvme1: Removing after probe failure status: -19

[ 506.779217] nvme1n1: detected capacity change from 250059350016 to 0

[ 506.785872] print_req_error: I/O error, dev nvme1n1, sector 1060256

[ 506.785877] nvme nvme1: nvme_remove(2551): pci function 0000:03:00.0 kref:3

[ 506.785880] print_req_error: I/O error, dev nvme1n1, sector 143863808

[ 506.785881] nvme nvme1: change state from 3 to 5

The Memory BAR in the device's PCI Configuration Space gets cleared (BAR0 at offset 0x10 changes from 0xdfc00004 to 0x00000004):

Before the failure

$ lspci -s 03:00.0 -xxx

03:00.0 Non-Volatile memory controller

00: 4d 14 08 a8 46 05 10 00 00 02 08 01 10 00 00 00

10: 04 00 c0 df 00 00 00 00 00 00 00 00 00 00 00 00

20: 00 00 00 00 00 00 00 00 00 00 00 00 4d 14 01 a8

30: 00 00 00 00 40 00 00 00 00 00 00 00 0a 01 00 00

After the failure

$ lspci -s 03:00.0 -xxx

03:00.0 Non-Volatile memory controller

00: 4d 14 08 a8 02 00 10 00 00 02 08 01 00 00 00 00

10: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

20: 00 00 00 00 00 00 00 00 00 00 00 00 4d 14 01 a8

30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
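
The two dumps can also be compared field by field instead of by eye. The following is a minimal sketch, not part of the original debugging session: it parses the hex rows shown above and reports which standard Type 0 header fields changed. The helper names (`parse_dump`, `read_field`, `diff_fields`) and the choice of fields in `FIELDS` are illustrative assumptions.

```python
# Hex rows copied from the two `lspci -s 03:00.0 -xxx` dumps above.
BEFORE = """\
00: 4d 14 08 a8 46 05 10 00 00 02 08 01 10 00 00 00
10: 04 00 c0 df 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4d 14 01 a8
30: 00 00 00 00 40 00 00 00 00 00 00 00 0a 01 00 00
"""
AFTER = """\
00: 4d 14 08 a8 02 00 10 00 00 02 08 01 00 00 00 00
10: 04 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 4d 14 01 a8
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
"""

# (offset, size in bytes, name) of a few Type 0 header fields.
# This selection is an assumption made for the example.
FIELDS = [
    (0x04, 2, "Command"),
    (0x0C, 1, "Cache Line Size"),
    (0x10, 4, "BAR0"),
    (0x3C, 1, "Interrupt Line"),
]


def parse_dump(text):
    """Turn 'offset: b0 b1 ...' rows into a flat bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        _, _, hex_bytes = line.partition(":")
        data.extend(int(b, 16) for b in hex_bytes.split())
    return bytes(data)


def read_field(cfg, offset, size):
    """Configuration space is little-endian."""
    return int.from_bytes(cfg[offset:offset + size], "little")


def diff_fields(before, after):
    """Return {field name: (old, new)} for every field that changed."""
    changes = {}
    for offset, size, name in FIELDS:
        old = read_field(before, offset, size)
        new = read_field(after, offset, size)
        if old != new:
            changes[name] = (old, new)
    return changes


changes = diff_fields(parse_dump(BEFORE), parse_dump(AFTER))
for name, (old, new) in changes.items():
    print(f"{name}: {old:#x} -> {new:#x}")
```

On these dumps, all four listed fields differ. BAR0 keeps only its low type bits (0x4, a 64-bit memory BAR) while the base address is gone, and the Command register drops to 0x0002, which matches the "enabling device (0000 -> 0002)" line in the NVMe log.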

The Root Port's AER status registers show that an Uncorrectable Error of type "Surprise Down" (SDES) and a Correctable Error of type "Receiver Error" (RxErr) have occurred:

$ lspci -s 00:0a.0 -vvvv

00:0a.0 PCI bridge: Intel Corporation Device 19a5 (rev 11) (prog-if 00 [Normal decode])

Capabilities: [100 v1] Advanced Error Reporting

UESta: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol+

UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-

UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-

CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
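
When many ports have to be checked, the `+`/`-` flags can be parsed rather than read manually. The sketch below is only an illustration under stated assumptions: it parses the five AER lines shown above, lists the raised status bits, and rebuilds the raw Uncorrectable Error Status value. The function names are hypothetical, and `UE_BITS` covers only the abbreviations that appear in this output, using the bit positions of the AER Uncorrectable Error Status register.

```python
import re

# AER lines copied from the `lspci -s 00:0a.0 -vvvv` output above.
LSPCI_AER = """\
UESta: DLP- SDES+ TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol+
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
"""

# Bit positions in the Uncorrectable Error Status register for the
# abbreviations printed above (newer bits are left out on purpose).
UE_BITS = {
    "DLP": 4, "SDES": 5, "TLP": 12, "FCP": 13, "CmpltTO": 14,
    "CmpltAbrt": 15, "UnxCmplt": 16, "RxOF": 17, "MalfTLP": 18,
    "ECRC": 19, "UnsupReq": 20, "ACSViol": 21,
}


def parse_flags(text):
    """Map each register name to {flag name: True if '+', False if '-'}."""
    registers = {}
    for line in text.strip().splitlines():
        name, _, rest = line.partition(":")
        registers[name.strip()] = {
            m.group(1): m.group(2) == "+"
            for m in re.finditer(r"(\w+)([+-])", rest)
        }
    return registers


def raised(flags):
    """Names of the flags that are set, in lspci order."""
    return [name for name, is_set in flags.items() if is_set]


def raw_value(flags, bit_positions):
    """Rebuild the register value from the raised flags."""
    return sum(1 << bit_positions[name] for name in raised(flags))


regs = parse_flags(LSPCI_AER)
ue_raised = raised(regs["UESta"])
ce_raised = raised(regs["CESta"])
# A raised error is reported as fatal when its severity bit is set
# and it is not masked.
fatal = [n for n in ue_raised if regs["UESvrt"][n] and not regs["UEMsk"][n]]
ue_raw = raw_value(regs["UESta"], UE_BITS)

print("UESta raised:", ue_raised, f"raw={ue_raw:#010x}")
print("CESta raised:", ce_raised)
print("fatal:", fatal)
```

For this Root Port the sketch reports SDES and ACSViol as raised in UESta, RxErr in CESta, and only SDES as fatal, since UESvrt marks ACSViol as non-fatal.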

■ LeCroy PCIe Analyzer

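Figures 1 and 2 show the LTSSM state transitions between the Root Port and the M.2 SSD as captured by the analyzer. In the traces, Downstream refers to the PCIe Root Port and Upstream refers to the M.2 NVMe SSD.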

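Starting with the LTSSM trace in Figure 1: at time (1), the M.2 SSD sends an abnormal TLP at Gen3 speed (8.0 GT/s). On receiving it, the Root Port decides the link has a problem and, at time (2), enters the Recovery state to redo the handshake.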

Figure 1.

Next, as Figure 2 shows, at time (3) the M.2 SSD suddenly falls back from Gen3 to Gen1 and starts sending Training Sequences (TS1, TS2) in Gen1 format, while the Root Port is still sending its Training Sequences at Gen3.

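Because the two sides are no longer sending Training Sequences at the same speed, the Root Port can no longer understand what the device is saying, so at time (4) it takes the link back to the Polling state to retrain it.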

Figure 2.

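Time (3) already hints that a voltage change on the M.2 SSD may be what drove the link back to the Polling state, so I measured the supply voltage while the problem occurred. As Figure 3 shows, once the fio command starts, the M.2 SSD's 3.3V rail drops from 3.3V to 2.76V (about 16%). This brownout makes the M.2 SSD momentarily lose power, send the malformed TLP, and fall back to the Polling state.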

Figure 3.
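
The size of the droop is simple arithmetic, sketched below. The ±5% supply tolerance is an assumption for illustration only; the actual limit should be taken from the M.2 specification or the SSD datasheet. The function names are hypothetical.

```python
NOMINAL_V = 3.3    # nominal 3.3V rail
MEASURED_V = 2.76  # lowest value measured while fio was running
TOLERANCE = 0.05   # ASSUMED +/-5% supply tolerance, check the datasheet


def droop_percent(nominal, measured):
    """Voltage drop as a percentage of the nominal rail voltage."""
    return (nominal - measured) / nominal * 100.0


def within_tolerance(nominal, measured, tolerance):
    """True if the measured voltage stays inside nominal +/- tolerance."""
    return abs(measured - nominal) <= nominal * tolerance


droop = droop_percent(NOMINAL_V, MEASURED_V)
min_allowed = NOMINAL_V * (1 - TOLERANCE)
in_spec = within_tolerance(NOMINAL_V, MEASURED_V, TOLERANCE)

print(f"droop: {droop:.1f}%")
print(f"minimum allowed under the assumed tolerance: {min_allowed:.3f} V")
print(f"within tolerance: {in_spec}")
```

Under that assumed tolerance the rail would have to stay above roughly 3.135V, so a dip to 2.76V is far outside the window.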

■ Conclusion

Judging from the LTSSM and the AER registers alone, if the Root Port's LTSSM follows the path seen up to time (4), Gen3 L0 State -> Gen3 Recovery State -> Gen1 Polling State, the event can generally be treated as a Surprise Down Uncorrectable Error.
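
This rule of thumb can be expressed as a pattern match over an exported trace. The sketch below assumes a hypothetical representation in which each entry is a `(speed, state)` tuple. This is not the LeCroy export format, and the sample trace is made up to mirror the path described above.

```python
# The transition path named in the conclusion.
SURPRISE_DOWN_PATH = [("Gen3", "L0"), ("Gen3", "Recovery"), ("Gen1", "Polling")]


def collapse(trace):
    """Drop consecutive duplicates so repeated samples do not hide the path."""
    steps = []
    for step in trace:
        if not steps or steps[-1] != step:
            steps.append(step)
    return steps


def has_surprise_down_path(trace, path=SURPRISE_DOWN_PATH):
    """True if the collapsed trace contains the path as a contiguous run."""
    steps = collapse(trace)
    n = len(path)
    return any(steps[i:i + n] == path for i in range(len(steps) - n + 1))


# Hypothetical Root Port trace, for illustration only.
sample_trace = [
    ("Gen3", "L0"), ("Gen3", "L0"),
    ("Gen3", "Recovery"), ("Gen3", "Recovery"),
    ("Gen1", "Polling"), ("Gen1", "Configuration"),
]
healthy_trace = [("Gen3", "L0"), ("Gen3", "Recovery"), ("Gen3", "L0")]

print(has_surprise_down_path(sample_trace))
print(has_surprise_down_path(healthy_trace))
```

A match is only a hint; the AER status bits (SDES in UESta) should still be checked to confirm.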

6 comments:

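L.J., November 17, 2021, 7:42 PM
Hello,
I have also been reading the PCIe spec recently (I am using PCIe 4.0).
Following your line of reasoning,
I ran into a few things I do not quite understand, so I would like to ask about them.

Which error does TLP- in UESta: refer to?
In the spec, the Uncorrectable errors related to TLPs are:
1. MC Blocked TLP
2. TLP Prefix Blocked
3. Poisoned TLP Received
4. Poisoned TLP Egress Blocked

I do not know which one is correct.

My next question:
which error does AdvNonFatalErr- in CESta: refer to?
I do not see any Correctable Error whose name starts with A;
does this mean Advisory Non-Fatal Error?

My last question:
the Uncorrectable Errors include AtomicOp Egress Blocked.
Is it missing from UESta:?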
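大同 Work Notes, November 17, 2021, 9:02 PM
Hi L.J.

lspci uses abbreviations for readability, but it does list the fields in order, from bit 0 to bit 31.
Question 1: TLP- in UESta stands for bit 12, Poisoned TLP Received Status, which is the third error type you listed.

Question 2: My M.2 drive is a Gen3 device, so UESta and CESta should be read from a Gen3 point of view. Errors 1, 2, and 4 in your list seem to have been introduced only in Gen4.
1. MC Blocked TLP
2. TLP Prefix Blocked
4. Poisoned TLP Egress Blocked

Question 3: Yes, AdvNonFatalErr- in CESta is Advisory Non-Fatal Error.

Question 4: As with Question 2, AtomicOp Egress Blocked also seems to be an error type introduced only in Gen4.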
Jheng, July 26, 2022, 8:08 PM
Hello,
May I ask how this issue was finally resolved?

大同 Work Notes

Friday, March 5, 2021

PCI Express: Uncorrectable AER - Surprise Down Case Study
