大同 Work Notes: PCI Express Fundamentals: Advanced Error Reporting (AER)

Monday, May 25, 2020

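Advanced Error Reporting (AER) is PCIe's more robust error reporting mechanism. Section 6.2, "Error Signaling and Logging", of the PCI Express® Base Specification Revision 3.0 describes the AER logging and reporting mechanism in detail. Because I often run into AER error messages in the Linux kernel log at work, I went back through the spec and wrote up how AER works so that I can look it up later. What follows is my own understanding of the spec; corrections are welcome.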

■ Error Classification

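PCIe errors fall into two main classes, Uncorrectable Errors and Correctable Errors. Uncorrectable Errors are further divided into Fatal and Non-Fatal.

Correctable Errors: Errors that hardware can recover from on its own, without software intervention. For example, when a Receiver receives a TLP, the Data Link Layer performs an LCRC check. If the check fails, the Receiver sends a Nak DLLP to notify the Transmitter, and the Transmitter then resends the failed TLP. The PCIe link is thus recovered through the retry mechanism.
Uncorrectable Errors: These are divided into Fatal and Non-Fatal. A Fatal Error means hardware cannot recover on its own; the link must be reset to its initial state and go through link training again. A Non-Fatal Error is reported to system software, which then has a chance to recover from the error without resetting the link.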

■ Error Signaled

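Errors can be signaled in three ways: (1) Completion status, (2) Error Message, and (3) Error Forwarding. Since the rest of this post mostly covers the second method, methods 1 and 3 are not discussed further.

An Error Message is sent as a Message Request, as defined in the Transaction Layer chapter (see the figure below).

There are three Error Messages, ERR_COR, ERR_NONFATAL, and ERR_FATAL, corresponding to Correctable Error, Non-Fatal Uncorrectable Error, and Fatal Uncorrectable Error respectively. How are these messages told apart? Mainly by the Message Code field and the routing field. As the figure below shows, each error message has a fixed code encoding, and the routing field is always 000b, meaning these messages are all Routed to Root Complex.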
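
To make the Message Code and routing check concrete, here is a minimal Python sketch. The three code values are the Message Code encodings the spec assigns to the error Messages; the function name and the way the two fields are passed in are my own illustration, not something from the spec or from any tool.

```python
# Message Code encodings of the three error Messages.
ERROR_MESSAGE_CODES = {
    0x30: "ERR_COR",       # 0011 0000b
    0x31: "ERR_NONFATAL",  # 0011 0001b
    0x33: "ERR_FATAL",     # 0011 0011b
}

# Routing field value 000b means "Routed to Root Complex".
ROUTED_TO_ROOT_COMPLEX = 0b000


def classify_error_message(message_code, routing):
    """Return the error Message name, or None if the fields do not
    describe an error Message (hypothetical helper for illustration)."""
    if routing != ROUTED_TO_ROOT_COMPLEX:
        return None
    return ERROR_MESSAGE_CODES.get(message_code)


print(classify_error_message(0x31, 0b000))
```

When reading a protocol analyzer trace, the same two fields are what identify an error Message among the other Message Requests.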

■ Error logging

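If a device does not support the AER capability, a detected error is recorded only in the Device Status register. If the device does support the AER capability, the error is additionally recorded in the Uncorrectable Error Status register or the Correctable Error Status register. These two status registers let software identify the error type and severity more precisely. Each has a corresponding Mask register, which gives finer control over whether each individual error type is reported upstream with an Error Message.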
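
As a rough illustration of how a status register and its mask register work together, the Python sketch below decodes a Correctable Error Status value. The bit positions are those of the Correctable Error Status register (only a subset is listed); the function name and the sample values are assumptions for the example.

```python
# Subset of the Correctable Error Status register bit assignments.
CORRECTABLE_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_correctable(status, mask):
    """Return (logged, reportable) error names.

    'logged' are all status bits that are set; 'reportable' are the
    set bits whose mask bit is clear, i.e. the ones still eligible to
    be signaled with an Error Message.
    """
    logged = [name for bit, name in CORRECTABLE_ERROR_BITS.items()
              if (status >> bit) & 1]
    unmasked = status & ~mask
    reportable = [name for bit, name in CORRECTABLE_ERROR_BITS.items()
                  if (unmasked >> bit) & 1]
    return logged, reportable


# Sample values chosen for illustration only.
print(decode_correctable(0x00002001, 0x00002000))
```

With these sample values both errors are logged, but only the Receiver Error remains reportable because bit 13 is masked.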

■ Error Source Identification

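Devices such as a Root Port or a Root Complex Event Collector collect the errors generated by the devices beneath them, so they must implement the Error Source Identification register to record the source ID (composed of the bus, device, and function numbers) of those devices. As shown in the figure below, bits 15:0 hold the source ID of the first correctable error received by the Root Port (or Root Complex Event Collector), and bits 31:16 hold the source ID of the first uncorrectable error received.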
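
The register layout above can be unpacked as follows. This is a small Python sketch; the function names and the sample register value are made up for illustration. A source ID is a 16-bit Requester ID: bus in bits 15:8, device in bits 7:3, function in bits 2:0.

```python
def format_bdf(source_id):
    """Format a 16-bit source ID as bus:device.function, lspci style."""
    bus = (source_id >> 8) & 0xFF
    device = (source_id >> 3) & 0x1F
    function = source_id & 0x7
    return f"{bus:02x}:{device:02x}.{function}"


def decode_error_source_identification(reg):
    """Split the 32-bit register into its two source IDs."""
    return {
        "correctable": format_bdf(reg & 0xFFFF),            # bits 15:0
        "uncorrectable": format_bdf((reg >> 16) & 0xFFFF),  # bits 31:16
    }


# Hypothetical register value, not taken from a real system.
print(decode_error_source_identification(0x030000E1))
```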

In addition, a Root Port or Root Complex Event Collector that supports AER must implement the Root Error Status register, which indicates whether a correctable or uncorrectable error has been received from a device or generated internally.
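
For reference, a Python sketch that decodes a Root Error Status value is shown below. The bit assignments follow the Root Error Status register definition; the function name and the sample value are assumptions for the example.

```python
ROOT_ERROR_STATUS_FLAGS = {
    0: "ERR_COR Received",
    1: "Multiple ERR_COR Received",
    2: "ERR_FATAL/NONFATAL Received",
    3: "Multiple ERR_FATAL/NONFATAL Received",
    4: "First Uncorrectable Fatal",
    5: "Non-Fatal Error Messages Received",
    6: "Fatal Error Messages Received",
}


def decode_root_error_status(reg):
    """Return (list of set flags, Advanced Error Interrupt Message Number)."""
    flags = [name for bit, name in ROOT_ERROR_STATUS_FLAGS.items()
             if (reg >> bit) & 1]
    message_number = (reg >> 27) & 0x1F  # bits 31:27
    return flags, message_number


# Hypothetical value: one ERR_COR and one ERR_NONFATAL were received.
print(decode_root_error_status(0x08000025))
```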

■ Interrupt Generation

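The Root Error Command register controls whether the reporting of Uncorrectable Errors and Correctable Errors generates an interrupt.

An interrupt can be generated in two ways: through the I/O APIC (INTx), or through MSI/MSI-X.

To generate an interrupt through the I/O APIC, two conditions must be met:

(1) The Interrupt Disable bit of the Command register in PCI Configuration Space must be 0 (INTx enabled).
(2) At least one reporting enable bit in the Root Error Command register is set to 1, so that the corresponding error message received from a downstream device is allowed to raise an interrupt.

To generate an interrupt through MSI/MSI-X, two conditions must also be met:

(1) The interrupt vector in use is not masked. The MSI/MSI-X vector used is indicated by the Advanced Error Interrupt Message Number field, Root Error Status register[31:27].
(2) At least one reporting enable bit in the Root Error Command register is set to 1.

Once both conditions for either method are met, the Root Port generates an interrupt to the CPU when an error is reported, and the CPU then runs the error-reporting ISR (Interrupt Service Routine).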
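
The two sets of conditions can be written down as a simplified model in Python. This is only a sketch under my own assumptions: it pairs each Root Error Command enable bit with the matching "received" bit in Root Error Status, treats the result as a level condition, and ignores details such as the MSI Enable bit and the edge-triggered nature of MSI. All names are hypothetical.

```python
CMD_INTERRUPT_DISABLE = 1 << 10  # PCI Command register, Interrupt Disable

# (Root Error Command enable bit, Root Error Status "received" bit)
ENABLE_AND_STATUS = (
    (1 << 0, 1 << 0),  # Correctable Error Reporting Enable / ERR_COR Received
    (1 << 1, 1 << 5),  # Non-Fatal Enable / Non-Fatal Error Messages Received
    (1 << 2, 1 << 6),  # Fatal Enable / Fatal Error Messages Received
)


def error_interrupt_condition(root_err_cmd, root_err_status):
    """True if some received error class has its interrupt enable set."""
    return any(
        bool(root_err_cmd & enable) and bool(root_err_status & received)
        for enable, received in ENABLE_AND_STATUS
    )


def intx_asserted(command, root_err_cmd, root_err_status):
    """INTx path: Interrupt Disable must be 0."""
    if command & CMD_INTERRUPT_DISABLE:
        return False
    return error_interrupt_condition(root_err_cmd, root_err_status)


def msi_sent(vector_masked, root_err_cmd, root_err_status):
    """MSI/MSI-X path: the vector in use must not be masked."""
    if vector_masked:
        return False
    return error_interrupt_condition(root_err_cmd, root_err_status)


print(intx_asserted(0x0000, 0b001, 0x1), msi_sent(True, 0b001, 0x1))
```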

■ Advisory Non-Fatal Error Logging

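In some situations, the device that detects an error is not in a good position to decide whether the error is unrecoverable. For example, when system software issues a configuration read to a device that does not exist, the request gets a Completion (TLP) with Unsupported Request (UR) status, which already tells system software about the error. There is no need to additionally send an ERR_NONFATAL message for this case, and on some platforms sending ERR_NONFATAL can bring the system down.

The PCIe spec therefore defines the Advisory Non-Fatal Error logging mechanism, which lets certain ERR_NONFATAL cases be signaled with ERR_COR instead.

Three conditions must hold for an error to be handled as an Advisory Non-Fatal Error:

(1) The severity of the Uncorrectable Error must be non-fatal, i.e. the corresponding bit in the Uncorrectable Error Severity register must be 0 (see the figure below).
(2) The error must match one of the cases in section 6.2.3.2.4, "Advisory Non-Fatal Error Cases". There are too many cases to cover each one here.
(3) The Advisory Non-Fatal Error Status bit in the Correctable Error Status register is set to 1, indicating that an Advisory Non-Fatal Error occurred, and the corresponding bit in the Correctable Error Mask register is not set.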
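
The three conditions boil down to a small decision, sketched here in Python. The function and its parameters are hypothetical; in particular, whether an error matches one of the spec's advisory cases is passed in as a flag rather than derived, and the reporting enable bits in Device Control are left out.

```python
def select_error_message(severity_fatal, advisory_case, advisory_masked):
    """Pick the Message used to signal an uncorrectable error.

    severity_fatal  -- the error's bit in Uncorrectable Error Severity is 1
    advisory_case   -- the error matches one of the advisory non-fatal cases
    advisory_masked -- Advisory Non-Fatal Error bit in Correctable Error Mask
    Returns the Message name, or None if nothing is sent.
    """
    if severity_fatal:
        return "ERR_FATAL"      # a fatal severity is never advisory
    if advisory_case:
        return None if advisory_masked else "ERR_COR"
    return "ERR_NONFATAL"


print(select_error_message(False, True, False))
```

Note that an error programmed as fatal is never downgraded, even when it matches one of the advisory cases.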

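■ Device Error Signaling and Logging Flowchart:

The figure below shows the whole error reporting mechanism. It is described in three parts.

1. The area inside the blue box: if the device does not support the AER capability, the error is recorded only in the Device Status register.

2. The red arrows in the figure: the Uncorrectable Error flow

(1) Look up whether the error's bit in the Uncorrectable Error Severity register is set to fatal or non-fatal, and set the Fatal Error Detected or Non-Fatal Error Detected bit in the Device Status register to 1 to record that the error occurred.

(2) If the error matches one of the cases in the advisory non-fatal error section, report it with an ERR_COR message instead of ERR_NONFATAL.
(3) If the error is an Unsupported Request (UR), set the Unsupported Request Detected bit in the Device Status register to 1.
(4) Set the corresponding error status bit in the Uncorrectable Error Status register.
(5) If the corresponding bit in the Uncorrectable Error Mask register is set, no error message TLP is reported upstream.
(6) Several enable bits gate the reporting mechanism at this step:

First, the Unsupported Request Reporting Enable bit in the Device Control register. This bit only controls whether UR is reported.
Second, the SERR# Enable bit in the Command register.
Third, the Fatal Error Reporting Enable and Non-Fatal Error Reporting Enable bits in the Device Control register. Per the Command register description in the spec, a fatal or non-fatal error is reported if it is enabled either through SERR# Enable or through the corresponding Device Control bit.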
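
Steps (5) and (6) of the uncorrectable flow can be sketched in Python as follows. This assumes, per the SERR# Enable description in the spec's Command register table, that a fatal or non-fatal error is reported when either SERR# Enable or the matching Device Control enable bit is set. The UR-specific gate and the advisory path are left out to keep the sketch short, and the function name is hypothetical.

```python
CMD_SERR_ENABLE = 1 << 8      # Command register, SERR# Enable
DEVCTL_NONFATAL_EN = 1 << 1   # Device Control, Non-Fatal Error Reporting Enable
DEVCTL_FATAL_EN = 1 << 2      # Device Control, Fatal Error Reporting Enable


def uncorrectable_message(command, devctl, masked, fatal):
    """Return the Message sent for a (non-advisory) uncorrectable error.

    masked -- the error's bit in the Uncorrectable Error Mask register is set
    fatal  -- the error's bit in the Uncorrectable Error Severity register is set
    """
    if masked:
        return None  # step (5): masked errors are not signaled
    serr = bool(command & CMD_SERR_ENABLE)
    if fatal:
        return "ERR_FATAL" if serr or devctl & DEVCTL_FATAL_EN else None
    return "ERR_NONFATAL" if serr or devctl & DEVCTL_NONFATAL_EN else None


print(uncorrectable_message(0x0000, DEVCTL_NONFATAL_EN, False, False))
```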

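3. The orange arrows in the figure: the Correctable Error flow

(1) If the device does not support the AER capability, only the Correctable Error Detected bit in the Device Status register is set to record that an error was detected.
(2) As the advisory non-fatal error section notes, some cases report UR as an advisory non-fatal error. For this kind of error, the Unsupported Request Detected bit is set in the Device Status register.
(3) If AER is supported, set the corresponding error status bit in the Correctable Error Status register to record that the error was detected.
(4) If the corresponding bit in the Correctable Error Mask register is set, no error message TLP is reported upstream.
(5) Several enable bits gate the reporting mechanism at this step:

First, the Unsupported Request Reporting Enable bit in the Device Control register. This bit only controls whether UR is reported.
Second, the Correctable Error Reporting Enable bit in the Device Control register.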
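
The correctable flow can be modeled in a few lines of Python. This is a simplified sketch: registers are kept in a plain dict with made-up key names, the UR-related step is omitted, and I assume a device without AER still sends ERR_COR when Correctable Error Reporting Enable is set.

```python
DEVSTA_CORRECTABLE_DETECTED = 1 << 0  # Device Status, Correctable Error Detected
DEVCTL_CORRECTABLE_EN = 1 << 0        # Device Control, Correctable Error Reporting Enable


def handle_correctable(regs, error_bit, has_aer=True):
    """Log a correctable error and return "ERR_COR" if it is signaled.

    regs      -- dict with keys device_status, device_control,
                 cor_status, cor_mask (hypothetical names)
    error_bit -- bit position of the error in Correctable Error Status
    """
    regs["device_status"] |= DEVSTA_CORRECTABLE_DETECTED
    if has_aer:
        regs["cor_status"] |= 1 << error_bit      # step (3): log in AER
        if regs["cor_mask"] & (1 << error_bit):   # step (4): masked
            return None
    if regs["device_control"] & DEVCTL_CORRECTABLE_EN:  # step (5)
        return "ERR_COR"
    return None


regs = {"device_status": 0, "device_control": DEVCTL_CORRECTABLE_EN,
        "cor_status": 0, "cor_mask": 0x2000}
print(handle_correctable(regs, 0), hex(regs["cor_status"]))
```

A masked error is still logged in the status register; the mask only suppresses the Message.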

■ Find out the offset of AER Capability:

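Before looking for the AER Capability offset, we need to start with PCIe Extended Capabilities.

PCIe Extended Capabilities normally start at offset 100h of Configuration Space, and each one begins with an Extended Capability header (see the figure below).

[15:0] is the Extended Capability ID; the capability types are listed in the table below.
[19:16] is the capability version.
[31:20] is the offset of the next capability.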

Extended Capability ID	Capability Type
0001h	Advanced Error Reporting Extended Capability
0002h	Virtual Channel Extended Capability implemented in a device without an MFVC structure
0003h	Serial Number Extended Capability
0004h	Power Budgeting Extended Capability
0005h	Root Complex Link Declaration Extended Capability
0006h	Root Complex Internal Link Control Extended Capability
0007h	Root Complex Event Collector Endpoint Association Extended Capability
0008h	Multi-Function Virtual Channel (MFVC) Extended Capability
0009h	Virtual Channel Extended Capability implemented in a Multi-Function device with an MFVC structure
000Ah	RCRB Header Extended Capability
000Bh	Vendor-Specific Extended Capability

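The figure below shows an example of locating capabilities. We use lspci to dump the registers of device 03:00.0. The figure shows that the first Extended Capability (at 100h) is the Advanced Error Reporting Extended Capability (ID 0001h) and that the offset of the next Extended Capability is 13Ch. The ID 0003h at 13Ch identifies it as the Serial Number Extended Capability, and so on.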
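
The same walk can be done programmatically. The Python sketch below decodes an Extended Capability header and follows the next pointers starting from 100h. The configuration space here is a synthetic 4 KB buffer that mimics the example (AER at 100h, Serial Number at 13Ch); the capability version value 1 and the function names are assumptions for the illustration.

```python
import struct


def decode_ext_cap_header(header):
    """Split a 32-bit Extended Capability header into its fields."""
    return {
        "id": header & 0xFFFF,            # [15:0]  Extended Capability ID
        "version": (header >> 16) & 0xF,  # [19:16] capability version
        "next": (header >> 20) & 0xFFF,   # [31:20] next capability offset
    }


def find_ext_capability(cfg, cap_id):
    """Return the offset of the capability with cap_id, or None."""
    offset = 0x100
    seen = set()  # guards against a malformed, looping list
    while offset >= 0x100 and offset not in seen and offset + 4 <= len(cfg):
        seen.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0x00000000, 0xFFFFFFFF):
            return None  # no capability present at this offset
        cap = decode_ext_cap_header(header)
        if cap["id"] == cap_id:
            return offset
        offset = cap["next"] & 0xFFC  # next pointers are DWORD aligned
    return None


# Synthetic configuration space mimicking the example above.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, 0x13C10001)  # AER, next = 13Ch
struct.pack_into("<I", cfg, 0x13C, 0x00010003)  # Serial Number, end of list

print(hex(find_ext_capability(cfg, 0x0001)), hex(find_ext_capability(cfg, 0x0003)))
```

On Linux, the bytes for `cfg` could instead be read from the device's `config` file under `/sys/bus/pci/devices/`; reading the extended region generally needs root privileges.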

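Once the AER Capability is found, the figure below gives the offsets of its registers relative to the capability base: the Uncorrectable Error Status register is at offset 04h, the Uncorrectable Error Mask register at 08h, and so on.

In the dump above, 110h is the Correctable Error Status register (AER base 100h plus offset 10h), and bit 13 is 1, which means the device has had an Advisory Non-Fatal Error.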
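
Since lspci prints the dump byte by byte in little-endian order, it can be hard to spot bit 13 by eye. The Python sketch below shows the arithmetic. The register offsets are those of the AER capability structure; the four sample bytes (00 20 00 00) are an assumed reading of the 110h row, chosen to be consistent with bit 13 being set.

```python
# Register offsets relative to the AER capability base.
AER_REGISTER_OFFSETS = {
    "Uncorrectable Error Status": 0x04,
    "Uncorrectable Error Mask": 0x08,
    "Uncorrectable Error Severity": 0x0C,
    "Correctable Error Status": 0x10,
    "Correctable Error Mask": 0x14,
}
ADVISORY_NON_FATAL_BIT = 13


def aer_register_address(aer_base, name):
    """Absolute configuration space offset of an AER register."""
    return aer_base + AER_REGISTER_OFFSETS[name]


def dword_from_dump(byte_values):
    """Assemble four dump bytes (lowest address first) into a 32-bit value."""
    return int.from_bytes(bytes(byte_values), "little")


address = aer_register_address(0x100, "Correctable Error Status")
value = dword_from_dump([0x00, 0x20, 0x00, 0x00])
print(hex(address), hex(value), (value >> ADVISORY_NON_FATAL_BIT) & 1)
```

The byte 20h sits at address 111h, so it covers bits 15:8 of the register; 20h is 0010 0000b, which is bit 13 of the 32-bit value.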

Reference: PCI Express® Base Specification Revision 3.0, November 10, 2010

Answers:

(1)

(2)

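This trace was captured after we deliberately prevented the device's memory read TLP from receiving a completion TLP from the root port. As a result, the device detects both the "Completion Timeout" and the "ECRC Error Status" Uncorrectable Errors at the same time and reflects them in the Uncorrectable Error Status register. The device then sends an ERR_NONFATAL message TLP to the upstream root port, with the routing field set to "To Root Complex".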

26 comments:

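L.J. (November 4, 2021, 11:48 PM)
Hello, I have also been dealing with AER errors recently,
but I could not fully follow this post and have a few questions:
1. Does 100h refer to the 100 below f0 in the image?
2. Why is the offset of the next Extended Capability 13Ch?
3. Does 110h refer to the 110 in the image? (f0→100→110, reading down)
4. Following on from the previous question, if so, where is bit 13?
The 110 row reads 00 20 00 .....a0...00,
and I do not see the number 1 anywhere. Could you explain?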
我的建築奇幻旅程 (March 27, 2022, 11:42 PM)
What situation does this message indicate?
[ 100.016115] pcieport 0001:00:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 100.025678] pcieport 0001:00:00.0: device [1957:8d90] error status/mask=00000001/00006000
[ 100.034024] pcieport 0001:00:00.0: [ 0] RxErr (First)
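大同 Work Notes (March 28, 2022, 12:18 AM)
This is the Receiver Error (RxErr) among the Correctable Errors. "Table 6-3: Physical Layer Error List" in the Gen3 spec mentions four cases. Overall it most likely points to a signal problem. If you have a Protocol Analyzer, capture a trace and check whether the LTSSM enters Recovery.

1. Disparity is used in 8b/10b encoding to balance the electrical signal, i.e. to keep the 0s and 1s evenly spread, presumably to avoid noise and similar problems. As described below, an incorrect running disparity, or a received Symbol that does not match the running disparity table, is an RxErr.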

■ Section 4.2.1.1.3
If a received Symbol is found in the column corresponding to the incorrect running disparity or if the Symbol does not correspond to either column, the Physical Layer must notify the Data Link Layer that the received Symbol is invalid. This is a Receiver Error, and is a reported error associated with the Port

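2. The following cases are also RxErr:
■ Section 4.2.1.2
A TLP that does not start with an STP Symbol, or does not end with an END Symbol or EDB Symbol
A DLLP that does not start with an SDP Symbol, or does not end with an END Symbol
Non-00h data seen on a lane while the transmitter is in Logical Idle
On a link wider than x1, the TLP STP Symbol must be placed in Lane 0
On a link wider than x1, the DLLP SDP Symbol must be placed in Lane 0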
Receivers may optionally check for violations of the rules of this section. These checks are independently optional (see Section 6.2.3.4). If checked, violations are Receiver Errors

3. As described below, for 8b/10b, conditions such as Framing Errors and Loss of Symbol Lock are RxErr. For 128b/130b, conditions such as loss of Block Alignment and Elasticity Buffer overflow are RxErr.

■ Section 4.2.4.7
8b/10b decode errors must be checked and trigger a Receiver Error in specified LTSSM states (see Table 4-14), which is a reported error associated with the Port (see Section 6.2). Triggering a Receiver Error on any or all of Framing Error, Loss of Symbol Lock, Lane Deskew Error, and Elasticity Buffer Overflow/Underflow is optional

128b/130b Framing errors must be checked and trigger a Receiver Error in the LTSSM states specified in Table 4-14. The Receiver Error is a reported error associated with the Port (see Section 6.2). Triggering a Receiver Error on any of all of loss of Block Alignment, Elasticity Buffer Overflow/Underflow, and loss of Lane-to-Lane de-skew is optional

4. Or, when a Link Error occurs while the LTSSM is in the Configuration or Recovery state, an RxErr is also reported.
■ Section 4.2.6
Allowing Receiver Errors to be set while in Configuration or Recovery is intended to allow implementations to report Link Errors that occur while processing packets in those states. For example, if the LTSSM transitions from L0 to Recovery while a TLP is being received, a Link Error that occurs after the LTSSM transition can be reported.

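達斯汀 (April 11, 2022, 2:59 AM)
Hello,
Sorry to bother you with a question. Suppose the host runs a VBox virtual machine on an Intel CPU. Even if the EP has been configured with multiple MSI vectors, is the host unable to allocate more than one MSI vector? Thanks.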
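Jheng (February 8, 2023, 4:14 AM)
Hello, have you looked into SERR#? From the spec it seems that errors can be reported as long as error reporting is enabled in Device Control, so I am not sure how important the SERR# bit is.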
prabhu (December 22, 2023, 11:14 PM)
Hi Dantong , It will be much useful if you can give me an 3 example flows, like my endpoint device detected a correctable, uncorrectable-fatal, non fatal error .how my end point will report it to RC and how RC will react for the corresponding issue.(like in which flow the config reg are read/written by each device through Error message or some other mechanism )