
Jetson Xavier NX - issue with the PCIe communication

Mar 04, 2024

We are facing an issue with PCIe communication on our Jetson Xavier NX modules. The Jetson module is connected via a PI7C9X2G608GP switch to four M.2 connectors. In the current configuration, these are populated with Google Coral modules and an NVMe SSD. We noticed that PCIe bus errors occasionally appear in the kernel log and that communication, mostly with the Coral modules, is affected. This seems to happen mainly when data is being transferred to the SSD while other PCIe communication takes place at the same time. For example, even if the Coral modules are not actively used but data is written to the SSD, the following errors appear in the kernel log:

[ 256.653687] apex 0005:05:00.0: Apex performance not throttled due to temperature

[ 256.671235] pcieport 0005:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000

[ 256.671260] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)

[ 256.671734] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00004000/00400000

[ 256.672057] pcieport 0005:00:00.0: [14] Completion Timeout (First)

[ 256.672309] pcieport 0005:00:00.0: broadcast error_detected message

[ 256.672329] pcieport 0005:00:00.0: AER: Device recovery failed

[ 318.093900] apex 0005:05:00.0: Apex performance not throttled due to temperature

[ 318.112313] pcieport 0005:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000

[ 318.112333] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)

[ 318.112823] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00004000/00400000

[ 318.113116] pcieport 0005:00:00.0: [14] Completion Timeout (First)

[ 318.113354] pcieport 0005:00:00.0: broadcast error_detected message

[ 318.113373] pcieport 0005:00:00.0: AER: Device recovery failed

[ 1347.198116] apex 0005:03:00.0: Apex performance not throttled due to temperature

[ 1347.225990] pcieport 0005:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000

[ 1347.226049] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)

[ 1347.226702] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00004000/00400000

[ 1347.227059] pcieport 0005:00:00.0: [14] Completion Timeout (First)

[ 1347.227325] pcieport 0005:00:00.0: broadcast error_detected message

[ 1347.227347] pcieport 0005:00:00.0: AER: Device recovery failed

[ 1398.398871] apex 0005:04:00.0: Apex performance not throttled due to temperature

[ 1398.433989] pcieport 0005:00:00.0: AER: Uncorrected (Non-Fatal) error received: id=0000

[ 1398.434009] pcieport 0005:00:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0000(Requester ID)

[ 1398.434447] pcieport 0005:00:00.0: device [10de:1ad0] error status/mask=00004000/00400000

[ 1398.434824] pcieport 0005:00:00.0: [14] Completion Timeout (First)

[ 1398.435091] pcieport 0005:00:00.0: broadcast error_detected message

[ 1398.435110] pcieport 0005:00:00.0: AER: Device recovery failed
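For readers who want to decode the "error status/mask" values in the log above, here is a minimal Python sketch. It assumes the standard bit layout of the AER Uncorrectable Error Status and Mask registers from the PCI Express Base Specification; the function names are hypothetical helpers for illustration and not part of any existing tool.

```python
import re
from typing import List, Optional, Tuple

# Bit positions of the AER Uncorrectable Error Status/Mask registers
# (assumed from the PCI Express Base Specification; only common bits listed).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}


def decode_aer(value: int) -> List[str]:
    """Return the names of all bits set in a 32-bit AER register value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(UNCORRECTABLE_BITS.get(bit, "bit %d (not in table)" % bit))
    return names


def parse_status_mask(line: str) -> Optional[Tuple[int, int]]:
    """Extract (status, mask) from a kernel 'error status/mask=' log line."""
    m = re.search(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


# Hypothetical usage with one line copied from the log above.
line = ("[ 256.671734] pcieport 0005:00:00.0: device [10de:1ad0] "
        "error status/mask=00004000/00400000")
status, mask = parse_status_mask(line)
print("status:", decode_aer(status))
print("mask:  ", decode_aer(mask))
```

Under that assumed layout, the status value 00004000 has only bit 14 set, which agrees with the "[14] Completion Timeout" line that the kernel prints, and the mask value 00400000 has only bit 22 set.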

However, the errors do not seem to be caused specifically by the Coral modules, because they also occur without these modules, e.g. when calling lspci -vvv during an SSD data transfer. We have already tried disabling ASPM on the kernel command line and increasing the completion timeout. We have also tested several SSDs from different manufacturers to make sure that the problem is not related to the SSD chipset or firmware. The only interesting finding from these tests is that the errors seem to occur much more often with M-key SSDs than with B+M-key SSDs.
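One way to double-check that the ASPM setting actually took effect is sketched below in Python. It assumes ASPM was disabled with the kernel parameter pcie_aspm=off (the post does not say which parameter was used) and that lspci -vvv reports the link state on a "LnkCtl: ASPM ...;" line. Both function names are hypothetical helpers, not an existing API.

```python
import re
from typing import Dict, Optional


def aspm_cmdline_setting(cmdline: str) -> Optional[str]:
    """Return the value of pcie_aspm= on a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("pcie_aspm="):
            return token.split("=", 1)[1]
    return None


def aspm_per_device(lspci_text: str) -> Dict[str, str]:
    """Map each device address in 'lspci -vvv' output to its LnkCtl ASPM state.

    Assumes device header lines start in column 0 and that capability
    lines are indented, which is how lspci normally formats its output.
    """
    states = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Header line: the first field is the device address.
            device = line.split()[0]
            continue
        m = re.search(r"LnkCtl:\s+ASPM ([^;]+);", line)
        if m and device is not None:
            states[device] = m.group(1).strip()
    return states


# Hypothetical usage on the target (lspci usually needs root to show LnkCtl):
#   with open("/proc/cmdline") as f:
#       print(aspm_cmdline_setting(f.read()))
#   import subprocess
#   out = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout
#   print(aspm_per_device(out))
```

Since calling lspci -vvv during an SSD transfer can itself trigger the errors, it is probably safer to capture its output while the bus is idle and parse the saved text afterwards.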