Something goes wrong with PCIe on my DGX Station V100 and Ubuntu freezes several times a day: the mouse pointer still moves, but clicks do nothing.
This happened several times today.
The syslog is below:
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427570] NVRM: GPU at PCI:0000:08:00: GPU-ceb60853-2618-02ad-a2a8-d4c72f186f3d
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427572] NVRM: GPU Board Serial Number: 0324418141428
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427574] NVRM: Xid (PCI:0000:08:00): 79, pid=‘’, name=, GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427576] NVRM: GPU 0000:08:00.0: GPU has fallen off the bus.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.427577] NVRM: GPU 0000:08:00.0: GPU serial number is 0324418141428.
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: A GPU crash dump has been created. If possible, please run
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 4 16:12:11 ovsdl-DGX-Station kernel: [ 1115.428045] NVRM: the NVIDIA kernel module is unloaded.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1124.448571] NVRM: Xid (PCI:0000:07:00): 8, pid=3896, name=msedge, Channel 00000038
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963866] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963879] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963883] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963887] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.963891] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981359] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981364] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981368] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981371] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981374] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.981377] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998868] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998873] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998878] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998881] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998885] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1125.998887] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016377] NVRM: Xid (PCI:0000:07:00): 79, pid=‘’, name=, GPU has fallen off the bus.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016379] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016381] NVRM: GPU 0000:07:00.0: GPU has fallen off the bus.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016383] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016385] NVRM: GPU 0000:07:00.0: GPU serial number is 0324418141545.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016388] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016391] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016394] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.016397] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033892] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033897] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033900] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033903] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033909] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.033911] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: A GPU crash dump has been created. If possible, please run
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: nvidia-bug-report.sh as root to collect this data before
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.034432] NVRM: the NVIDIA kernel module is unloaded.
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051401] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051406] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051409] pcieport 0000:00:02.0: device [8086:6f04] error status/mask=00004000/00000000
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051412] pcieport 0000:00:02.0: [14] Completion Timeout (First)
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051416] pcieport 0000:00:02.0: broadcast error_detected message
Jan 4 16:12:21 ovsdl-DGX-Station kernel: [ 1126.051418] pcieport 0000:00:02.0: AER: Device recovery failed
Jan 4 16:12:25 ovsdl-DGX-Station /usr/lib/gdm3/gdm-x-session[3042]: (EE) client bug: timer event2 debounce: offset negative (-923ms)
Jan 4 16:12:25 ovsdl-DGX-Station /usr/lib/gdm3/gdm-x-session[3042]: (EE) client bug: timer event2 debounce short: offset negative (-936ms)
Jan 4 16:12:25 ovsdl-DGX-Station kernel: [ 1129.949400] nvidia-modeset: ERROR: GPU:0: Failed to query display engine channel state: 0x0000c37e:0:0:0x0000000f
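
A log like the one above is easier to read once the repeated entries are counted. The following is only a rough sketch, assuming the message shapes match the excerpt above exactly; the function name `summarize` and both regular expressions are illustrative, not part of any NVIDIA or kernel tooling. It counts Xid codes per GPU PCI address and decodes which bits are set in the AER `error status` word (here `00004000`, which has only bit 14 set, matching the `[14] Completion Timeout` lines).

```python
import re
import sys
from collections import Counter

# Patterns are assumptions derived from the syslog excerpt above;
# other driver or kernel versions may format these messages differently.
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:]+)\): (\d+),")
AER_STATUS_RE = re.compile(
    r"pcieport (\S+): device \[[0-9a-fA-F:]+\] "
    r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})"
)


def summarize(lines):
    """Count Xid events per (GPU address, Xid code) and AER status
    events per (root port address, tuple of set status bits)."""
    xids = Counter()
    aer = Counter()
    for line in lines:
        m = XID_RE.search(line)
        if m:
            xids[(m.group(1), int(m.group(2)))] += 1
            continue
        m = AER_STATUS_RE.search(line)
        if m:
            status = int(m.group(2), 16)
            # List the bit positions that are set in the status word.
            bits = tuple(b for b in range(32) if (status >> b) & 1)
            aer[(m.group(1), bits)] += 1
    return xids, aer


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log path is taken from the command line, e.g. /var/log/syslog.
    with open(sys.argv[1], errors="replace") as fh:
        xid_counts, aer_counts = summarize(fh)
    for (gpu, code), n in sorted(xid_counts.items()):
        print(f"GPU {gpu}: Xid {code} x{n}")
    for (port, bits), n in sorted(aer_counts.items()):
        print(f"Port {port}: AER status bits {list(bits)} x{n}")
```

Saved under a hypothetical name such as `summarize_xid.py`, it could be run as `python3 summarize_xid.py /var/log/syslog`. Comparing the counts across several freezes may show whether the same GPUs (here 0000:07:00 and 0000:08:00) and the same root port (0000:00:02.0) are involved each time.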