
Performance issues of data transmission speed in PCIe EP mode

Mar 04, 2024

Hi,


After confirming that the PCIe EP mode connectivity is correct, I tried to test the data transfer speed with mmap and memcpy. The relevant information for the EP and RP sides is as follows:


EP side (Orin):


root@orin:~# dmesg | grep pci_epf_nv_test
[ 3754.209715] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM phys: 0x12f09c000
[ 3754.209745] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM IOVA: 0xffff0000
[ 3754.209792] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM virt: 0x00000000fc0b932a

RP side (x86):


# pci device tree
root@8208:~# lspci -tvv
-[0000:00]-+-00.0 Intel Corporation 8th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S]
           +-01.0-[01-05]--+-00.0 NVIDIA Corporation Device 2216
           |               \-00.1 NVIDIA Corporation Device 1aef
           +-01.1-[06]----00.0 NVIDIA Corporation Device 0001

# PCIe theoretical bandwidth:
root@8208:~# lspci -s 06:00.0 -vvv
06:00.0 RAM memory: NVIDIA Corporation Device 0001
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
Interrupt: pin A routed to IRQ 10
Region 0: Memory at 94e00000 (32-bit, non-prefetchable) [size=64K]
Region 2: Memory at 94900000 (64-bit, prefetchable) [size=128K]
Region 4: Memory at 94e10000 (64-bit, non-prefetchable) [size=4K]
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (downgraded), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-

From the LnkSta output above, the link trained to 8 GT/s x8 (downgraded from the 16 GT/s advertised in LnkCap), so with 128b/130b encoding the theoretical bandwidth is roughly 8 GB/s.
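
For reference, a minimal sketch of that arithmetic (it ignores TLP/DLLP protocol overhead, so the usable bandwidth will be somewhat lower than this ceiling):

#include <cstdio>

int main() {
    // LnkSta above: 8 GT/s per lane (Gen3), x8 width, 128b/130b line coding.
    const double gts      = 8e9;            // transfers per second per lane
    const double lanes    = 8.0;
    const double encoding = 128.0 / 130.0;  // 128b/130b coding efficiency
    double bits_per_s = gts * lanes * encoding;   // ~63 Gbit/s
    printf("theoretical link bandwidth: ~%.2f GB/s\n",
           bits_per_s / 8e9);                     // ~7.88 GB/s
    return 0;
}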


So I tested the actual bandwidth with memcpy, covering the following two cases:



  1. Allocate memory through malloc, then memcpy.


static void BM_memcpy(benchmark::State& state) {
    int64_t size = state.range(0);
    char* src = (char*) malloc(size);
    memset(src, 'b', size);            // touch the source so it is backed by real pages
    char* dest = (char*) malloc(size);
    for (auto _ : state) {
        memcpy(dest, src, size);       // host-RAM to host-RAM copy
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * size);
    free(src);
    free(dest);
}


  2. Map the shared RAM via mmap, then memcpy:


#define MAP_SIZE (1024 * 64)
#define MAP_MASK (MAP_SIZE - 1)

static void BM_mempcy_target_addr(benchmark::State& state) {
    uint64_t target = 0x94e00000; // EP BAR0 address as seen from the RP (Region 0 above)
    int map_fd = open("/dev/mem", O_RDWR | O_ASYNC);
    void* map_base = mmap(nullptr, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                          map_fd, target & ~MAP_MASK);
    char* virt_addr = (char*) map_base + (target & MAP_MASK);

    int64_t size = state.range(0) > MAP_SIZE ? MAP_SIZE : state.range(0);
    char* src = (char*) malloc(size);
    memset(src, 'b', size);
    char* dest = virt_addr;
    for (auto _ : state) {
        memcpy(dest, src, size);       // host RAM to the mmap'ed BAR0 window
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * size);

    free(src);
    munmap(map_base, MAP_SIZE);
    close(map_fd);
}
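
For reference, a minimal sketch of the remaining boilerplate of the benchmark file: the headers the two functions above need (at the top of the file) and the Google Benchmark registrations; the Arg sizes are simply the four sizes that appear in the result table below.

// Headers needed by the two benchmark functions above.
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Register both benchmarks for the transfer sizes shown in the results.
BENCHMARK(BM_memcpy)->Arg(1024)->Arg(4096)->Arg(16384)->Arg(65536);
BENCHMARK(BM_mempcy_target_addr)->Arg(1024)->Arg(4096)->Arg(16384)->Arg(65536);
BENCHMARK_MAIN();

// Build, e.g.: g++ -O2 bench.cc -lbenchmark -lpthread -o bench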

Then I ran the benchmarks on the RP side (x86); the result is:


    Run on (16 X 5000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.07, 0.03, 0.01
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_memcpy/1024 7.57 ns 7.57 ns 95032348 bytes_per_second=125.938G/s
BM_memcpy/4096 28.1 ns 28.1 ns 24437451 bytes_per_second=135.9G/s
BM_memcpy/16384 219 ns 219 ns 3203777 bytes_per_second=69.7667G/s
BM_memcpy/65536 1096 ns 1096 ns 678141 bytes_per_second=55.6696G/s
BM_mempcy_target_addr/1024 1764 ns 1763 ns 396956 bytes_per_second=553.862M/s
BM_mempcy_target_addr/4096 7375 ns 7375 ns 94748 bytes_per_second=529.644M/s
BM_mempcy_target_addr/16384 11867018 ns 11866844 ns 57 bytes_per_second=1.31669M/s
BM_mempcy_target_addr/65536 58472399 ns 58471480 ns 12 bytes_per_second=1094.55k/s

According to the BM_mempcy_target_addr results, the actual bandwidth peaks at only about 550 MB/s for small sizes and collapses to around 1 MB/s for 16 KiB and larger, far below the theoretical ~8 GB/s. Is there any information I missed here that would explain why my test results differ so much?