
Performance issues of data transmission speed in PCIe EP mode

Mar 04, 2024

Hi,


After confirming that the PCIe EP mode connectivity is correct, I tried to test the data transfer speed with mmap and memcpy. The relevant information for the EP and RP sides is as follows:


EP side (Orin):


root@orin:~# dmesg | grep pci_epf_nv_test
[ 3754.209715] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM phys: 0x12f09c000
[ 3754.209745] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM IOVA: 0xffff0000
[ 3754.209792] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM virt: 0x00000000fc0b932a

RP side (x86):


# pci device tree
root@8208:~# lspci -tvv
-[0000:00]-+-00.0 Intel Corporation 8th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S]
           +-01.0-[01-05]--+-00.0 NVIDIA Corporation Device 2216
           |               \-00.1 NVIDIA Corporation Device 1aef
           +-01.1-[06]----00.0 NVIDIA Corporation Device 0001

# PCIe theoretical bandwidth:
root@8208:~# lspci -s 06:00.0 -vvv
06:00.0 RAM memory: NVIDIA Corporation Device 0001
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-
Interrupt: pin A routed to IRQ 10
Region 0: Memory at 94e00000 (32-bit, non-prefetchable) [size=64K]
Region 2: Memory at 94900000 (64-bit, prefetchable) [size=128K]
Region 4: Memory at 94e10000 (64-bit, non-prefetchable) [size=4K]
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (downgraded), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR+
10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS-, TPHComp-, ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-

From the LnkSta output above, the link trained to 8 GT/s x8 (downgraded from the 16 GT/s advertised in LnkCap), so with 128b/130b encoding the theoretical bandwidth is roughly 8 GB/s.
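
For reference, a minimal sketch of that arithmetic (it ignores TLP/DLLP protocol overhead, so the usable bandwidth will be somewhat lower than this ceiling):

#include <cstdio>

int main() {
    // LnkSta above: 8 GT/s per lane (Gen3), x8 width, 128b/130b line coding.
    const double gts      = 8e9;            // transfers per second per lane
    const double lanes    = 8.0;
    const double encoding = 128.0 / 130.0;  // 128b/130b coding efficiency
    double bits_per_s = gts * lanes * encoding;   // ~63 Gbit/s
    printf("theoretical link bandwidth: ~%.2f GB/s\n",
           bits_per_s / 8e9);                     // ~7.88 GB/s
    return 0;
}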


So I tested the actual bandwidth with memcpy, covering the following two cases:



  1. Allocate memory through malloc, then memcpy.


static void BM_memcpy(benchmark::State& state) {
    int64_t size = state.range(0);
    char* src = (char*) malloc(size);
    memset(src, 'b', size);            // touch the source so it is backed by real pages
    char* dest = (char*) malloc(size);
    for (auto _ : state) {
        memcpy(dest, src, size);       // host-RAM to host-RAM copy
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * size);
    free(src);
    free(dest);
}


  2. Map the shared RAM via mmap, then memcpy:


#define MAP_SIZE (1024 * 64)
#define MAP_MASK (MAP_SIZE - 1)

static void BM_mempcy_target_addr(benchmark::State& state) {
    uint64_t target = 0x94e00000; // EP BAR0 address as seen from the RP (Region 0 above)
    int map_fd = open("/dev/mem", O_RDWR | O_ASYNC);
    void* map_base = mmap(nullptr, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                          map_fd, target & ~MAP_MASK);
    char* virt_addr = (char*) map_base + (target & MAP_MASK);

    int64_t size = state.range(0) > MAP_SIZE ? MAP_SIZE : state.range(0);
    char* src = (char*) malloc(size);
    memset(src, 'b', size);
    char* dest = virt_addr;
    for (auto _ : state) {
        memcpy(dest, src, size);       // host RAM to the mmap'ed BAR0 window
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * size);

    free(src);
    munmap(map_base, MAP_SIZE);
    close(map_fd);
}
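
For reference, a minimal sketch of the remaining boilerplate of the benchmark file: the headers the two functions above need (at the top of the file) and the Google Benchmark registrations; the Arg sizes are simply the four sizes that appear in the result table below.

// Headers needed by the two benchmark functions above.
#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Register both benchmarks for the transfer sizes shown in the results.
BENCHMARK(BM_memcpy)->Arg(1024)->Arg(4096)->Arg(16384)->Arg(65536);
BENCHMARK(BM_mempcy_target_addr)->Arg(1024)->Arg(4096)->Arg(16384)->Arg(65536);
BENCHMARK_MAIN();

// Build, e.g.: g++ -O2 bench.cc -lbenchmark -lpthread -o bench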

Then I ran the benchmarks on the RP side (x86); the result is:


    Run on (16 X 5000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.07, 0.03, 0.01
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_memcpy/1024 7.57 ns 7.57 ns 95032348 bytes_per_second=125.938G/s
BM_memcpy/4096 28.1 ns 28.1 ns 24437451 bytes_per_second=135.9G/s
BM_memcpy/16384 219 ns 219 ns 3203777 bytes_per_second=69.7667G/s
BM_memcpy/65536 1096 ns 1096 ns 678141 bytes_per_second=55.6696G/s
BM_mempcy_target_addr/1024 1764 ns 1763 ns 396956 bytes_per_second=553.862M/s
BM_mempcy_target_addr/4096 7375 ns 7375 ns 94748 bytes_per_second=529.644M/s
BM_mempcy_target_addr/16384 11867018 ns 11866844 ns 57 bytes_per_second=1.31669M/s
BM_mempcy_target_addr/65536 58472399 ns 58471480 ns 12 bytes_per_second=1094.55k/s

According to the BM_mempcy_target_addr results, the actual bandwidth peaks at only about 550 MB/s for small sizes and collapses to around 1 MB/s for 16 KiB and larger, far below the theoretical ~8 GB/s. Is there any information I missed here that would explain why my test results differ so much?