As broadcast systems continue their transition from dedicated hardware appliances to standard commercial off-the-shelf (COTS) computing platforms, the efficiency of data movement becomes a critical factor in overall system performance.
In particular, the growing reliance on Graphics Processing Unit (GPU)-accelerated processing for real-time media encoding, transcoding, AI, and image manipulation demands optimized data exchange both within compute nodes and across distributed systems. This shift introduces new challenges in memory sharing and communication, especially when dealing with heterogeneous memory hierarchies and high-throughput, low-latency requirements.
Using inter-node Remote Direct Memory Access (RDMA) [1] transfers as a baseline, this study presents a comparative evaluation of memory sharing mechanisms across three distinct GPU transfer paths: inter-node GPU-GPU, intra-node host-GPU, and intra-node GPU-GPU memory exchange. The performance of each path is assessed under different software configurations: native memory operations without the aid of any communication framework (referred to here as native), and higher-level abstractions such as Unified Communication X (UCX) [2] and Libfabric [3]. These configurations represent a range of abstraction levels and transport optimizations commonly used in High-Performance Computing (HPC) and media processing workloads.
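To give a concrete sense of the native configuration, the following minimal sketch performs an intra-node GPU-GPU copy with plain CUDA runtime calls, with no communication framework involved. The device IDs and the 64 MiB payload are illustrative assumptions, not the test parameters used in this study.

```c
#include <cuda_runtime.h>
#include <cstdio>

// Minimal sketch of a "native" intra-node GPU-GPU transfer: raw CUDA
// runtime calls only. Device IDs (0 and 1) and the 64 MiB buffer size
// are illustrative assumptions, not the measured configuration.
int main() {
    const size_t bytes = 64ull << 20;           // 64 MiB payload
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);  // can GPU 0 reach GPU 1 directly?

    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    if (canAccess)
        cudaDeviceEnablePeerAccess(1, 0);       // direct P2P over PCIe/NVLink
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // Peer copy: routed directly GPU-to-GPU when peer access is enabled,
    // otherwise staged through host memory by the CUDA runtime.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    printf("Copied %zu bytes GPU0 -> GPU1 (P2P %s)\n", bytes,
           canAccess ? "enabled" : "unavailable");

    cudaFree(dst);
    cudaSetDevice(0);
    cudaFree(src);
    return 0;
}
```

UCX and Libfabric sit above this level: rather than issuing device-specific copy calls directly, an application hands buffers to the framework, which selects a transport (shared memory, P2P, RDMA) on its behalf.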
All experiments are performed on a uniform hardware platform featuring GPUs with programmable interfaces such as the Compute Unified Device Architecture (CUDA) [4], which enables developers to offload and execute code directly on NVIDIA GPUs. The system is equipped with network interfaces that support RDMA, allowing for low-latency, high-throughput data transfers. Key performance metrics, including Peripheral Component Interconnect Express (PCIe) [5] bandwidth, latency, and Central Processing Unit (CPU) utilization, are measured to evaluate the impact of each communication method on data transfer efficiency.
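As one example of how such a metric can be collected, the sketch below times pinned host-to-GPU copies with CUDA events and derives an effective PCIe bandwidth figure. The buffer size and iteration count are arbitrary illustrative choices, not the methodology of this study.

```c
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative sketch (not the study's actual harness): measure effective
// PCIe bandwidth for pinned host-to-device copies using CUDA events.
int main() {
    const size_t bytes = 256ull << 20;  // 256 MiB, an arbitrary test size
    const int iters = 20;               // arbitrary repetition count

    void *hostBuf = nullptr, *devBuf = nullptr;
    cudaMallocHost(&hostBuf, bytes);    // pinned host memory enables true DMA
    cudaMalloc(&devBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * iters / (ms * 1e-3) / 1e9;
    printf("H2D: %.2f GB/s over %d x %zu MiB copies\n",
           gbps, iters, bytes >> 20);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return 0;
}
```

Latency is typically measured the same way with small payloads, while CPU utilization is sampled from the operating system during sustained transfers.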
These results are intended to support the development of high-performance applications that require efficient and cost-effective memory sharing between computational devices, particularly in domains such as media processing, machine learning, and scientific computing, where data locality and transport efficiency are essential. In addition, this study highlights how higher-level communication frameworks provide flexible abstractions over heterogeneous transports.
Speaker: Sithideth Viengkhou, Director – Incubator Group, Riedel Communications