NAB Show

Session.

Performance Optimization of SMPTE 2110 Applications on COTS Hardware, with Network & JPEG XS Offloading to DPUs

Tuesday, April 21 | 10:40 – 11 p.m.

Broadcast Engineering and IT Conference

As media workflows migrate to software-defined frameworks running on commercial off-the-shelf (COTS) hardware, with media essence exchanged over IP networks, applications executing in this environment can suffer performance and compliance issues when the CPU and/or GPU becomes overloaded.

In such cases, applications can reduce host resource usage by offloading operations to dedicated Data Processing Units (DPUs) and smart Network Interface Cards (NICs). These devices implement the complete network stack in dedicated hardware and use kernel bypass to perform network communication without impacting host system operations. DPUs also contain additional general-purpose processing cores that can be used for various types of network stream processing, such as encryption or media essence transcoding; one example is JPEG XS.

In broadcast, one example of offloading is implementing the SMPTE ST 2110 stack on the DPU. A DPU can reasonably support one uncompressed video stream: uncompressed streams have high data rates and therefore place a significant load on the DPU cores and memory bandwidth. Compressed JPEG XS streams, by contrast, lower the data rates and thus reduce the overall load on the DPU. This enables a higher stream density and lower power consumption, with no compromise on latency or image quality compared to uncompressed video. JPEG XS in SMPTE ST 2110 (ST 2110-22) is designed to replace uncompressed video at roughly 10% of the bandwidth. It is a highly platform-flexible codec that can be implemented in hardware or as lightweight, parallelizable software.
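The stream-density argument can be illustrated with back-of-envelope arithmetic. The sketch below is not from the session: it assumes a 1080p60 4:2:2 10-bit stream (20 bits per pixel), ignores RTP/IP overhead, and takes the ~10:1 JPEG XS compression ratio mentioned above to compare how many streams fit on a 100 GbE link.

```python
# Illustrative arithmetic (assumptions, not session results): stream counts
# on a 100 GbE link for uncompressed ST 2110-20 vs. JPEG XS (ST 2110-22).

def video_bitrate_gbps(width, height, fps, bits_per_pixel):
    """Active-picture bit rate in Gb/s (RTP/IP packet overhead ignored)."""
    return width * height * fps * bits_per_pixel / 1e9

# 1080p60, 4:2:2 chroma subsampling at 10 bits per component = 20 bpp.
uncompressed = video_bitrate_gbps(1920, 1080, 60, 20)
jpeg_xs = uncompressed * 0.10  # ~10% of uncompressed, per ST 2110-22 usage

link_gbps = 100  # 100 GbE
print(f"uncompressed: {uncompressed:.2f} Gb/s -> ~{int(link_gbps // uncompressed)} streams")
print(f"JPEG XS:      {jpeg_xs:.2f} Gb/s -> ~{int(link_gbps // jpeg_xs)} streams")
```

Roughly 40 uncompressed streams versus around 400 JPEG XS streams on the same link, which is the order-of-magnitude density gain the abstract refers to.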

Migrating network transmit and receive operations, as well as JPEG XS transcoding, to a DPU reduces host CPU usage, freeing more cycles for application functions. Direct Memory Access (DMA) transfers between the DPU and GPU eliminate PCIe bus transactions to and from host memory that can introduce system jitter. While JPEG XS processing could be performed on the GPU to offload the CPU, moving it to the DPU cores leaves 100% of the GPU processing capacity available for media and AI functions. Moreover, JPEG XS transcoding reduces network bandwidth, allowing for more efficient network usage.
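The host-memory saving from direct DPU-to-GPU DMA can also be sketched numerically. The function and numbers below are hypothetical: they assume that a stream staged through host memory crosses the host memory subsystem twice (one NIC write, one GPU read), while a direct DMA path bypasses host memory entirely.

```python
# Hypothetical back-of-envelope (illustrative assumptions, not session data):
# host-memory traffic generated by N received video streams when payloads are
# staged through host memory vs. moved by direct DPU-to-GPU DMA.

def host_traffic_gbps(n_streams, stream_gbps, direct_dma):
    # Staged path: NIC -> host memory -> GPU, so each payload crosses the
    # host memory subsystem twice (one write, one read). Direct DMA avoids
    # host memory altogether.
    return 0.0 if direct_dma else 2 * n_streams * stream_gbps

# Example: 16 uncompressed ~2.5 Gb/s streams.
print(host_traffic_gbps(16, 2.5, direct_dma=False))  # staged through host memory
print(host_traffic_gbps(16, 2.5, direct_dma=True))   # direct DPU->GPU DMA
```

With staging, 16 such streams would generate about 80 Gb/s of host-memory traffic that direct DMA removes entirely, which is the jitter-avoidance benefit described above.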

This paper will compare the tradeoffs and system performance impacts of using DPUs for media transcoding and network functions. It will then present results from real-world industry use cases in media production.