Leverage PowerScale with GPU Direct Storage – Create a data loader with NVIDIA DALI

Introduction

In our journey to optimize high-performance computing environments, we’ve previously explored PowerScale multipath driver testing and its impact on performance. Now, we’re taking the next exciting step: integrating NVIDIA GPU Direct Storage (GDS) with Dell PowerScale using NFS over RDMA. This powerful combination boosts throughput for data-intensive AI and analytics workloads by moving data directly between storage and GPU memory.

Why Use GDS with PowerScale and RDMA?

The speed of data transfer from storage to GPU memory is crucial. Traditional methods, which route data through the CPU, introduce latency and consume valuable resources. GDS, especially when paired with RDMA (Remote Direct Memory Access), offers a game-changing solution:

  • Reduced CPU overhead: By bypassing the CPU, GDS frees up computational resources for other tasks.
  • Faster read and write operations: Direct data paths mean quicker data access for your GPUs.
  • Lower latency: Minimizing data hops translates to more responsive systems.

Dell PowerScale’s support for NFS over RDMA with OneFS enables GDS to transfer data directly between the GPU and the storage system, maximizing efficiency.

Our High-Performance Infrastructure

For this exploration, we’ve assembled a cutting-edge setup that mirrors the configuration outlined in Dell’s GPU Direct Storage configuration guide – you can read more about that guide here. Our infrastructure includes:

  • Dell PowerScale F600 nodes: Equipped with NVMe drives for lightning-fast storage.
  • Dell PowerSwitch Z-series: High-performance network switches (Z9264F-ON and Z9332F-ON) designed for building high-capacity network fabrics.
  • Dell PowerEdge R7525 servers: Each with two NVIDIA A100 GPUs and Mellanox ConnectX-6 NICs to handle compute-intensive workloads.

Configuring PowerScale for GDS: Key Settings

To achieve optimal performance, we implemented several critical settings:

  1. Disabled Compression and Deduplication: This ensures direct data transfer without additional processing overhead.
   isi compression settings modify --enabled=0
   isi dedupe inline settings modify --mode=disabled
  2. Disabled Endurant Cache: We streamlined the data path further by turning off Endurant Cache (EC).
   isi_for_array sysctl efs.bam.ec.mode=0
  3. Enabled Jumbo Frames: This facilitates large data transfers, reducing the number of packets and lowering CPU utilization.
  4. Enabled NFS over RDMA: We activated this on both the network pools and the global NFS settings to maximize performance.
  5. Mount Point Mapping: Each GPU is mapped to a dedicated PowerScale frontend IP and NIC, respecting NUMA node affinity and distributing I/O operations evenly.

Check out our previous blog posts for a full NFS over RDMA setup and configuration walkthrough.

Understanding NUMA Node Affinity and PCIe Topology

A crucial aspect of optimizing GDS performance is understanding and leveraging NUMA (Non-Uniform Memory Access) node affinity and PCIe topology. By ensuring that GPUs and NICs are grouped within the same NUMA node, we minimize communication hops, leading to enhanced performance.

We use NVIDIA’s nvidia-smi tool to check the PCIe topology and affinity:

nvidia-smi topo -m

This output helps us understand the optimal data paths for RDMA and GDS. We also use lspci and lstopo to identify PCIe devices and map them to their NUMA nodes.
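
For example, one way to map a GPU or NIC PCIe address to its NUMA node is to query sysfs directly. The PCIe address below is a placeholder; substitute the addresses reported by lspci on your own system:

# List NVIDIA GPUs and Mellanox NICs with their PCIe addresses
lspci | grep -i -e nvidia -e mellanox

# Look up which NUMA node owns a given PCIe device (address is an example)
cat /sys/bus/pci/devices/0000:e2:00.0/numa_node

# Show the full CPU, cache, and PCIe hierarchy (requires the hwloc package)
lstopo-no-graphics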

Benchmarking with GDS

To put our setup to the test, we use NVIDIA’s gdsio utility. This tool simulates various I/O workloads on the storage system, including sequential reads and writes with different configurations.

Key Recommendations for Optimal GDS Performance

Based on our testing and best practices, we recommend:

  1. PCIe Topology Alignment: Ensure GPU and NIC alignment within the same NUMA node.
  2. File Pool Policy: Set to streaming mode for optimized large sequential read/write operations (see the example command after this list).
  3. Dynamic Routing for RDMA: Enable this to allow the system to adapt the data path for the most efficient transfer route.
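
As a minimal sketch, the streaming access pattern can be applied directly to the test export with isi set; the path below is our lab export, and file pool policies (via the WebUI or isi filepool policies) achieve the same result at scale:

# Recursively set the data access pattern of the export directory to streaming
isi set -R -l streaming /ifs/RDMA-Test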

Hands-on: Setting Up and Testing GDS

Let’s walk through the process of setting up and validating GDS on our system:

1. Installing NVIDIA GDS

Follow the NVIDIA GDS Installation Guide to install NVIDIA drivers, nvidia-fs, and the CUDA toolkit.
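
Once the packages are installed, a quick sanity check confirms that the nvidia-fs kernel module is present and loaded. These are generic Linux commands, independent of the distribution-specific package names in NVIDIA’s guide:

# Confirm the nvidia_fs kernel module is loaded
lsmod | grep nvidia_fs

# The module exposes GDS counters under /proc once it is active
cat /proc/driver/nvidia-fs/stats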

2. Configuring PowerScale for GDS

Apply the settings we discussed earlier:

isi compression settings modify --enabled=0
isi dedupe inline settings modify --mode=disabled

Ensure NFS over RDMA is enabled on the PowerScale subnet and network pool.
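
As a quick check, these settings can be inspected from the OneFS CLI. This is a hedged sketch: the groupnet, subnet, and pool names are assumptions from our lab, and the exact fields shown differ between OneFS releases, so treat our earlier NFS over RDMA post and the configuration guide as the authoritative reference:

# Check that RDMA support is turned on in the global NFS settings
isi nfs settings global view

# Check the MTU (jumbo frames) on the front-end subnet and the pool configuration
isi network subnets view groupnet0.subnet0
isi network pools view groupnet0.subnet0.pool0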

3. Mounting the PowerScale NFS Share Using RDMA

mount -o rdma,vers=3 172.17.2.101:/ifs/RDMA-Test /mnt/RDMA

Confirm the mount with df -h:

Filesystem                   Size  Used Avail Use% Mounted on
172.17.2.101:/ifs/RDMA-Test  110T   43T   64T  41% /mnt/RDMA
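
Step 5 of the PowerScale settings above maps each GPU to its own frontend IP and NIC. A minimal sketch of what that looks like with our two A100s is below; the second frontend IP and the per-GPU mount points are assumptions for illustration:

mkdir -p /mnt/RDMA-gpu0 /mnt/RDMA-gpu1

# GPU0 and NIC0 path on NUMA node 1
mount -o rdma,vers=3 172.17.2.101:/ifs/RDMA-Test /mnt/RDMA-gpu0

# GPU1 and NIC1 path, using a second dedicated frontend IP (assumed address)
mount -o rdma,vers=3 172.17.2.102:/ifs/RDMA-Test /mnt/RDMA-gpu1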

4. Testing and Validation

NUMA Topology Verification

Use nvidia-smi topo -m to check the NUMA topology:

[root@hop-r7525-05 tools]# nvidia-smi topo -m
        GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    64-127  1               N/A
GPU1    NODE     X      NODE    NODE    64-127  1               N/A
NIC0    NODE    NODE     X      PIX
NIC1    NODE    NODE    PIX      X

This output shows both GPUs and NICs within the same NUMA node (Node 1), which is optimal for RDMA performance.
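
Because both GPUs and NICs sit on NUMA node 1, it can also help to keep the benchmark’s CPU threads and memory allocations on that node. A small illustrative example using numactl with the gdsio write test shown later in this post (the node number comes from the topology output above):

# Pin the gdsio process and its memory allocations to NUMA node 1
numactl --cpunodebind=1 --membind=1 ./gdsio -f /mnt/RDMA/testfile -d 0 -m 0 -s 10G -i 1M -w 10 -x 0 -I 1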

GDS Verification

Use the gdscheck.py tool to validate GDS configuration:

./gdscheck.py -p

Sample output confirming that GDS is configured and working:

GDS release version: 1.11.1.6
nvidia_fs version:  2.22 libcufile version: 2.12
...
NFS: Supported
...
rdma devices: Configured

Performance Testing

Let’s run some tests using gdsio:

Write Performance with GDS:

sudo ./gdsio -f /mnt/RDMA/testfile -d 0 -m 0 -s 10G -i 1M -w 10 -x 0 -I 1

Output:

IoType: WRITE XferType: GPUD Threads: 10 DataSetSize: 9669632/10485760(KiB) IOSize: 1024(KiB) Throughput: 1.489903 GiB/sec, Avg_Latency: 6543.972718 usecs

Read Performance with GDS:

sudo ./gdsio -f /mnt/RDMA/testfile -d 0 -m 0 -s 10G -i 1M -w 10 -x 0 -I 0

Output:

IoType: READ XferType: GPUD Threads: 10 DataSetSize: 10177536/10485760(KiB) IOSize: 1024(KiB) Throughput: 6.096986 GiB/sec, Avg_Latency: 1569.551947 usecs

5. Monitoring and Troubleshooting

Use nvidia-smi to monitor GPU activity during tests:

watch -n 1 nvidia-smi

For detailed GDS statistics, use gds_stats:

./gds_stats -p <pid>
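
If you prefer not to attach to a specific process, the nvidia-fs module also exposes cumulative counters under /proc, which is convenient to watch while gdsio is running (assuming the module is loaded, as verified during installation):

watch -n 1 cat /proc/driver/nvidia-fs/stats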

Real-World Application: Accelerating Data Loading with NVIDIA DALI

While benchmarks are crucial, the real test lies in practical applications. Enter NVIDIA DALI (Data Loading Library), a game-changer for optimizing data preprocessing in machine learning workflows.

Why NVIDIA DALI?

DALI addresses a common bottleneck in deep learning: data loading and preprocessing. By leveraging GPU acceleration, DALI ensures your model’s training pipeline operates at maximum efficiency, reducing overall training time.

Key benefits include:

  • Maximizing GPU utilization
  • Creating highly efficient pipelines
  • Cross-framework compatibility (TensorFlow, PyTorch, MXNet)

Getting Started with NVIDIA DALI

Let’s create a simple data loader using DALI to process images from the COCO dataset:

  1. First, install DALI. Use the DALI wheel that matches your CUDA version; for CUDA 12:
   pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda120
  2. Download and extract the COCO dataset onto the RDMA mount:
   wget http://images.cocodataset.org/zips/train2017.zip
   unzip train2017.zip -d /mnt/RDMA/
  3. Write a Python script that creates a DALI pipeline for loading and processing the images. We walk through the full script in the next section.

Deep Dive: Understanding the NVIDIA DALI Loader Script

Let’s break down our DALI loader script to understand how it efficiently loads and processes images using GPU acceleration. This script demonstrates the power of combining GDS with DALI for high-performance data preprocessing from the PowerScale array.

The full script can be downloaded from https://github.com/DellGEOS/GDS/tree/main

import os
import numpy as np
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator

# Directory containing images
file_root = "/mnt/RDMA/train2017"
image_files = [os.path.join(file_root, f) for f in os.listdir(file_root) if f.endswith(('.jpg', '.png'))]
print(f"Found {len(image_files)} images.")

This section sets up our environment:

  • We import necessary libraries, including DALI components.
  • We define the directory where our images are stored (file_root).
  • We create a list of all JPEG and PNG files in the directory.

class ExternalInputIterator:
    def __init__(self, files, batch_size):
        self.files = files
        self.batch_size = batch_size
        self.index = 0

    def __iter__(self):
        self.index = 0
        return self

    def __next__(self):
        if self.index >= len(self.files):
            raise StopIteration

        batch = []
        for _ in range(self.batch_size):
            if self.index >= len(self.files):
                break
            with open(self.files[self.index], 'rb') as f:
                batch.append(np.frombuffer(f.read(), dtype=np.uint8))
            self.index += 1
        return batch

The ExternalInputIterator class is a custom iterator that:

  • Initializes with a list of files and a batch size.
  • Implements the iterator protocol (__iter__ and __next__ methods).
  • Reads files in batches, converting each image file to a NumPy array of bytes.
  • This allows DALI to efficiently load data from our GDS-mounted storage.

@pipeline_def
def create_pipeline():
    jpegs = fn.external_source(source=eii, dtype=types.UINT8)
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, device="gpu", resize_x=224, resize_y=224)
    return images

The create_pipeline function defines our DALI pipeline:

  • It uses the @pipeline_def decorator to create a DALI pipeline.
  • fn.external_source reads data from our custom iterator.
  • fn.decoders.image decodes the images on a “mixed” device, which accepts the encoded data on the CPU and produces decoded output on the GPU, offloading the actual decoding work to the GPU.
  • fn.resize resizes the images to 224×224 pixels, explicitly on the GPU for speed.

# Create an instance of the ExternalInputIterator
batch_size = 32
eii = ExternalInputIterator(image_files, batch_size)

# Instantiate and build the pipeline
pipe = create_pipeline(batch_size=batch_size, num_threads=4, device_id=0)
pipe.build()

# Create a DALI iterator to test loading the images
dali_iter = DALIGenericIterator(pipe, ['data'], size=len(image_files))

# Test loading images using the DALI iterator
for data in dali_iter:
    print("Data loaded successfully!")
    print("Shape of the batch:", data[0]['data'].shape)
    break  # Just check the first batch

This final section puts everything together:

  • We create an instance of our ExternalInputIterator.
  • We instantiate and build the DALI pipeline, specifying batch size, number of threads, and GPU device ID.
  • We create a DALIGenericIterator, which wraps our pipeline for easy integration with PyTorch.
  • Finally, we test the iterator by loading one batch of images and printing its shape.

Key Benefits of This Approach

  1. GPU Acceleration: By using DALI and GDS, we offload image decoding and resizing to the GPU, significantly speeding up these operations.
  2. Efficient Data Loading: The ExternalInputIterator allows us to read data directly from our GDS-mounted storage, minimizing I/O bottlenecks.
  3. Scalability: This approach can handle large datasets efficiently, as demonstrated by processing over 118,000 images from the COCO dataset.
  4. Framework Integration: While we’re using PyTorch in this example (via DALIGenericIterator), DALI can integrate with other frameworks like TensorFlow as well.
  5. Preprocessing Pipeline: The DALI pipeline allows us to easily chain multiple preprocessing steps (like decoding and resizing) that are executed efficiently on the GPU.

By leveraging NVIDIA DALI in combination with GDS, we’ve created a high-performance data loading and preprocessing pipeline. This approach can significantly reduce the time spent on data preparation, allowing researchers and data scientists to focus more on model development and training.

Running this script yields the following output:

Found 118287 images.
Data loaded successfully!
Shape of the batch: torch.Size([32, 224, 224, 3])

This output confirms that we’ve successfully created a data loading pipeline using GDS and DALI, efficiently processing a batch of 32 images, each resized to 224×224 pixels with 3 color channels.

Conclusion

By integrating NVIDIA GPU Direct Storage with Dell PowerScale and leveraging NVIDIA DALI, we’ve unlocked a new level of performance for data-intensive workloads. Our tests demonstrate significant improvements in throughput, confirming the effectiveness of GDS in optimizing storage-to-GPU data transfers.

This setup not only accelerates raw data transfer but also streamlines the entire data preprocessing pipeline, crucial for machine learning workflows. As we’ve seen with our DALI example, this combination allows for efficient, GPU-accelerated data loading and preprocessing directly from our PowerScale array, setting the stage for faster model training and inference.

For data scientists and AI researchers working with large datasets, this architecture provides a robust foundation for deep learning and analytics workloads.

Next Steps

With this powerful infrastructure in place, the next logical step is to fine-tune a vision model using our optimized data pipeline. This could involve:

  1. Implementing a full training loop using a framework like PyTorch or TensorFlow
  2. Benchmarking training times with and without GDS to quantify performance gains
  3. Exploring more complex data augmentation techniques leveraging DALI’s capabilities
