Introduction
In our journey to optimize high-performance computing environments, we’ve previously explored PowerScale multipath driver testing and its impact on performance. Now, we’re taking the next exciting step: integrating NVIDIA GPUDirect Storage (GDS) with Dell PowerScale using NFS over RDMA. This powerful combination boosts throughput for data-intensive AI and analytics workloads.

Why Use GDS with PowerScale and RDMA?
The speed of data transfer from storage to GPU memory is crucial. Traditional methods, which route data through the CPU, introduce latency and consume valuable resources. GDS, especially when paired with RDMA (Remote Direct Memory Access), offers a game-changing solution:
- Reduced CPU overhead: By bypassing the CPU, GDS frees up computational resources for other tasks.
- Faster read and write operations: Direct data paths mean quicker data access for your GPUs.
- Lower latency: Minimizing data hops translates to more responsive systems.
Dell PowerScale’s support for NFS over RDMA with OneFS enables GDS to transfer data directly between the GPU and the storage system, maximizing efficiency.
Our High-Performance Infrastructure
For this exploration, we’ve assembled a cutting-edge setup that mirrors the configuration outlined in Dell’s GPUDirect Storage configuration guide. Our infrastructure includes:

- Dell PowerScale F600 nodes: Equipped with NVMe drives for lightning-fast storage.
- Dell PowerSwitch Z-series: High-performance network switches (Z9264F-ON and Z9332F-ON) designed for building high-capacity network fabrics.
- Dell PowerEdge R7525 servers: Each with 2x NVIDIA A100 GPUs and Mellanox ConnectX-6 NICs to handle compute-intensive workloads.
Configuring PowerScale for GDS: Key Settings
To achieve optimal performance, we implemented several critical settings:
- Disabled Compression and Deduplication: This ensures direct data transfer without additional processing overhead.
isi compression settings modify --enabled=0
isi dedupe inline settings modify --mode=disabled
- Disabled Endurant Cache: We streamlined the data path further by turning off Endurant Cache (EC).
isi_for_array sysctl efs.bam.ec.mode=0
- Enabled Jumbo Frames: This facilitates large data transfers, reducing packet numbers and lowering CPU utilization.
- Enabled NFS Over RDMA: We activated this on both the network pools and the global NFS settings to maximize performance. A rough sketch of the jumbo frame and RDMA commands follows after this list.
- Mount Point Mapping: Each GPU is mapped to a dedicated PowerScale frontend IP and NIC, respecting NUMA node affinity and distributing I/O operations evenly.
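As a rough sketch, the jumbo frame and NFS-over-RDMA settings map to OneFS commands like the following. The subnet and pool names are illustrative, and the exact RDMA flag names vary by OneFS release, so treat this as an assumption and verify against the CLI reference for your version:
# Enable jumbo frames on the front-end subnet (names are illustrative)
isi network subnets modify groupnet0.subnet0 --mtu=9000
# Enable NFS over RDMA globally (older OneFS releases expose this as --nfsv3-rdma-enabled)
isi nfs settings global modify --nfs-rdma-enabled=true
# Confirm the pool serving the RDMA IPs contains RoCE-capable interfaces
isi network pools view groupnet0.subnet0.pool0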
For a full NFS-over-RDMA setup and configuration walkthrough, check out our previous blog posts.
Understanding NUMA Node Affinity and PCIe Topology
A crucial aspect of optimizing GDS performance is understanding and leveraging NUMA (Non-Uniform Memory Access) node affinity and PCIe topology. By ensuring that GPUs and NICs are grouped within the same NUMA node, we minimize communication hops, leading to enhanced performance.
We use NVIDIA’s nvidia-smi tool to check the PCIe topology and affinity:
nvidia-smi topo -m
This output helps us understand the optimal data paths for RDMA and GDS. We also use lspci and lstopo to identify PCIe devices and map them to their NUMA nodes.
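For example, the NUMA node of each GPU and NIC can be read straight from sysfs. The PCI address below is a placeholder; substitute the ones reported by lspci on your host:
# List the GPU and NIC PCIe devices
lspci | grep -i -e nvidia -e mellanox
# Report the NUMA node a given device sits on (-1 means no affinity reported)
cat /sys/bus/pci/devices/0000:81:00.0/numa_node
# Visualize the full CPU/PCIe topology
lstopo-no-graphics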
Benchmarking with GDS
To put our setup to the test, we use NVIDIA’s gdsio utility. This tool simulates various I/O workloads on the storage system, including sequential reads and writes with different configurations.
Key Recommendations for Optimal GDS Performance
Based on our testing and best practices, we recommend:
- PCIe Topology Alignment: Ensure GPU and NIC alignment within the same NUMA node.
- File Pool Policy: Set to streaming mode for optimized large sequential read/write operations (see the example after this list).
- Dynamic Routing for RDMA: Enable this to allow the system to adapt the data path for the most efficient transfer route.
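As one hedged example of the file pool recommendation, the access pattern can be set directly on the test directory from the OneFS CLI (the path is ours; a file pool policy configured in the WebUI achieves the same result):
isi set -R -a streaming /ifs/RDMA-Test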
Hands-on: Setting Up and Testing GDS
Let’s walk through the process of setting up and validating GDS on our system:
1. Installing NVIDIA GDS
Follow the NVIDIA GDS Installation Guide to install NVIDIA drivers, nvidia-fs, and the CUDA toolkit.
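On a host with the CUDA network repository configured, the GDS components typically install as a single meta-package; the package name below follows NVIDIA's guide for Ubuntu, so adjust for your distribution:
# Install the cuFile user-space libraries and the nvidia-fs kernel module
sudo apt install nvidia-gds
# Confirm the kernel module is loaded
lsmod | grep nvidia_fs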
2. Configuring PowerScale for GDS
Apply the settings we discussed earlier:
isi compression settings modify --enabled=0
isi dedupe inline settings modify --mode=disabled
Ensure NFS over RDMA is enabled on the PowerScale subnet and network pool.
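You can confirm the current state from the OneFS CLI before mounting:
# Look for the NFS-over-RDMA setting in the output
isi nfs settings global view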
3. Mounting the PowerScale NFS Share Using RDMA
mount -o rdma,vers=3 172.17.2.101:/ifs/RDMA-Test /mnt/RDMA
Confirm the mount with df -h:
Filesystem Size Used Avail Use% Mounted on
172.17.2.101:/ifs/RDMA-Test 110T 43T 64T 41% /mnt/RDMA
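To honor the mount-point mapping described earlier, each GPU gets its own mount backed by a dedicated PowerScale front-end IP. A minimal sketch, with illustrative IPs and mount points:
# One mount per GPU/NIC pair, each against a different front-end IP in the RDMA pool
mount -t nfs -o rdma,vers=3 172.17.2.101:/ifs/RDMA-Test /mnt/RDMA-gpu0
mount -t nfs -o rdma,vers=3 172.17.2.102:/ifs/RDMA-Test /mnt/RDMA-gpu1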
4. Testing and Validation
NUMA Topology Verification
Use nvidia-smi topo -m to check the NUMA topology:
[root@hop-r7525-05 tools]# nvidia-smi topo -m
GPU0 GPU1 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE 64-127 1 N/A
GPU1 NODE X NODE NODE 64-127 1 N/A
NIC0 NODE NODE X PIX
NIC1 NODE NODE PIX X
This output shows both GPUs and NICs within the same NUMA node (Node 1), which is optimal for RDMA performance.
GDS Verification
Use the gdscheck.py tool to validate GDS configuration:
./gdscheck.py -p
Sample output (success! We know it works):
GDS release version: 1.11.1.6
nvidia_fs version: 2.22 libcufile version: 2.12
...
NFS: Supported
...
rdma devices: Configured
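If gdscheck reports the RDMA devices as unconfigured, the client-side cuFile settings live in /etc/cufile.json; as an assumption based on recent cuFile releases, the rdma_dev_addr_list property is the one that pins cuFile to the client IPs used for RDMA:
# Check whether cuFile has been pinned to specific client RDMA IPs
grep -n "rdma_dev_addr_list" /etc/cufile.json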
Performance Testing
Let’s run some tests using gdsio:
Write Performance with GDS:
sudo ./gdsio -f /mnt/RDMA/testfile -d 0 -m 0 -s 10G -i 1M -w 10 -x 0 -I 1
Output:
IoType: WRITE XferType: GPUD Threads: 10 DataSetSize: 9669632/10485760(KiB) IOSize: 1024(KiB) Throughput: 1.489903 GiB/sec, Avg_Latency: 6543.972718 usecs
Read Performance with GDS:
sudo ./gdsio -f /mnt/RDMA/testfile -d 0 -m 0 -s 10G -i 1M -w 10 -x 0 -I 0
Output:
IoType: READ XferType: GPUD Threads: 10 DataSetSize: 10177536/10485760(KiB) IOSize: 1024(KiB) Throughput: 6.096986 GiB/sec, Avg_Latency: 1569.551947 usecs
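For a like-for-like comparison against the traditional CPU bounce-buffer path, the same run can be repeated with the transfer type switched from GPU Direct to CPU-only; in gdsio's help output, -x 0 is the GPU Direct path and -x 1 is the CPU-only path, with all other flags unchanged:
sudo ./gdsio -f /mnt/RDMA/testfile -d 0 -m 0 -s 10G -i 1M -w 10 -x 1 -I 0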
5. Monitoring and Troubleshooting
Use nvidia-smi to monitor GPU activity during tests:
watch -n 1 nvidia-smi
For detailed GDS statistics, use gds_stats:
./gds_stats -p <pid>
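The nvidia-fs kernel module also exposes its own counters, which are useful when gds_stats alone doesn't tell the whole story:
# Cumulative GDS read/write counters maintained by the nvidia-fs driver
cat /proc/driver/nvidia-fs/stats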
Real-World Application: Accelerating Data Loading with NVIDIA DALI

While benchmarks are crucial, the real test lies in practical applications. Enter NVIDIA DALI (Data Loading Library), a game-changer for optimizing data preprocessing in machine learning workflows.
Why NVIDIA DALI?
DALI addresses a common bottleneck in deep learning: data loading and preprocessing. By leveraging GPU acceleration, DALI ensures your model’s training pipeline operates at maximum efficiency, reducing overall training time.
Key benefits include:
- Maximizing GPU utilization
- Creating highly efficient pipelines
- Cross-framework compatibility (TensorFlow, PyTorch, MXNet)
Getting Started with NVIDIA DALI
Let’s create a simple data loader using DALI to process images from the COCO dataset:
- First, install DALI (the wheel name carries a CUDA suffix; pick the one matching your CUDA toolkit):
pip install --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda120
- Download and extract the COCO dataset:
wget http://images.cocodataset.org/zips/train2017.zip
unzip train2017.zip -d /mnt/RDMA/2017/
- The Python script below creates a DALI pipeline for loading and processing images; it is broken down step by step in the next section.
Deep Dive: Understanding the NVIDIA DALI Loader Script
Let’s break down our DALI loader script to understand how it efficiently loads and processes images using GPU acceleration. This script demonstrates the power of combining GDS with DALI for high-performance data preprocessing from the PowerScale array.
The script can be downloaded from https://github.com/DellGEOS/GDS/tree/main
import os
import numpy as np
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.plugin.pytorch import DALIGenericIterator
# Directory containing images
file_root = "/mnt/RDMA/train2017"
image_files = [os.path.join(file_root, f) for f in os.listdir(file_root) if f.endswith(('.jpg', '.png'))]
print(f"Found {len(image_files)} images.")
This section sets up our environment:
- We import necessary libraries, including DALI components.
- We define the directory where our images are stored (file_root).
- We create a list of all JPEG and PNG files in the directory.
class ExternalInputIterator:
    def __init__(self, files, batch_size):
        self.files = files
        self.batch_size = batch_size
        self.index = 0

    def __iter__(self):
        self.index = 0
        return self

    def __next__(self):
        if self.index >= len(self.files):
            raise StopIteration
        batch = []
        for _ in range(self.batch_size):
            if self.index >= len(self.files):
                break
            with open(self.files[self.index], 'rb') as f:
                batch.append(np.frombuffer(f.read(), dtype=np.uint8))
            self.index += 1
        return batch
The ExternalInputIterator class is a custom iterator that:
- Initializes with a list of files and a batch size.
- Implements the iterator protocol (__iter__ and __next__ methods).
- Reads files in batches, converting each image file to a NumPy array of bytes.
- This allows DALI to efficiently load data from our GDS-mounted storage.
@pipeline_def
def create_pipeline():
    jpegs = fn.external_source(source=eii, dtype=types.UINT8)
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, device="gpu", resize_x=224, resize_y=224)
    return images
The create_pipeline function defines our DALI pipeline:
- It uses the @pipeline_def decorator to create a DALI pipeline.
- fn.external_source reads data from our custom iterator.
- fn.decoders.image decodes the JPEG data on a "mixed" device (hybrid CPU/GPU decoding that outputs directly to GPU memory).
- fn.resize resizes the images to 224×224 pixels, explicitly on the GPU for speed.
# Create an instance of the ExternalInputIterator
batch_size = 32
eii = ExternalInputIterator(image_files, batch_size)
# Instantiate and build the pipeline
pipe = create_pipeline(batch_size=batch_size, num_threads=4, device_id=0)
pipe.build()
# Create a DALI iterator to test loading the images
dali_iter = DALIGenericIterator(pipe, ['data'], size=len(image_files))
# Test loading images using the DALI iterator
for data in dali_iter:
    print("Data loaded successfully!")
    print("Shape of the batch:", data[0]['data'].shape)
    break  # Just check the first batch
This final section puts everything together:
- We create an instance of our ExternalInputIterator.
- We instantiate and build the DALI pipeline, specifying batch size, number of threads, and GPU device ID.
- We create a DALIGenericIterator, which wraps our pipeline for easy integration with PyTorch.
- Finally, we test the iterator by loading one batch of images and printing its shape.
Key Benefits of This Approach
- GPU Acceleration: By using DALI and GDS, we offload image decoding and resizing to the GPU, significantly speeding up these operations.
- Efficient Data Loading: The ExternalInputIterator allows us to read data directly from our GDS-mounted storage, minimizing I/O bottlenecks.
- Scalability: This approach can handle large datasets efficiently, as demonstrated by processing over 118,000 images from the COCO dataset.
- Framework Integration: While we’re using PyTorch in this example (via DALIGenericIterator), DALI can integrate with other frameworks like TensorFlow as well.
- Preprocessing Pipeline: The DALI pipeline allows us to easily chain multiple preprocessing steps (like decoding and resizing) that are executed efficiently on the GPU.
By leveraging NVIDIA DALI in combination with GDS, we’ve created a high-performance data loading and preprocessing pipeline. This approach can significantly reduce the time spent on data preparation, allowing researchers and data scientists to focus more on model development and training.
Running this script yields the following output:
Found 118287 images.
Data loaded successfully!
Shape of the batch: torch.Size([32, 224, 224, 3])
This output confirms that we’ve successfully created a data loading pipeline using GDS and DALI, efficiently processing a batch of 32 images, each resized to 224×224 pixels with 3 color channels.
Conclusion
By integrating NVIDIA GPUDirect Storage with Dell PowerScale and leveraging NVIDIA DALI, we’ve unlocked a new level of performance for data-intensive workloads. Our tests demonstrate significant improvements in throughput, confirming the effectiveness of GDS in optimizing storage-to-GPU data transfers.
This setup not only accelerates raw data transfer but also streamlines the entire data preprocessing pipeline, crucial for machine learning workflows. As we’ve seen with our DALI example, this combination allows for efficient, GPU-accelerated data loading from our PowerScale array and preprocessing, setting the stage for faster model training and inference.
For data scientists and AI researchers working with large datasets, this architecture provides a robust foundation for deep learning and analytics workloads.
Next Steps
With this powerful infrastructure in place, the next logical step is to fine-tune a vision model using our optimized data pipeline. This could involve:
- Implementing a full training loop using a framework like PyTorch or TensorFlow
- Benchmarking training times with and without GDS to quantify performance gains
- Exploring more complex data augmentation techniques leveraging DALI’s capabilities

