In our last two blogs, we demonstrated how to enhance PowerScale performance using TCP Multipath and RDMA Multipath, allowing clients to access multiple PowerScale nodes at once. With those configurations in place, we saw significant improvements in redundancy and overall throughput.
Now, we’re pushing further. This time, we’re testing four network configurations: TCP, TCP with Multipath (TCP-MP), RDMA, and RDMA with Multipath (RDMA-MP). We want to evaluate how they perform, verify that they actually work, and in particular see how well RDMA fares compared to TCP under both standard and multipath conditions.
Let’s dive into the setup, the results, and what we learned along the way.
Test Rig
For these tests, we built an end-to-end 100GbE test scenario with high-performance hardware:

- PowerScale Array: 4x PowerScale F210 nodes, each equipped with 4x 3.84TB NVMe drives.
- Switch: 100GbE switch connecting the PowerScale nodes to the test server.
- Test Server: Dell PowerEdge R7525 (128 cores/256 threads, AMD EPYC), outfitted with dual 100GbE NICs, running Red Hat Enterprise Linux 8 (RHEL 8). For these tests, only one of the two 100GbE ports was active 😦
Please note: these are not official performance figures; they are simply the results of my own testing in a 100GbE end-to-end configuration. This test rig is designed for a high-bandwidth scenario, enabling us to push the limits of each configuration (TCP, TCP-MP, RDMA, and RDMA-MP) and verify that we can achieve true 100GbE throughput from the server to the PowerScale cluster.
Testing Methodology: We are aiming for high concurrency in these tests to allow the multipath configurations to shine. A single read or write stream would not show much of a performance difference with multipath, and is not really why you would want to implement a multipath topology in the first place.
We tested four different setups:
- TCP – Standard network configuration.
- TCP with Multipath (TCP-MP) – Using multiple paths for redundancy and improved performance.
- RDMA – Remote Direct Memory Access, for low-latency, high-throughput networking.
- RDMA with Multipath (RDMA-MP) – Combining RDMA’s strengths with multipath redundancy.
For a detailed guide on configuring RDMA, check out this useful resource: NFS over RDMA Cluster Configuration.
We configured the fstab file to mount each network setup, leveraging the nconnect option to create multiple streams and the remoteports parameter to manage port allocation on the PowerScale array side. We set the read and write block size to 1MB.
For all configurations, we set nconnect=32, which creates 32 streams (spread as 8 streams per node when multipath across the four-node cluster is in use). For the multipath tests, we added the option remoteports=192.168.1.201-192.168.1.204 (the PowerScale cluster node IPs), which ensures that data is transmitted across the network to all PowerScale nodes to achieve aggregate throughput.
Test 1 – NFS Mount Over TCP – 32 TCP Streams (no multipath – control test)
192.168.1.201:/ifs/data/f210/perforce/nfs_depot /mnt/tcp nfs rw,vers=3,nconnect=32,rsize=1048576,wsize=1048576,timeo=10,soft 0 0
Test 2 – NFS Mount Over TCP – 32 TCP Streams over Multipath (PowerScale destination IPs configured)
192.168.1.201:/ifs/data/f210/perforce/nfs_mp_depot /mnt/tcp-multipath nfs rw,vers=3,nconnect=32,localports=192.168.1.21-192.168.1.22,remoteports=192.168.1.201-192.168.1.204,rsize=1048576,wsize=1048576,timeo=10,soft 0 0
Test 3 – NFS Mount Over RDMA – 32 Streams (control test)
192.168.1.201:/ifs/data/f210/perforce/nfs_rdma_depot /mnt/rdma nfs proto=rdma,port=20049,vers=3,nconnect=32,rsize=1048576,wsize=1048576,timeo=10,soft 0 0
Test 4 – NFS Mount Over RDMA – 32 Streams over Multipath (PowerScale destination IPs configured)
192.168.1.201:/ifs/data/f210/perforce/nfs_mp_rdma_depot /mnt/rdma-multipath nfs proto=rdma,port=20049,vers=3,nconnect=32,remoteports=192.168.1.201-192.168.1.204,rsize=1048576,wsize=1048576,timeo=10,soft 0 0
Verification
All directories are now mounted as shown
[root@p4-01 fio]# df -h
192.168.1.201:/ifs/data/f210/perforce/nfs_mp_rdma_depot 222T 172T 21T 90% /mnt/rdma-multipath
192.168.1.201:/ifs/data/f210/perforce/nfs_depot 222T 172T 21T 90% /mnt/tcp
192.168.1.201:/ifs/data/f210/perforce/nfs_mp_depot 222T 172T 21T 90% /mnt/tcp-multipath
192.168.1.201:/ifs/data/f210/perforce/nfs_rdma_depot 222T 172T 21T 90% /mnt/rdma
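As an optional extra check, you can confirm from the client which options each NFS mount actually negotiated. A minimal sketch (the exact output format varies by distribution and kernel):

# Re-read /etc/fstab and mount anything not already mounted
mount -a

# Show the negotiated options for each NFS mount (vers, proto, rsize/wsize; nconnect
# may also be listed, depending on the kernel version)
nfsstat -m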
Testing with FIO
Using the Linux fio tool, we will run a simple read and write benchmark against the mounted directories.
Here are the contents of one of my fio config files (Test 1, NFS over TCP). The config is the same for all tests; only the directory we read from and write to changes.
There are 10 write files and 10 read files of 10GB each, and every read and write job runs concurrently, i.e. we will simultaneously write 100GB to the PowerScale cluster and read it back out.
[root@p4-01 fio]# cat fio-sequential-test-TCP.fio
[global]
ioengine=libaio
direct=1
bs=1M # Block size set to 1MB
size=10G # File size for each test file
numjobs=1 # One job per file
iodepth=128 # I/O depth for concurrency
runtime=60 # Run for 60 seconds
time_based=1 # Ensure the test runs for a fixed duration
group_reporting=1 # Provide aggregated stats
# Simultaneous Writes
[write1]
rw=write # Write operation
filename=/mnt/tcp/testfile_write1 # First write file
[write2]
rw=write
filename=/mnt/tcp/testfile_write2 # Second write file
[write3]
......
# Simultaneous Reads
[read1]
rw=read # Read operation
filename=/mnt/tcp/testfile_read1 # First read file
[read2]
rw=read
filename=/mnt/tcp/testfile_read2 # Second read file
[read3]
..... (you get the idea!)
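Running the benchmark itself is straightforward. A minimal sketch, assuming the job file above is saved as fio-sequential-test-TCP.fio (with equivalent job files pointing at the other three mounts):

# Run the TCP job file; repeat with the TCP-MP, RDMA, and RDMA-MP variants
fio fio-sequential-test-TCP.fio

# Optionally capture the full output for later comparison
fio fio-sequential-test-TCP.fio --output=results-tcp.txt

With group_reporting enabled, the aggregate read and write bandwidth figures appear in the summary at the end of the run.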
First, let’s verify the behavior of each mount before looking at some performance findings.
Test 1 – NFS Mount Over TCP – 32 TCP Streams
Let’s perform an I/O test against the /mnt/tcp directory. As the directory name suggests, in Test 1 this directory is mounted using a standard TCP mount.
We will run a benchmark read and write test against this directory, and we should see a single PowerScale node handling all I/O traffic.
Over on the PowerScale array, we run the command isi statistics system -n=all --format top to view the incoming traffic for the cluster. Here we can see all traffic is being served by node 1 in our cluster (notice NFS activity is 0 on the other three nodes), as expected.
Node CPU SMB FTP HTTP NFS HDFS S3 Total NetIn NetOut DiskIn DiskOut
All 12.6% 0.0 0.0 0.0 6.4G 0.0 0.0 6.4G 1.1G 6.8G 27.9M 4.7G
1 28.7% 0.0 0.0 0.0 6.4G 0.0 0.0 6.4G 1.1G 6.8G 25.8M 1.5G
2 7.6% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0M 1.8G
3 6.4% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 9.8k 488.6M
4 7.6% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.3k 823.2M
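As a side note, you can also watch the client side while fio runs to cross-check the cluster figures. A minimal sketch, assuming the sysstat package is installed and the active 100GbE interface is named ens3f0 (substitute your own interface name):

# Per-interface throughput on the client, sampled once per second
sar -n DEV 1 | grep ens3f0

# Or, without sysstat, sample the raw interface byte counters
ip -s link show ens3f0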
Test 2 – NFS Mount Over TCP Multi-Path – 32 TCP Streams
Let’s perform an I/O test against the /mnt/tcp-multipath directory. As the directory name suggests, in Test 2 this directory is mounted over standard TCP but with the multipath (remoteports) option configured.
We will run a benchmark read and write test against this directory, and we should see all PowerScale nodes handling I/O traffic.
Over on the PowerScale array, we run the command isi statistics system -n=all --format top to view the incoming traffic for the cluster. Here we can see traffic is being served by every node in the cluster, as expected.
Note the healthy aggregate throughput of 8.2GB/s to/from the cluster, versus 6.4GB/s in the single-node TCP test above.
Node CPU SMB FTP HTTP NFS HDFS S3 Total NetIn NetOut DiskIn DiskOut
All 36.7% 0.0 0.0 0.0 8.2G 0.0 0.0 8.2G 1.6G 7.0G 443.9M 7.6G
1 23.0% 0.0 0.0 0.0 724.9M 0.0 0.0 724.9M 1.0G 1.9M 30.6M 1.9G
2 36.6% 0.0 0.0 0.0 3.0G 0.0 0.0 3.0G 1.3M 3.2G 162.6M 1.9G
3 60.1% 0.0 0.0 0.0 3.7G 0.0 0.0 3.7G 1.5M 3.7G 170.6M 1.9G
4 27.1% 0.0 0.0 0.0 892.2M 0.0 0.0 892.2M 591.4M 1.1M 80.0M 1.9G
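If you want to confirm from the client that the 32 nconnect streams really are spread across all four PowerScale nodes, a quick sketch (assuming the node IPs used in the fstab entries above):

# Count the client's TCP connections to NFS (port 2049) per PowerScale node IP
for ip in 192.168.1.201 192.168.1.202 192.168.1.203 192.168.1.204; do
    echo -n "$ip: "
    ss -tn "dst $ip" | grep -c ":2049"
done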
Test 3 – NFS Mount Over RDMA – 32 Streams
Next, let’s perform an I/O test against the /mnt/rdma directory. As the directory name suggests, in Test 3 this directory is mounted using RDMA (no multipath).
We will run a benchmark read and write test against this directory, and we should see a single PowerScale node handling all I/O traffic.
Over on the PowerScale array, we run the command isi statistics system -n=all --format top to view the incoming traffic for the cluster. Here we can see all traffic is being served by one node in the cluster, again as expected.
Note: in this non-multipath test, RDMA to a single node is performing roughly on a par with TCP multipath (interesting!).
Node CPU SMB FTP HTTP NFS HDFS S3 Total NetIn NetOut DiskIn DiskOut
All 30.6% 0.0 0.0 0.0 7.7G 0.0 0.0 7.7G 206.4 88.8 114.3M 6.6G
1 99.5% 0.0 0.0 0.0 7.7G 0.0 0.0 7.7G 62.4 88.8 28.5M 1.7G
2 7.5% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 60.0 0.0 28.0M 1.7G
3 8.2% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 24.0 0.0 25.9M 1.5G
4 7.1% 0.0 0.0 0.0 0.0 0.0 0.0 0.0 60.0 0.0 32.0M 1.8G
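It is also worth confirming on the client that this traffic really is flowing over RDMA rather than silently falling back to TCP. A minimal sketch, assuming a NIC exposed as RDMA device mlx5_0 (substitute your own device name):

# The mount should show proto=rdma,port=20049 in its options
mount | grep /mnt/rdma

# RDMA port data counters should climb rapidly while fio is running
# (on many NICs these counters are reported in 4-byte words)
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data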
Test 4 – NFS Mount Over RDMA Multipath – 32 Streams
Let’s perform an I/O test against the /mnt/rdma-multipath directory. As the directory name suggests, in Test 4 this directory is mounted using RDMA with multipath.
We will run a benchmark read and write test against this directory, and we should see all PowerScale nodes handling I/O traffic.
Over on the PowerScale array, we run the command isi statistics system -n=all --format top to view the incoming traffic for the cluster. Here we can see traffic is being served by all nodes in the cluster, again as expected.
Note that RDMA Multipath excels here, at 14.6GB/s.
Node CPU SMB FTP HTTP NFS HDFS S3 Total NetIn NetOut DiskIn DiskOut
All 22.9% 0.0 0.0 0.0 14.6G 0.0 0.0 14.6G 76.1 104.8 115.2M 80.3k
1 27.2% 0.0 0.0 0.0 690.8M 0.0 0.0 690.8M 50.4 104.8 33.1M 9.8k
2 14.8% 0.0 0.0 0.0 4.4G 0.0 0.0 4.4G 25.7 0.0 25.3M 24.6k
3 31.3% 0.0 0.0 0.0 5.0G 0.0 0.0 5.0G 0.0 0.0 29.1M 16.4k
4 18.3% 0.0 0.0 0.0 4.4G 0.0 0.0 4.4G 0.0 0.0 27.7M 29.5k
Benchmark Test Findings
1. RDMA-MP Excels with High Concurrency Workloads
- RDMA-MP shows superior performance when handling a large number of simultaneous read and write operations.
- The read bandwidth of 93.38 Gbps indicates that RDMA-MP can effectively utilize nearly the full capacity of the 100 Gbps network infrastructure under high-concurrency conditions.
2. TCP-MP Performance Improves but Is Outpaced by RDMA-MP
- While TCP-MP also benefits from increased workload streams, achieving 65.81 Gbps in read bandwidth, it does not reach the same level of performance as RDMA-MP.
- TCP’s inherent overhead and protocol inefficiencies become more pronounced under high-concurrency workloads compared to RDMA.
3. RDMA-MP Offers Superior Write Performance
- RDMA-MP continues to provide higher write bandwidth (20.50 Gbps) than TCP-MP (17.38 Gbps).
- The advantages of RDMA in bypassing the kernel and reducing CPU overhead contribute to its superior write performance.
4. Concurrency Amplifies RDMA Benefits
- The results suggest that RDMA protocols scale better with increased concurrency compared to TCP.
- RDMA’s ability to handle multiple simultaneous data transfers efficiently makes it more suitable for workloads with high levels of parallelism.
Recommendations
1. Leverage RDMA-MP for High-Concurrency Workloads
- For applications that involve numerous simultaneous read and write operations, RDMA-MP is the recommended configuration.
- Workloads such as big data analytics, high-performance computing, and large-scale simulations can benefit from RDMA-MP’s superior performance under high concurrency.
2. Optimize Applications to Increase Concurrency
- If feasible, modify applications to perform more parallel read/write operations to take full advantage of RDMA-MP’s capabilities (a fio sketch illustrating this follows this list).
- Ensure that the application architecture supports multithreading or multiprocessing to increase the number of simultaneous I/O streams.
3. Monitor System Resources
- Ensure that the CPU, memory, and storage subsystems are not bottlenecks when increasing concurrency.
- Monitor system metrics to verify that resources are scaled appropriately to handle the increased workload without degradation.
4. Fine-Tune RDMA Parameters
- Adjust RDMA-specific settings such as the number of queue pairs, completion queues, and other RDMA verbs parameters to optimize performance for high-concurrency workloads.
5. Consider Network Infrastructure
- Verify that the network infrastructure can handle high-throughput RDMA traffic.
- Ensure switches and NICs support RDMA over Converged Ethernet (RoCE) or InfiniBand as applicable.
- Check for any network contention or congestion that could affect performance.
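To illustrate recommendation 2, here is a hypothetical fio invocation (not one of the job files used in the tests above) showing how concurrency can be scaled up against a single mount by raising numjobs and iodepth:

# Hypothetical example: 8 parallel reader jobs, each keeping 128 outstanding 1MB I/Os
# in flight against the RDMA multipath mount; adjust numjobs/iodepth for your workload
fio --name=concurrent-read --directory=/mnt/rdma-multipath \
    --ioengine=libaio --direct=1 --bs=1M --size=10G \
    --rw=read --numjobs=8 --iodepth=128 \
    --runtime=60 --time_based --group_reporting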
Additional Observations
- RDMA-MP’s ability to nearly saturate a 100 Gbps link demonstrates its efficiency in handling high-bandwidth requirements.
- TCP’s overhead becomes more limiting as the number of concurrent operations increases, highlighting RDMA’s advantages in low-latency, high-throughput scenarios.
- Write performance gains are less pronounced than read performance gains but still favor RDMA-MP.
Summary
RDMA-MP significantly outperforms TCP-MP in both read and write operations under high-concurrency workloads. RDMA-MP’s read bandwidth reached 93.38 Gbps, closely approaching the maximum capacity of the 100 Gbps network link, and surpassed TCP-MP’s read bandwidth by a substantial margin.
These findings suggest that RDMA-MP is a strong topology and configuration choice for environments with high levels of simultaneous read and write operations. By leveraging RDMA-MP, you can maximize the utilization of your high-speed network infrastructure and improve the performance of applications designed to handle concurrent workloads.

