Creating a Secure MLOps Environment with PowerScale and ClearML: A Practical Guide #2

Step 1 – A Well-Organized Folder Structure

Before configuring networking or access zones, we need to establish a well-organized directory structure. This foundation will support our entire MLOps infrastructure, from data storage to Kubernetes integration. The workflow typically runs in the following order:

  1. Folder structure
  2. Networking and DNS
  3. SmartConnect
  4. Access zones

Understanding the Structure

Base Path Convention

We will follow PowerScale best practices for Kubernetes CSI implementation:

/ifs/data/<cluster_name>/

This structure simplifies:

  • Multi-cluster management
  • CSI driver configuration
  • Storage class implementation
  • Future scalability

Our Implementation

/ifs/data/cls-01/            # Cluster root
└── ml-lab/                  # ML access zone root
    ├── artifacts/           # Training outputs & metrics
    ├── datasets/            # ML training data
    ├── logs/                # System & application logs
    ├── models/              # Trained models
    └── rke2-mlops/          # Kubernetes PVC storage

Directory Purposes

  1. artifacts/ – Training outputs, metrics, and results
    • Experiment results
    • Model performance data
    • Visualization artifacts
  2. datasets/ – Training and validation data
    • Raw datasets
    • Processed data
    • Test datasets
  3. logs/ – Operational logging
    • Training logs
    • System logs
    • Application logs
  4. models/ – Model storage
    • Trained models
    • Model checkpoints
    • Production models
  5. rke2-mlops/ – Kubernetes storage
    • Persistent Volume Claims
    • RKE2 cluster storage
    • Container persistent storage

# Create base structure
mkdir -p /ifs/data/cls-01/ml-lab

# Create ML directories
cd /ifs/data/cls-01/ml-lab
mkdir -p artifacts datasets logs models rke2-mlops


# Set appropriate permissions
chmod 755 /ifs/data/cls-01/ml-lab
chmod 750 /ifs/data/cls-01/ml-lab/{artifacts,datasets,logs,models,rke2-mlops}


# Verify structure
tree /ifs/data/cls-01/ml-lab
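
The same steps can be wrapped in a small script that can be rehearsed away from the cluster first. This is only a sketch: ML_LAB_BASE, create_layout and verify_layout are hypothetical names, not part of OneFS, and the default base is a scratch directory rather than /ifs so nothing on the cluster is touched until you point it there.

```shell
#!/bin/sh
# Sketch: create the ml-lab layout under a configurable base, then verify it.
# ML_LAB_BASE is a hypothetical variable; it defaults to a scratch directory
# so the script can be dry-run before being pointed at /ifs/data/cls-01/ml-lab.
ML_LAB_BASE="${ML_LAB_BASE:-${TMPDIR:-/tmp}/ml-lab-dryrun}"
ML_LAB_DIRS="artifacts datasets logs models rke2-mlops"

create_layout() {
    # $1 = zone root; the root gets 755 and each child 750, as in the steps above
    mkdir -p "$1" && chmod 755 "$1" || return 1
    for d in $ML_LAB_DIRS; do
        mkdir -p "$1/$d" && chmod 750 "$1/$d" || return 1
    done
}

verify_layout() {
    # Prints every missing directory; returns non-zero if any is absent
    missing=0
    for d in $ML_LAB_DIRS; do
        if [ ! -d "$1/$d" ]; then
            echo "missing: $1/$d"
            missing=1
        fi
    done
    return "$missing"
}

create_layout "$ML_LAB_BASE" && verify_layout "$ML_LAB_BASE" && echo "layout OK: $ML_LAB_BASE"
```

Running it with ML_LAB_BASE=/ifs/data/cls-01/ml-lab on the cluster should reproduce the structure above. Because verify_layout does not depend on tree, it can also be used on hosts where tree is not installed.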

Good. Now that we have our folder structure, we can configure networking, which is a prerequisite for creating the access zone.

Step 2 – Networking

Before diving into ML workflows, containers, and AI tools, we need a solid foundation to support everything else, and that means proper networking before any data can flow. Our MLOps workloads need:

  • Dedicated bandwidth for large datasets
  • Predictable performance for training
  • Isolated traffic for security

We will use two key OneFS features: Access Zones and SmartConnect. I am referencing the following publicly available design and best-practice guide: Dell PowerScale: Network Design Considerations | Dell Technologies Info Hub

We will implement and test the following:

The network we will use for all of this is 192.168.30.x. We want everything to follow best practice and work via DNS (which is also a requirement of SmartConnect). Our DNS entry point to the ML lab will be ml-lab.lab.local, so our SmartConnect zone needs to be configured accordingly.

Our network config will be as follows

  • PowerScale Interface – Ext2
  • Subnet(30) – 192.168.30.x – This is our IP Subnet for the ml-lab and access zone
  • SmartConnect Service IP – 192.168.30.30
  • Node Pool IP Range – 192.168.30.31-35
  • SmartConnect DNS Entry Point – ml-lab.lab.local

SmartConnect: Network Load Balancing and DNS Resolution

SmartConnect manages client connections to your PowerScale cluster through DNS-based load balancing, but there is an important distinction in how network traffic actually flows.

How SmartConnect Really Works (Hint, it’s a DNS Server!)

  1. The client requests a connection to the SmartConnect zone name, in our case ml-lab.lab.local
  2. DNS resolution occurs:
    • The SmartConnect service IP does NOT serve data traffic. In our configuration, 192.168.30.30 is only the DNS endpoint that answers lookups for the zone
    • SmartConnect answers each lookup with an IP address from the pool of available node IP addresses
    • The returned IP belongs to a specific node in the cluster
  3. Actual data traffic flows directly through node IP addresses, not through the service IP
  4. If a node becomes unavailable, DNS will resolve to different node IP addresses

Key Technical Points

  • The SmartConnect service IP is defined at the subnet level
  • The SmartConnect service name is purely a DNS entry point
  • No data traffic passes through the SmartConnect service IP
  • Real data transfer occurs directly with node IP addresses
  • IP pool configuration is critical for proper load distribution
  • DNS TTL settings affect how quickly connection changes propagate
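
To make the flow concrete, here is a toy model of round-robin answers over our pool. It is not OneFS code and makes no claim about how SmartConnect is implemented internally. answer_for_query is a hypothetical helper that simply walks the 192.168.30.31-35 range, ignoring node health and the other connection policies.

```shell
#!/bin/sh
# Toy model of round-robin DNS answers over the node pool. NOT OneFS code.
POOL_PREFIX="192.168.30"   # pool range from the network config above
POOL_FIRST=31
POOL_LAST=35

# answer_for_query N -> the pool IP a plain round-robin policy would hand
# to the Nth lookup (0-based). Hypothetical helper for illustration only.
answer_for_query() {
    size=$((POOL_LAST - POOL_FIRST + 1))
    echo "$POOL_PREFIX.$((POOL_FIRST + $1 % size))"
}

# Seven consecutive lookups wrap around the five-address pool.
n=0
while [ "$n" -lt 7 ]; do
    echo "query $n -> $(answer_for_query "$n")"
    n=$((n + 1))
done
```

The point to notice is that the service IP 192.168.30.30 never appears in an answer: every lookup returns a node address, and that is where the client then sends its data.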

Configuring Networking, Access Zone and SmartConnect

First, we need to create the subnet that carries our SmartConnect service IP.

Step 1: PowerScale Network Pool Configuration

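# Configure the subnet with its SmartConnect service IP (command line)
# Note: flag names and the subnet ID format differ between OneFS releases
# (recent releases typically expect an ID such as groupnet0.subnet-mlops);
# confirm with `isi network subnets create --help` before running.
isi network subnets create subnet-mlops \
--addr=192.168.30.0 \
--gateway=192.168.30.1 \
--netmask=255.255.255.0 \
--sc-service-addr=192.168.30.30 \
--sc-service-name=ml-lab.lab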

#Configure the network pool with SmartConnect (GUI)

Before creating the network pool, we first need to create our access zone.

Step 2: Configure Access Zone

Let's create the access zone and assign it to our ml-lab network before we configure DNS and test the setup. (For some reason, I always prefer to do this step through the GUI!)

In the GUI go to Access -> Access Zones

Here we create our "ml-lab" access zone and configure the zone base directory to be /ifs/data/cls-01/ml-lab

Next, let's assign this access zone to the "ml-ops" pool (remember, access zones are assigned at the pool level).

OK, we're good to finish our network config.

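# Create network pool
# Note: on recent OneFS releases the pool ID is usually written as
# <groupnet>.<subnet>.<pool>; check `isi network pools create --help`.
isi network pools create mlpool \
--ranges=192.168.30.31-192.168.30.35 \
--ifaces=ext-2 \
--sc-dns-zone=ml-lab.lab.local \
--sc-connect-policy=round_robin \
--sc-dns-zone-aliases=mlops.lab.local \
--description="ML Lab Network Pool"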

# Create network pool (GUI)

OK, now we're ready to configure and test DNS for SmartConnect.

PowerScale DNS Configuration Best Practices for ML-Ops

After setting up our network pool and access zones, proper DNS configuration is crucial for a robust ML-Ops environment. Let’s dive into DNS delegation best practices and implementation.

DNS Architecture Overview

In our ML-Ops setup, we’re implementing the following DNS structure:

  • Primary Zone: lab.local
  • SmartConnect Zone: ml-lab.lab.local
  • Service IP (SSIP): 192.168.30.30

DNS Delegation Best Practices

Again, I'm referring to the following best practices and implementation guide: DNS delegation best practices | Dell PowerScale: Network Design Considerations | Dell Technologies Info Hub

Use Address (A) Records, Not Direct IP Delegation

Always delegate to Address (A) records rather than IP addresses directly. This approach simplifies:

  • Business continuity management
  • Maintenance operations
  • Disaster recovery scenarios

OK, first things first: let's create an A record pointing to our SmartConnect Service IP (SSIP). We'll then delegate to this record for our DNS entry.

Next, right-click the lab.local folder object and choose "New Delegation".

The FQDN I've chosen is ml-lab.lab.local. It points to the SmartConnect service address's A record, cls01-ssip.lab.local (192.168.30.30). What are we doing here? We're saying that a lookup for ml-lab.lab.local should be referred to cls01-ssip.lab.local (192.168.30.30). As already discussed, SmartConnect is a DNS server, so it takes this query (forwarded to it by our local DNS server) and returns an IP address for ml-lab.lab.local from the available address pool. Notice that at no stage have we defined an actual storage IP address for ml-lab.lab.local anywhere in our lab setup, nor should we!

Implement One Name Server Record Per Zone

We are only creating one delegation for our MLOps lab, but in general you should create an individual delegation for each SmartConnect zone or alias. This enables:

  • Granular failover control
  • Independent zone management
  • Workflow isolation

Implementation Steps (PowerShell)

1. Configure Windows DNS Server

# Create Primary Forward Lookup Zone
Add-DnsServerPrimaryZone -Name "lab.local" -ZoneFile "lab.local.dns"

# Create SmartConnect Service IP A Record
# (named ml-lab-ssip here; the GUI walkthrough above used cls01-ssip for the same IP)
Add-DnsServerResourceRecordA -Name "ml-lab-ssip" -ZoneName "lab.local" -IPv4Address "192.168.30.30"

# Add SmartConnect Zone Delegation
Add-DnsServerZoneDelegation -Name "lab.local" -ChildZoneName "ml-lab" -NameServer "ml-lab-ssip.lab.local" -IPAddress "192.168.30.30"

2. Verify DNS Configuration

Windows DNS config

Let's test DNS resolution for this name from the Linux host that will be running the ClearML Agent. Remember, we expect an IP address to be returned from the node pool 192.168.30.31-35.

192.168.30.31 returned, success! We're ready to test an export and confirm zone access.
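
If you want to repeat this check from a script, for example before starting an agent, something along these lines could work. It is a sketch under assumptions: in_pool and check_smartconnect are hypothetical helper names, the pool range is hard-coded from our config, and getent is assumed to be available on the Linux host.

```shell
#!/bin/sh
# in_pool IP -> succeeds only if IP lies in the node pool 192.168.30.31-35.
# Hypothetical helper; the range is hard-coded from the config above.
in_pool() {
    case "$1" in
        192.168.30.*) last="${1##*.}" ;;
        *) return 1 ;;
    esac
    case "$last" in
        ''|*[!0-9]*) return 1 ;;   # last octet must be numeric
    esac
    [ "$last" -ge 31 ] && [ "$last" -le 35 ]
}

# check_smartconnect NAME -> resolve NAME with the host resolver and check
# that the first answer is a pool address rather than the service IP.
check_smartconnect() {
    ip="$(getent hosts "$1" | awk '{print $1; exit}')"
    if [ -z "$ip" ]; then
        echo "no answer for $1"
        return 1
    fi
    if in_pool "$ip"; then
        echo "OK: $1 -> $ip"
    else
        echo "unexpected: $1 -> $ip"
        return 1
    fi
}
```

Calling check_smartconnect ml-lab.lab.local on the agent host should report one of the pool addresses. An answer of 192.168.30.30 would suggest the name was created as a plain A record rather than a delegation.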

mkdir -p /mnt/datasets   # the mount point must exist first
mount -t nfs ml-lab.lab.local:/ifs/data/cls-01/ml-lab/datasets /mnt/datasets

Verify with the df -h command. As a first step, we can see that our datasets folder is mounted. We'll make this persistent by configuring /etc/fstab (and adding the other folders), but for now we know our environment is functioning as expected.
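
As a rough sketch of what the persistent configuration might look like, the snippet below prints candidate /etc/fstab lines for review rather than writing them. Several things here are assumptions rather than part of the setup above: the mount options in NFS_OPTS, the /mnt mount root, that each folder has its own NFS export, and leaving out rke2-mlops on the basis that it is reserved for Kubernetes PVC storage.

```shell
#!/bin/sh
# Sketch: print candidate /etc/fstab entries for the ML folders.
# Nothing is written to /etc/fstab; review the output before appending it.
SC_ZONE="ml-lab.lab.local"             # SmartConnect zone name, never a node IP
EXPORT_ROOT="/ifs/data/cls-01/ml-lab"
MOUNT_ROOT="/mnt"                      # assumption: same root as the test mount
NFS_OPTS="defaults,_netdev"            # assumption: wait for the network at boot

fstab_lines() {
    # rke2-mlops is intentionally omitted (assumed to be handled by Kubernetes)
    for d in artifacts datasets logs models; do
        printf '%s:%s/%s %s/%s nfs %s 0 0\n' \
            "$SC_ZONE" "$EXPORT_ROOT" "$d" "$MOUNT_ROOT" "$d" "$NFS_OPTS"
    done
}

fstab_lines
```

Mounting by the SmartConnect name keeps the client independent of any single node address, in line with the DNS-first approach used throughout this setup.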


Next up: Part 3, Container Orchestration: "Orchestrating ML: Rancher and PowerScale Integration". It covers persistent volume claims for ML workloads, PowerScale CSI driver setup, storage class configuration, and dynamic volume provisioning for ClearML Server.
