Creating a Secure MLOps Environment with PowerScale and ClearML: A Practical Guide #4


Understanding the ClearML Agent: Your ML Pipeline Worker

So far, we have covered the initial setup of the MLOps environment by establishing the directory structure on PowerScale and configuring the ML-Lab access zone with SmartConnect. We then prepared PowerScale by creating a dedicated CSM admin user and group, enabling NFSv4 support, and configuring proper directory permissions.

We’ve set up the CSI Driver in Rancher by creating necessary namespaces and secrets, installing the Dell PowerScale CSI driver, and creating a storage class with specific PowerScale parameters. Finally, we deployed ClearML by creating a dedicated namespace and implementing its server components including MongoDB, Elasticsearch, File Server, and API Server/Web App, concluding with a successful verification of the ClearML server access.

In this blog, we will:

  • Set up the ClearML Agent – a crucial component that acts as your ML experiment worker
  • Configure the agent settings to work with our ClearML server
  • Establish direct connectivity between the agent and PowerScale storage

Remember, our goal is to interact directly with PowerScale storage while maintaining proper experiment tracking through the ClearML server. This approach offers better performance and simplified data management compared to traditional setups.


What is the ClearML Agent?

The ClearML ecosystem has two main components: the server (which we set up in our last blog) and the agent. While the server acts as the control center and experiment tracker, the agent is the workhorse that:

  1. Executes training jobs
  2. Manages data access
  3. Reports metrics and results
  4. Handles resource allocation

Think of the ClearML agent as your dedicated ML experiment worker. Like a research assistant in a laboratory, the agent executes experiments, manages resources, and reports results back to the main system (our ClearML server). It’s the component that actually runs your ML code, handles your data, and manages the training process.

Avoiding Duplicate Writes – PowerScale Integration

Here’s where our setup gets interesting. In a typical ClearML setup, the agent would use the ClearML file server for all data storage. However, we’re creating a more sophisticated setup where:

  1. The agent reads training data directly from PowerScale mounted directories
  2. Models are saved directly to PowerScale storage
  3. Only metadata, metrics, and experiment tracking info go to the ClearML server

Let’s say you’re training or fine-tuning a language model; your pseudocode might look something like the following.

# Your ML script (pseudocode – load_dataset and train_model stand in for your own data-loading and training functions)
from clearml import Task

# Initialize ClearML tracking
task = Task.init(project_name='nlp_project', task_name='train_v1')

# Data accessed directly from PowerScale
train_data = load_dataset('/mnt/clearml/datasets/nlp/training.parquet')
model = train_model(train_data)

# Model saved directly to PowerScale
model.save('/mnt/clearml/models/nlp/model_v1/')

# Only metrics and metadata go to the ClearML server
task.get_logger().report_scalar(title='accuracy', series='train', value=0.95, iteration=0)

In this example the agent:

  1. Reads the training data directly from PowerScale
  2. Executes the training process
  3. Saves the model back to PowerScale
  4. Reports metrics to ClearML server

By giving the agent direct access to PowerScale:

  1. Better Performance: Data doesn’t need to route through the ClearML server
  2. Enterprise Storage: Leverage PowerScale’s robust storage features
  3. Simplified Management: Direct access to your data infrastructure
  4. Reduced Complexity: ClearML server focused on its core strength – experiment tracking

OK, let’s configure our environment.

Setting Up ClearML Agent: A Direct Approach with PowerScale

1. Creating a Clean Python Environment

First, we need to create an isolated environment for our agent. This is like creating a clean workspace where we can install exactly what we need without affecting other Python applications on the system.

# Update our system's package list
sudo apt update

# Install Python virtual environment tools
sudo apt install python3-venv python3-pip -y

# Create a dedicated space for our ClearML agent
python3 -m venv ~/clearml-agent-env

# Step into this new environment
source ~/clearml-agent-env/bin/activate

When you run these commands, you’re essentially creating a fresh, isolated Python workspace. It’s like having a new, clean lab bench where you can work without worrying about contamination from other projects.
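Before moving on, it’s worth a quick check that the virtual environment is actually active (a small sanity check; the path assumes the environment created above):

# Confirm the Python interpreter now resolves inside the new environment
which python3
# Expected to print something like ~/clearml-agent-env/bin/python3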

2. Installing the ClearML Agent

Now that we have our clean environment, we can install the agent:

# Install the ClearML agent
pip install clearml-agent

# Verify it's installed correctly
clearml-agent --version

3. Connecting to PowerScale

Now we need to give our agent direct access to PowerScale storage. We’ll create mount points that align with our ML workflow:

# Create mount points for PowerScale
sudo mkdir -p /mnt/clearml/{datasets,models,artifacts,logs}

# Add PowerScale mounts to /etc/fstab
# (piping through tee so the append runs with root privileges; a plain "sudo echo ... >>" fails because the redirect happens in the unprivileged shell)
echo "ml-lab.lab.local:/ifs/data/cls-01/ml-lab/datasets /mnt/clearml/datasets nfs defaults 0 0" | sudo tee -a /etc/fstab
echo "ml-lab.lab.local:/ifs/data/cls-01/ml-lab/models /mnt/clearml/models nfs defaults 0 0" | sudo tee -a /etc/fstab
echo "ml-lab.lab.local:/ifs/data/cls-01/ml-lab/artifacts /mnt/clearml/artifacts nfs defaults 0 0" | sudo tee -a /etc/fstab
echo "ml-lab.lab.local:/ifs/data/cls-01/ml-lab/logs /mnt/clearml/logs nfs defaults 0 0" | sudo tee -a /etc/fstab

# Mount all
sudo mount -a

These mounts create direct pathways between our agent and PowerScale storage. Each mount point serves a specific purpose:

  • /datasets: For your training and validation data
  • /models: Where trained models are saved
  • /artifacts: For experiment outputs and results
  • /logs: For training logs and debugging information
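Before pointing the agent at these paths, a quick check that all four NFS mounts are live can save troubleshooting time later (a minimal sketch; the paths match the fstab entries above):

# Confirm the PowerScale exports are mounted
mount | grep /mnt/clearml

# Check capacity and confirm each mount point is backed by PowerScale
df -h /mnt/clearml/datasets /mnt/clearml/models /mnt/clearml/artifacts /mnt/clearml/logs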

4. Configuring the Agent

Now we’ll tell the agent how to work with our setup. Create a file at ~/clearml.conf:

sdk {
  # Tell ClearML where to store things
  storage {
    cache {
      # Use our PowerScale mount for artifacts
      default_base_dir: "/mnt/clearml/artifacts"
    }
    # Enable direct storage access
    direct_access: true
  }

  # Configure where to store metrics and plots
  metrics {
    plots_dir: "/mnt/clearml/artifacts/plots"
    file_upload_threads: 4
  }
}

This configuration tells the agent to:

  1. Use PowerScale directly for storage
  2. Store artifacts and plots in specific locations
  3. Optimize file uploads with multiple threads
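Because the artifact cache now lives on a PowerScale mount rather than local disk, a quick write test from the agent host helps catch permission problems early (a simple sketch using the cache path from the config above):

# Confirm the agent user can write to the artifact cache location
touch /mnt/clearml/artifacts/.write_test && rm /mnt/clearml/artifacts/.write_test && echo "artifacts mount is writable"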

5. Connecting to ClearML Server

We need four things to finalise the clearml-agent configuration:

  1. web_server endpoint
  2. api_server endpoint
  3. files_server endpoint
  4. Authentication credentials

We can get our endpoint URLs (NodePort endpoints for the moment) under Service Discovery -> Service. For our lab, the endpoints will be:

web_server: http://192.168.30.209:30080/
api_server: http://192.168.30.209:30008/
files_server: http://192.168.30.209:30081/
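Before wiring these into the agent configuration, a quick reachability check from the agent host is worthwhile (a minimal sketch; it only confirms the NodePorts respond, not that credentials are valid):

# Confirm each ClearML endpoint answers from the agent host
curl -sI http://192.168.30.209:30080/ | head -n 1   # web server
curl -sI http://192.168.30.209:30008/ | head -n 1   # API server
curl -sI http://192.168.30.209:30081/ | head -n 1   # file server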

Authentication Setup


These credentials are like your agent’s ID badge – they prove to the ClearML server that this agent is authorized to access your experiments and data. You get these from your ClearML server’s settings page or from clear.ml if you’re using their hosted service.

You should end up with something similar to:

api {
  web_server: http://192.168.30.216:30080/
  api_server: http://192.168.30.216:30008
  files_server: http://192.168.30.216:30081
  credentials {
    "access_key" = "LUN1K4JN2H4H49WX0HJNDN8PDODPL6"
    "secret_key" = "HCMdGVmwSsxN4-pu5OfRLK3_E7W5gz4FuSIIyZR-rFo0L4yzhJ2eL159X3Ba0N8Z3FE"
  }
}

Finally, we’ll connect our agent to the ClearML server:

# Initialize the ClearML agent configuration
clearml-agent init

When you run clearml-agent init, you’re essentially setting up the communication channels between your agent and the ClearML server components. Let’s break down each part of the configuration process:

root@ubuntu-client-ml-lab:~# clearml-agent init
CLEARML-AGENT setup process

Please create new clearml credentials through the settings page in your `clearml-server` web app,
or create a free account at https://app.clear.ml/settings/webapp-configuration

In the settings > workspace  page, press "Create new credentials", then press "Copy to clipboard".

Paste copied configuration here:


Enter user access key: LUN1K4JN2H4H49WX0HJNDN8PDODPL6
Enter user secret: HCMdGVmwSsxN4-pu5OfRLK3_E7W5gz4FuSIIyZR-rFo0L4yzhJ2eL159X3Ba0N8Z3FE
Detected credentials key="LUN1K4JN2H4H49WX0HJNDN8PDODPL6" secret="HCMd***"

Editing configuration file: /root/clearml.conf
Enter the url of the clearml-server's Web service, for example: http://localhost:8080 or https://app.clear.ml

WEB Host configured to: [https://app.clear.ml] http://192.168.30.216:30080/
API Host configured to: [http://192.168.30.216:30080/] http://192.168.30.216:30008
File Store Host configured to: http://192.168.30.216:30081

ClearML Hosts configuration:
Web App:
API: http://192.168.30.216:30008
File Store: http://192.168.30.216:30081

Verifying credentials ...
Credentials verified!
Default Output URI (used to automatically store models and artifacts): (N)one/ClearML (S)erver/(C)ustom [None] S

Default Output URI: http://192.168.30.216:30081
Enter git username for repository cloning (leave blank for SSH key authentication): []

Enter additional artifact repository (extra-index-url) to use when installing python packages (leave blank if not required):

New configuration stored in /root/clearml.conf
CLEARML-AGENT setup completed successfully.
root@ubuntu-client-ml-lab:~#

2. Server Configuration

WEB Host configured to: http://192.168.30.216:30080/
API Host configured to: http://192.168.30.216:30008
File Store Host configured to: http://192.168.30.216:30081

This section defines the three essential services your agent needs to interact with:

  • Web Host: Where you view your experiments in the browser
  • API Host: Where your agent sends and receives commands
  • File Store: Where experiment artifacts are stored

In our case, we’re using locally hosted services (192.168.30.216), with a different port for each service:

  • 30080 for web interface
  • 30008 for API
  • 30081 for file storage
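If you ever need to re-discover these NodePorts outside of the Rancher UI, kubectl can list them directly. A minimal sketch, assuming the ClearML components live in the namespace created for them in the previous post (shown here as clearml):

# List the ClearML services and their NodePort mappings
kubectl get svc -n clearml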

3. Output URI Configuration

Output URI (used to automatically store models and artifacts): (N)one/ClearML (S)erver/(C)ustom [None] S
Default Output URI: http://192.168.30.216:30081

This setting tells your agent where to store outputs by default. You’ve chosen the ClearML Server option (S), which means artifacts will be stored on your local ClearML file server.
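Choosing the ClearML Server here only sets the default. Because the agent also has the PowerScale mounts, an individual task can still send its model outputs straight to PowerScale by overriding output_uri at task creation. A minimal sketch (project name and path are illustrative):

from clearml import Task

# Override the default output destination for this task only,
# pointing model and artifact uploads at the PowerScale models mount
task = Task.init(
    project_name='nlp_project',
    task_name='train_v2',
    output_uri='/mnt/clearml/models'
)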

The Generated Configuration


When you run clearml-agent init, a configuration file is created (by default at /root/clearml.conf). Think of this file as your agent’s instruction manual for how to interact with the ClearML ecosystem. Here’s a quick breakdown of what’s inside:

  1. Credentials
    Your ClearML Access Key and Secret Key—like an agent’s ID badge—confirm you’re authorized to run experiments and store artifacts.
  2. Server Configuration
    • Web Host: The URL where you view and manage experiments in the browser.
    • API Host: The endpoint where the agent sends and receives commands.
    • File Store Host: The location for storing logs and artifacts.
  3. Output URI
    This setting tells the agent where to store model outputs by default (e.g., on the ClearML file server or a custom location).

What Comes Next?

Now that the ClearML Agent is configured, you can:

  1. Start the Agent Service

    clearml-agent daemon --queue default

    This command “listens” for jobs in the default queue and executes them automatically (see the sketch after this list for how to submit a job to that queue).
  2. Integrate PowerScale

    Edit the /root/clearml.conf (or your local .clearml.conf) to ensure your agent uses the PowerScale mount points

    sdk {
      storage {
        cache {
          default_base_dir: "/mnt/clearml/artifacts"
        }
        direct_access: true
      }
    }

  3. Test Your Setup
    Try a small test script to confirm everything’s working.

    from clearml import Task

    task = Task.init(project_name='test', task_name='first_run')
    task.mark_completed()

    If you see this task appear in your ClearML Web App, congratulations—you’ve successfully connected the agent, ClearML server, and PowerScale storage!
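Once the daemon is listening on the default queue, you can hand it work either from the Web App (clone an existing task and enqueue it) or from code. A minimal sketch of the code route, with illustrative project and task names:

from clearml import Task

# Create the task locally, then hand execution off to the agent
task = Task.init(project_name='test', task_name='remote_run')
task.execute_remotely(queue_name='default', exit_process=True)

# Anything below this point runs on the agent, not on your workstation
# train_model(...)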

Wrapping Up

With the agent up and running, you now have a powerful workhorse at your disposal—one that can pull data directly from PowerScale, train your models, save artifacts back to PowerScale, and seamlessly report experiment metrics to the ClearML server. By avoiding duplicate writes and letting the agent access your data infrastructure directly, you streamline storage management and gain better performance.

In our next installment, I’ll show you how to fine-tune your ClearML setup for more complex workflows and explore best practices for managing large datasets on PowerScale. Until then, enjoy your newly empowered MLOps environment, and happy experimenting!
