OneFS MetaDataIQ – An Intro

MetaDataIQ: Simplifying metadata management and drastically reducing query times for unstructured data

In today’s data-driven landscape, efficient management of unstructured data is a cornerstone for enterprise innovation. Dell PowerScale MetaDataIQ introduces features to tackle the challenges of metadata management, offering enhanced performance and deeper insights into data operations.

By way of intro (we’ll cover a lot more of the technical details and implementation in upcoming blogs and videos), in this blog we’ll take a look at how PowerScale MetaDataIQ will transform metadata handling in your environment and why it stands out in the unstructured data ecosystem.


Enhanced Metadata for Unstructured Data

PowerScale MetaDataIQ (I’m going to call it MDX from now on!) fundamentally changes how metadata is handled in unstructured environments.

Before we get into any details (or, more importantly, what possibilities this feature unlocks), let’s talk about tree walks. Traditionally, gathering metadata required a process known as a tree walk, which involves traversing directory structures to collect file information such as names, sizes, and timestamps. While effective, this approach is resource-intensive and scales poorly in environments with billions of files.
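To make the cost of a tree walk concrete, here is a minimal Python sketch of one. It is only an illustration of the general technique, not how any particular product does it; the function name tree_walk and the record fields are my own choices. The key point is that every single file costs at least one stat call, every time the walk runs.

```python
import os
import tempfile


def tree_walk(root):
    """Walk every directory under root and collect basic file metadata."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)  # one stat call per file: this is what scales poorly
            records.append(
                {"path": path, "name": name, "size": st.st_size, "mtime": st.st_mtime}
            )
    return records


if __name__ == "__main__":
    # Build a tiny throwaway tree so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "projects", "a"))
        for rel, data in [("readme.txt", b"hello"), ("projects/a/data.bin", b"0123456789")]:
            with open(os.path.join(root, *rel.split("/")), "wb") as fh:
                fh.write(data)
        for rec in sorted(tree_walk(root), key=lambda r: r["name"]):
            print(rec["name"], rec["size"])
```

The work done here grows with the total number of files, not with the number of files that actually changed, which is exactly the problem at billion-file scale.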

Third-party companies exist in this space to perform tree walks, collect metadata, and export it to external databases for offline queries, such as:

  • TreeSize Professional: Versatile tool supporting most NAS vendors through CIFS/SMB
  • Disk Savvy Enterprise: Good for mid-sized environments, works with common NAS platforms
  • Northern Storage Suite: Multi-vendor support focusing on enterprise NAS platforms

However, each of these tools requires periodic tree walks of the filesystem to stay up to date. As one could imagine, if you have a billion files, these solutions would have to touch a billion files every day.


Key Innovations

Avoiding Tree Walks with Pre-Indexed Metadata

PowerScale MDX eliminates the need for time-consuming tree walks by indexing metadata during file creation or modification.

How?

Firstly, MDX runs within OneFS and does not rely on manual indexing (which would require a scan / tree walk, as we already discussed).

With PowerScale MDX, we leverage an internal OneFS feature called changelist which, at a very high level, leverages the differences between snapshots.

  • A changelist is essentially a catalog of the attributes and objects which changed between two checkpoints. For example, a list of the files, dirs, etc which were added, removed, modified, etc, typically between two file system snapshots.
  • The SnapshotIQ framework is leveraged as the underlying mechanism for taking the required snapshots.
  • Historically, changelists were primarily utilized by SyncIQ as the foundation for differential replication. However, they are now used more widely, such as by FSAnalyze for InsightIQ, IndexUpdate for the SmartPools FilePolicy job, and now MetaDataIQ.
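As a purely conceptual illustration of what a changelist contains, the sketch below compares two toy "snapshot catalogs" and lists what was added, removed, or modified. This is not how OneFS builds changelists internally; the catalog layout (path mapped to a size and mtime tuple), the function name build_changelist, and the sample paths are all assumptions made for the example.

```python
def build_changelist(snap_old, snap_new):
    """Compare two snapshot catalogs and return the entries that differ.

    Each catalog maps a path to a (size, mtime) tuple.
    """
    changes = []
    for path in sorted(snap_new.keys() - snap_old.keys()):
        changes.append(("added", path))
    for path in sorted(snap_old.keys() - snap_new.keys()):
        changes.append(("removed", path))
    for path in sorted(snap_old.keys() & snap_new.keys()):
        if snap_old[path] != snap_new[path]:
            changes.append(("modified", path))
    return changes


# Two hypothetical checkpoints, S1 and S2.
s1 = {"/ifs/data/a.txt": (100, 1700000000), "/ifs/data/b.txt": (200, 1700000000)}
s2 = {"/ifs/data/a.txt": (150, 1700003600), "/ifs/data/c.txt": (300, 1700003600)}

for op, path in build_changelist(s1, s2):
    print(op, path)
```

The useful property is the output: a short list of deltas that downstream consumers can process instead of rescanning everything.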

As always, Nick Trimbee has a great write-up on changelists.


MetaDataIQ capitalizes on the OneFS ChangeListCreate job to track metadata changes, update indices, and ensure efficient cataloguing by:

  1. Snapshot and ChangeList Processing:
    • Create a new snapshot (S2).
    • Derive a changelist (s1_s2) capturing deltas between snapshots S1 and S2.
  2. Incremental Batch Updates:
    • Extract metadata changes in batches and update the Elasticsearch database.
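The incremental batch update step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library: it turns changelist entries into request bodies in the newline-delimited JSON format that the Elasticsearch bulk API accepts. The index name mdx-demo, the entry layout (op, path, metadata), and the use of the path as the document ID are assumptions for the example, not the actual MDX schema.

```python
import json


def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def to_bulk_body(changes, index_name):
    """Render changelist entries as an Elasticsearch _bulk request body (NDJSON)."""
    lines = []
    for change in changes:
        # Hypothetical choice: the file path doubles as the document ID.
        action_target = {"_index": index_name, "_id": change["path"]}
        if change["op"] == "removed":
            lines.append(json.dumps({"delete": action_target}))
        else:  # added or modified: (re)index the latest metadata
            lines.append(json.dumps({"index": action_target}))
            lines.append(json.dumps(change["metadata"]))
    return "\n".join(lines) + "\n"  # the bulk API requires a trailing newline


if __name__ == "__main__":
    demo_changes = [
        {"op": "added", "path": "/ifs/data/c.txt", "metadata": {"size_bytes": 300}},
        {"op": "removed", "path": "/ifs/data/b.txt"},
        {"op": "modified", "path": "/ifs/data/a.txt", "metadata": {"size_bytes": 150}},
    ]
    for batch in batched(demo_changes, 2):
        print(to_bulk_body(batch, "mdx-demo"), end="")
```

Each body would then be sent to the cluster's _bulk endpoint. Batching keeps the number of round trips proportional to the number of changes rather than the number of files.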

The changelist data is stored in the Elasticsearch database, so any queries you run go against Elasticsearch and do not affect the PowerScale cluster.

This pre-indexed metadata is stored in Elasticsearch or a similar database, allowing:

  • Direct Queries: Users can search metadata instantly without traversing directory hierarchies.
  • Resource Efficiency: System resources are conserved, enabling faster and more scalable operations.
  • Advanced Metadata Insights: Details like file tags, object labels, and custom attributes are indexed upfront.
  • AI/ML Integration: Metadata is enriched through AI pipelines that automatically tag and classify files, such as applying image recognition to generate labels or extracting document keywords.
  • Background Processing: These enhancements occur outside real-time workflows, ensuring seamless query performance.
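To give a feel for what a direct query might look like, here is a small sketch that builds an Elasticsearch query body for "large files under a given path that have not been modified in a year". The query structure (bool, filter, prefix, range) is standard Elasticsearch query DSL, but the field names path, size_bytes, and mtime are hypothetical, since the MDX index schema is not covered in this post.

```python
import json


def large_stale_files_query(min_size_bytes, older_than, path_prefix, max_hits=100):
    """Build an Elasticsearch search body; the field names are hypothetical."""
    return {
        "size": max_hits,  # maximum number of hits to return
        "query": {
            "bool": {
                "filter": [
                    {"prefix": {"path": path_prefix}},
                    {"range": {"size_bytes": {"gte": min_size_bytes}}},
                    {"range": {"mtime": {"lt": older_than}}},
                ]
            }
        },
        "sort": [{"size_bytes": {"order": "desc"}}],
    }


if __name__ == "__main__":
    # Files of 1 GiB or more under /ifs/data/ that were last modified over a year ago.
    body = large_stale_files_query(1024 ** 3, "now-1y", "/ifs/data/")
    print(json.dumps(body, indent=2))
```

A question like this is answered from the index alone, with no directory traversal on the cluster.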

Let’s take a look at some possible use cases.


Use Case 1: Faster Indexing and Search

For enterprises managing massive unstructured datasets, PowerScale MDX offers a paradigm shift:

  • Accelerated Data Access: Indexing during file operations eliminates delays associated with metadata retrieval.
  • Improved Operational Efficiency: Enterprises can quickly access relevant data without compromising system performance.

Use Case 2: Custom Document Loaders with MDX

Custom document loaders allow you to fully leverage the flexibility of LangChain and LlamaIndex, tailoring them to the unique requirements of your data sources and formats.

Dell’s forthcoming document loader will leverage MDX’s metadata capabilities to greatly enhance and speed up data ingestion, data re-loading, and so on (as always, a separate, more detailed blog is coming on this!).

  • Intelligent Chunking: Data read from PowerScale can be pre-processed into metadata-enriched chunks for upstream workflows.
  • RAG Workloads: This feature directly supports Retrieval-Augmented Generation (RAG) frameworks by enabling efficient indexing, retrieval, and utilization of unstructured data.
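As a rough idea of what "metadata-enriched chunks" could mean in practice, the sketch below splits a document into overlapping chunks and attaches the file's metadata to each one. To be clear, this is not Dell's forthcoming document loader and does not use LangChain or LlamaIndex; the function name chunk_with_metadata, the chunk sizes, and the metadata fields are all assumptions for illustration.

```python
def chunk_with_metadata(text, file_metadata, chunk_size=200, overlap=50):
    """Split text into overlapping chunks, each carrying the file's metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(
            {
                "text": text[start:start + chunk_size],
                "metadata": {**file_metadata, "chunk_index": index, "start_offset": start},
            }
        )
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    # Hypothetical metadata, as it might come back from a metadata index.
    meta = {"path": "/ifs/data/report.txt", "owner": "alice", "mtime": "2024-01-01T00:00:00Z"}
    document = "PowerScale stores unstructured data at scale. " * 10
    for chunk in chunk_with_metadata(document, meta, chunk_size=120, overlap=20):
        print(chunk["metadata"]["chunk_index"], chunk["metadata"]["start_offset"], len(chunk["text"]))
```

Carrying the metadata on every chunk means a retrieval step can filter by path, owner, or modification time before or alongside the similarity search.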

Simplified Customer Experience

PowerScale MDX aims to streamline metadata management with features that work out-of-the-box. Customers benefit from:

  • Seamless Integration: Elasticsearch-based indexing and AI pipelines require no complex setup.
  • Customizable Workflows: Advanced users can build scripts and pipelines to further analyze and utilize metadata.
  • Future-Proofing: MDX’s AI-driven enrichment capabilities ensure enterprises stay ahead in data management innovation.

With PowerScale MDX, Dell positions itself as a leader in metadata-driven unstructured data management. Key differentiators include:

  • Proactive Metadata Management: Features like AI enrichment and pre-indexing set MDX apart from traditional storage solutions.
  • Customer-Centric Design: MDX simplifies complex tasks, delivering capabilities typically requiring custom implementations as built-in features.
  • Support for Emerging Workloads: The integration of MDX with RAG frameworks and the ability to process metadata-rich document chunks highlight its adaptability.

PowerScale MDX is more than a metadata management tool; it’s a technology that empowers enterprises to unlock the full potential of their unstructured data. With innovations like pre-indexed metadata, AI enrichment, and federated query support, PowerScale MDX allows for efficient, scalable, and intelligent metadata management. Whether accelerating data access or enabling advanced workflows like document chunking for RAG, PowerScale MDX ensures your data works harder, faster, and smarter for your business.

Stay tuned, more to come...
