data-versioning-patterns

Community

Reproducible data management patterns.

Author: HermeticOrmus
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill covers managing and versioning datasets in machine learning projects, so that training runs can be reproduced and the data behind them can be audited.

Core Features & Use Cases

  • Version Control for Data: Integrates with tools like DVC to track datasets alongside code.
  • Reproducible Pipelines: Defines ML workflows in dvc.yaml for end-to-end reproducibility.
  • Experiment Management: Facilitates hyperparameter tuning and experiment tracking without cluttering Git history.
  • Time Travel for Datasets: Leverages Delta Lake for accessing historical data versions.
  • Lineage Tracking: Supports OpenLineage for emitting metadata about data processing jobs.
  • Use Case: Ensure that a specific model version can be retrained with the exact same data it was originally trained on, or audit the data used for a particular prediction.
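
The use case above comes down to recording exactly which bytes a model was trained on and checking later that they have not changed. The following is a minimal conceptual sketch in plain Python. It is not how DVC or Delta Lake work internally, and the names `file_digest`, `record_version`, `verify_version`, and the JSON manifest layout are hypothetical, chosen only for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_version(data_path, manifest_path, model_version):
    """Store the dataset digest under a model version in a JSON manifest."""
    manifest_file = Path(manifest_path)
    # Start from the existing manifest if there is one (hypothetical layout).
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    manifest[model_version] = {
        "path": str(data_path),
        "sha256": file_digest(data_path),
    }
    manifest_file.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest[model_version]


def verify_version(manifest_path, model_version):
    """Return True if the dataset on disk still matches the recorded digest."""
    entry = json.loads(Path(manifest_path).read_text())[model_version]
    return file_digest(entry["path"]) == entry["sha256"]
```

Calling `record_version("raw_data.csv", "manifest.json", "model-v1")` at training time and `verify_version("manifest.json", "model-v1")` before a retrain would flag any drift in the file. Dedicated tools add what this sketch leaves out, such as storing the old bytes so they can be restored.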

Quick Start

Initialize DVC in your Git repository and track a dataset named 'raw_data.csv' by running dvc init followed by dvc add raw_data.csv.
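
The two commands can also be scripted. The sketch below is a hedged illustration that builds the `dvc init` and `dvc add` invocations from the Quick Start and runs them only when a `dvc` executable is found on `PATH`. The helper names `quick_start_commands` and `run_quick_start` are hypothetical, and the sketch assumes the working directory is already a Git repository.

```python
import shutil
import subprocess


def quick_start_commands(dataset="raw_data.csv"):
    """Return the DVC commands from the Quick Start as argument lists."""
    return [["dvc", "init"], ["dvc", "add", dataset]]


def run_quick_start(dataset="raw_data.csv", cwd=None):
    """Run the commands in order; return False if dvc is not installed."""
    if shutil.which("dvc") is None:
        return False
    for command in quick_start_commands(dataset):
        # check=True raises CalledProcessError if a command fails.
        subprocess.run(command, cwd=cwd, check=True)
    return True
```

`dvc add` writes a small `raw_data.csv.dvc` pointer file next to the dataset. That pointer file, not the dataset itself, is what gets committed to Git.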

Dependency Matrix

Required Modules

None required

Components

references

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: data-versioning-patterns
Download link: https://github.com/HermeticOrmus/LibreMLOps-Claude-Code/archive/main.zip#data-versioning-patterns

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.