databricks-synthetic-data-generation

Community

Create synthetic Databricks data with Faker.

Author: RamVegiraju
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

Generates realistic synthetic data for Databricks workloads, for use in pipeline tests, demos, and data-science experiments.

Core Features & Use Cases

  • Uses Faker and Spark to produce configurable synthetic datasets with realistic distributions and referential integrity suitable for testing ETL and analytics pipelines.
  • Writes output to Databricks volumes in Parquet format for downstream Lakehouse architectures and SDP pipelines.
  • Provides a repeatable workflow: write the generator locally as scripts/generate_data.py, run it on Databricks via MCP tooling, and reuse cluster and context IDs for faster iterations.
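
To make "realistic distributions and referential integrity" concrete, here is a minimal sketch using only the Python standard library. It is not the skill's actual code: the function names, the 1/rank weighting, and the log-normal order amounts are illustrative assumptions. The point it shows is that child rows draw their foreign keys from the parent table's key list, so no order can reference a missing customer.

```python
import random


def make_customers(n):
    # Parent table: one row per customer, keyed by customer_id.
    return [
        {"customer_id": i, "segment": "enterprise" if i % 10 == 0 else "consumer"}
        for i in range(1, n + 1)
    ]


def make_orders(customers, n, rng):
    # Foreign keys are sampled from the existing customer IDs only,
    # which is what keeps the two tables referentially consistent.
    ids = [c["customer_id"] for c in customers]
    # Assumed skew (hypothetical): weight 1/rank, so a few customers
    # account for a large share of the orders.
    weights = [1.0 / rank for rank in range(1, len(ids) + 1)]
    chosen = rng.choices(ids, weights=weights, k=n)
    return [
        {
            "order_id": i,
            "customer_id": cid,
            # Assumed log-normal amounts: many small orders, few large ones.
            "amount": round(rng.lognormvariate(3.5, 0.8), 2),
        }
        for i, cid in enumerate(chosen, start=1)
    ]


rng = random.Random(42)  # fixed seed so reruns produce the same data
customers = make_customers(100)
orders = make_orders(customers, 1000, rng)

# Integrity check: order keys that do not exist in the customer table.
orphans = {o["customer_id"] for o in orders} - {c["customer_id"] for c in customers}
print(f"{len(orders)} orders, {len(orphans)} orphaned foreign keys")
```

Because every customer_id in orders is sampled from the customer list, the orphan set is empty by construction. In the skill itself, Faker would supply the realistic field values (names, emails, dates) on top of this structure.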

Quick Start

Write a local script at scripts/generate_data.py that generates customers, orders, and tickets, then run it on Databricks using the MCP tool.
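
One possible layout for scripts/generate_data.py is sketched below. This is an assumption about how such a script could look, not the skill's shipped code: the function names, column names, row counts, and the volume_path argument are hypothetical. Faker and PySpark are imported inside main() so the row builders can be exercised locally without a cluster. Running the script through the MCP tool is not shown.

```python
import random
from datetime import date, timedelta


def build_customers(fake, n):
    # `fake` is expected to be a faker.Faker instance.
    return [(i, fake.name(), fake.email()) for i in range(1, n + 1)]


def build_orders(customers, n, rng):
    # Each order references an existing customer_id.
    ids = [c[0] for c in customers]
    start = date(2024, 1, 1)  # assumed date window, one year
    return [
        (i, rng.choice(ids), start + timedelta(days=rng.randrange(365)),
         round(rng.uniform(5.0, 500.0), 2))
        for i in range(1, n + 1)
    ]


def build_tickets(orders, rng, rate=0.2):
    # A fraction of orders get a support ticket; keys are copied from the order.
    picked = [o for o in orders if rng.random() < rate]
    return [
        (i, o[0], o[1], rng.choice(["open", "closed"]))
        for i, o in enumerate(picked, start=1)
    ]


def main(volume_path):
    # volume_path is hypothetical, e.g. /Volumes/<catalog>/<schema>/<volume>/raw
    from faker import Faker
    from pyspark.sql import SparkSession

    Faker.seed(42)  # seed Faker and random for repeatable output
    fake = Faker()
    rng = random.Random(42)
    spark = SparkSession.builder.getOrCreate()

    customers = build_customers(fake, 1000)
    orders = build_orders(customers, 5000, rng)
    tickets = build_tickets(orders, rng)

    # Explicit schemas avoid type inference problems on small or empty tables.
    tables = {
        "customers": (customers, "customer_id INT, name STRING, email STRING"),
        "orders": (orders, "order_id INT, customer_id INT, order_date DATE, amount DOUBLE"),
        "tickets": (tickets, "ticket_id INT, order_id INT, customer_id INT, status STRING"),
    }
    for name, (rows, schema) in tables.items():
        df = spark.createDataFrame(rows, schema=schema)
        df.write.mode("overwrite").parquet(f"{volume_path}/{name}")
```

Calling main() with a volume path at the end of the script writes one Parquet directory per table. Keeping the builders free of Spark and Faker imports means they can be checked locally before any cluster time is spent.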

Dependency Matrix

Required Modules

faker
holidays

Components

scripts

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: databricks-synthetic-data-generation
Download link: https://github.com/RamVegiraju/databricks-samples/archive/main.zip#databricks-synthetic-data-generation

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
