databricks-synthetic-data-generation
Community
Create synthetic Databricks data with Faker.
Author: RamVegiraju
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Generates realistic synthetic data for Databricks workloads, for use in pipeline testing, demos, and data-science experiments.
Core Features & Use Cases
- Uses Faker and Spark to produce configurable synthetic datasets with realistic distributions and referential integrity suitable for testing ETL and analytics pipelines.
- Writes output to Databricks volumes in Parquet format for downstream Lakehouse architectures and SDP pipelines.
- Provides a repeatable workflow: write the generator locally as scripts/generate_data.py, run it on Databricks via MCP tooling, and reuse cluster and context IDs for faster iterations.
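As a rough illustration of the referential-integrity point above, the sketch below builds customers first and then draws every order's customer_id from the generated customers. The column names, tier weights, and lognormal amount distribution are assumptions made for this example and are not taken from the skill itself. The fallback to placeholder names when Faker is not installed is also only there to keep the sketch runnable.

```python
import random

try:
    from faker import Faker  # listed under Required Modules
except ImportError:  # assumption: fall back to placeholder names without Faker
    Faker = None


def generate_customers(n, seed=42):
    """Return n customer rows (dicts) with unique integer IDs."""
    rng = random.Random(seed)
    fake = None
    if Faker is not None:
        fake = Faker()
        fake.seed_instance(seed)  # seed Faker so names are repeatable
    rows = []
    for customer_id in range(1, n + 1):
        rows.append({
            "customer_id": customer_id,
            "name": fake.name() if fake else f"Customer {customer_id}",
            # Skewed tier mix (hypothetical weights) instead of a uniform draw.
            "tier": rng.choices(
                ["free", "pro", "enterprise"], weights=[70, 25, 5]
            )[0],
        })
    return rows


def generate_orders(customers, n, seed=43):
    """Return n order rows whose customer_id always refers to a real customer."""
    rng = random.Random(seed)
    customer_ids = [c["customer_id"] for c in customers]
    return [
        {
            "order_id": order_id,
            # Foreign key is sampled only from existing customer IDs.
            "customer_id": rng.choice(customer_ids),
            # Lognormal amounts: many small orders, a few large ones.
            "amount": round(rng.lognormvariate(3.5, 0.8), 2),
        }
        for order_id in range(1, n + 1)
    ]
```

Calling generate_customers(20) and then generate_orders(customers, 100) yields 100 orders that all point at one of the 20 customers. A tickets table could follow the same pattern by sampling from the generated orders.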
Quick Start
Write a local script at scripts/generate_data.py that generates customers, orders, and tickets, then run it on Databricks using the MCP tool.
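One possible shape for the write step in such a script is sketched below. It assumes a Unity Catalog volume addressed as /Volumes/<catalog>/<schema>/<volume>/..., and the catalog, schema, and volume names in the usage comment are hypothetical placeholders. The `spark` argument is assumed to be the SparkSession that a Databricks run provides. How the script is submitted through the MCP tool is not shown here.

```python
def volume_path(catalog, schema, volume, table):
    """Build the folder path for one table inside a Unity Catalog volume."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{table}"


def write_table(spark, rows, path):
    """Write a list of row dicts to `path` as Parquet, replacing earlier output.

    `spark` is assumed to be an active SparkSession (available on Databricks).
    """
    df = spark.createDataFrame(rows)
    df.write.mode("overwrite").parquet(path)


# Example usage on a cluster (names are hypothetical):
# write_table(spark, customers, volume_path("main", "demo", "raw", "customers"))
# write_table(spark, orders, volume_path("main", "demo", "raw", "orders"))
```

Overwrite mode keeps repeated runs idempotent, which suits the iterate-and-rerun workflow described above.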
Dependency Matrix
Required Modules
faker, holidays
Components
scripts
💻 Claude Code Installation
Recommended: Let Claude install automatically. Copy and paste the text below into Claude Code.
Please help me install this Skill:
Name: databricks-synthetic-data-generation
Download link: https://github.com/RamVegiraju/databricks-samples/archive/main.zip#databricks-synthetic-data-generation
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.