count-dataset-tokens

Community

Count tokens in datasets accurately.

Author: Zurybr
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

This Skill explains how to count tokens in a dataset accurately, particularly when the count must use a specific tokenizer or cover only a filtered subset of the data.

Core Features & Use Cases

  • Token Counting: Accurately count tokens in HuggingFace or similar datasets.
  • Data Filtering: Filter datasets by domain, category, or other specific fields.
  • Tokenizer Application: Use specified tokenizers (e.g., Qwen, DeepSeek, GPT) for precise counting.
  • Use Case: You need to determine the total token count for all 'technology' related articles in a large text dataset using the 'gpt2' tokenizer.
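
The filter-then-count flow in the use case above could look roughly like the sketch below. It is an illustration under stated assumptions, not the Skill's bundled script: the field names `text` and `category` and the helper names `count_tokens` and `gpt2_encoder` are hypothetical, and records are assumed to be dict-like rows such as those yielded by a HuggingFace dataset.

```python
from typing import Any, Callable, Iterable, Mapping, Optional, Sequence


def count_tokens(
    records: Iterable[Mapping[str, Any]],
    encode: Callable[[str], Sequence[Any]],
    text_field: str = "text",            # assumed column name
    filter_field: Optional[str] = None,  # e.g. "category" (assumed column name)
    filter_value: Any = None,
) -> int:
    """Sum the token counts of every record that passes the optional filter."""
    total = 0
    for record in records:
        # Skip rows outside the requested subset.
        if filter_field is not None and record.get(filter_field) != filter_value:
            continue
        # Treat a missing or None text as an empty string.
        text = record.get(text_field) or ""
        total += len(encode(text))
    return total


def gpt2_encoder() -> Callable[[str], Sequence[int]]:
    """Build an encoder from the 'gpt2' tokenizer (downloads it on first use)."""
    from transformers import AutoTokenizer  # imported lazily; needs `transformers`

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    # add_special_tokens=False counts only tokens produced from the text itself.
    return lambda text: tokenizer.encode(text, add_special_tokens=False)
```

A call such as `count_tokens(rows, gpt2_encoder(), filter_field="category", filter_value="technology")` would then return the total for the filtered subset. Passing the encoder in as a callable keeps the counting logic independent of any one tokenizer, so swapping in a Qwen or DeepSeek tokenizer only changes the encoder.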

Quick Start

Use the count-dataset-tokens skill to count tokens in the 'wikipedia' dataset, filtering for the 'science' domain using the 'bert-base-uncased' tokenizer.
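
For a sense of what such a request might translate to, here is a minimal sketch that streams a dataset and tokenizes it in batches. Several things are assumptions rather than facts from this page: the `domain` and `text` column names, the helper names `count_tokens_batched` and `run_quick_start`, and the dataset name and configuration, which are left as parameters. Check the dataset's actual columns before relying on a `domain` filter.

```python
from typing import Any, Iterable, List, Mapping, Optional


def _count_batch(tokenizer: Any, texts: List[str]) -> int:
    """Tokenize one batch of texts and return its total token count."""
    if not texts:
        return 0
    # Calling a HuggingFace tokenizer on a list returns one id list per text.
    encoded = tokenizer(texts, add_special_tokens=False)
    return sum(len(ids) for ids in encoded["input_ids"])


def count_tokens_batched(
    records: Iterable[Mapping[str, Any]],
    tokenizer: Any,
    text_field: str = "text",       # assumed column name
    filter_field: str = "domain",   # assumed column name
    filter_value: Any = "science",
    batch_size: int = 1000,
) -> int:
    """Count tokens over the filtered records, tokenizing batch_size texts at a time."""
    total = 0
    batch: List[str] = []
    for record in records:
        if record.get(filter_field) != filter_value:
            continue
        batch.append(record[text_field])
        if len(batch) >= batch_size:
            total += _count_batch(tokenizer, batch)
            batch = []
    # Count whatever is left in the final, partially filled batch.
    return total + _count_batch(tokenizer, batch)


def run_quick_start(dataset_name: str, config_name: Optional[str] = None,
                    split: str = "train") -> int:
    """Wire the counter to HuggingFace libraries (requires network access)."""
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # streaming=True iterates over rows without downloading the full dataset first.
    records = load_dataset(dataset_name, config_name, split=split, streaming=True)
    return count_tokens_batched(records, tokenizer)
```

With `add_special_tokens=False`, markers such as `[CLS]` and `[SEP]` are left out of the count; whether they should be included depends on what the count is for, so treat that setting as a choice to confirm rather than a default.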

Dependency Matrix

Required Modules

transformers, datasets

Components

scripts, references

💻 Claude Code Installation

Recommended: Let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: count-dataset-tokens
Download link: https://github.com/Zurybr/lefarma-skills/archive/main.zip#count-dataset-tokens

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.