count-dataset-tokens
Community
Count tokens in datasets accurately.
Author: Zurybr
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
This Skill gives guidance on counting tokens in datasets accurately, particularly when the count must use a specific tokenizer or cover only a filtered subset of the data.
Core Features & Use Cases
- Token Counting: Accurately count tokens in HuggingFace or similar datasets.
- Data Filtering: Filter datasets by domain, category, or other specific fields.
- Tokenizer Application: Use specified tokenizers (e.g., Qwen, DeepSeek, GPT) for precise counting.
- Use Case: You need to determine the total token count for all 'technology' related articles in a large text dataset using the 'gpt2' tokenizer.
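The filter-then-count workflow described above can be sketched roughly as follows. This is a minimal illustration, not the Skill's own script: the function names `count_tokens` and `whitespace_encode`, the `text` and `category` field names, and the sample rows are all hypothetical. The whitespace tokenizer is only a stand-in so the example runs offline; real counts require the actual model tokenizer.

```python
from typing import Callable, Iterable, Mapping, Optional, Sequence


def count_tokens(
    records: Iterable[Mapping[str, object]],
    encode: Callable[[str], Sequence[int]],
    text_field: str = "text",            # assumed field name
    filter_field: Optional[str] = None,  # e.g. "category" or "domain"
    filter_value: Optional[object] = None,
) -> int:
    """Sum token counts over records, keeping only rows that match the filter."""
    total = 0
    for record in records:
        # Skip rows that do not match the requested filter, if one was given.
        if filter_field is not None and record.get(filter_field) != filter_value:
            continue
        text = record.get(text_field)
        # Ignore missing, non-string, or empty text instead of failing.
        if not isinstance(text, str) or not text:
            continue
        total += len(encode(text))
    return total


def whitespace_encode(text: str) -> list:
    """Hypothetical stand-in tokenizer: one dummy id per whitespace-separated word."""
    return list(range(len(text.split())))


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    rows = [
        {"text": "new chip announced today", "category": "technology"},
        {"text": "local team wins", "category": "sports"},
    ]
    print(count_tokens(rows, whitespace_encode,
                       filter_field="category", filter_value="technology"))
```

With the Hugging Face libraries, one option is to load a tokenizer once with `AutoTokenizer.from_pretrained("gpt2")` and pass `lambda t: tok.encode(t, add_special_tokens=False)` as `encode`, with rows coming from a dataset loaded through `datasets.load_dataset`. Whether special tokens should be included in the count depends on what the total is meant to measure, so that choice is best stated explicitly.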
Quick Start
Use the count-dataset-tokens skill to count tokens in the 'wikipedia' dataset, filtering for the 'science' domain using the 'bert-base-uncased' tokenizer.
Dependency Matrix
Required Modules
transformers, datasets
Components
scripts, references
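Before running a count, it can help to confirm that the required modules are importable. The check below is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical and not part of the Skill.

```python
import importlib.util

# Module names taken from the Required Modules list.
REQUIRED_MODULES = ("transformers", "datasets")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any module reported as missing would typically be installed with the environment's package manager before the Skill is used.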
💻 Claude Code Installation
Recommended: Let Claude install automatically. Copy the text below and paste it into Claude Code.
Please help me install this Skill:
Name: count-dataset-tokens
Download link: https://github.com/Zurybr/lefarma-skills/archive/main.zip#count-dataset-tokens
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.