cjk-aware-text-metrics
Community
Accurate token estimates for multilingual LLMs.
Software Engineering #multilingual #cost-estimation #token-estimation #cjk #text-metrics #llm-pipelines #unicode-detection
Author: shimo4228
Version: 1.0.0
Installs: 0
System Documentation
What problem does it solve?
Fixed chars-per-token constants fail for Japanese, Chinese, and Korean text. They underestimate token counts, so downstream cost estimates come out too low and rate limits are hit sooner than planned.
Core Features & Use Cases
- Detect CJK characters using Unicode ranges and compute weighted token counts that reflect multilingual content.
- Apply to multilingual LLM preprocessing, chunking, and cost estimation for mixed-language documents.
- Real-world use: process a Japanese document with mixed Latin text to produce accurate token counts for pricing and chunking.
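The detection step in the first bullet can be sketched as follows. This is a minimal illustration, not the skill's actual code: the range table, `is_cjk`, and `cjk_ratio` are hypothetical names, and the set of Unicode blocks is an assumption that covers the common Japanese, Chinese, and Korean scripts only.

```python
# Assumed set of Unicode blocks for illustration; the skill may use a different set.
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk(ch: str) -> bool:
    """Return True if the single character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK (0.0 for empty text)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_cjk(c) for c in chars) / len(chars)
```

A ratio like this can help decide whether a document is CJK-heavy enough for a fixed chars-per-token constant to be unreliable.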
Quick Start
Estimate token counts accurately for multilingual text by weighting CJK and Latin characters.
Dependency Matrix
Required Modules
None required
Components
Standard package
💻 Claude Code Installation
Recommended: Let Claude install automatically. Simply copy and paste the text below to Claude Code.
Please help me install this Skill:
Name: cjk-aware-text-metrics
Download link: https://github.com/shimo4228/claude-code-learned-skills/archive/main.zip#cjk-aware-text-metrics
Please download this .zip file, extract it, and install it in the .claude/skills/ directory.