cjk-aware-text-metrics

Community

Accurate token estimates for multilingual LLMs.

Author: shimo4228
Version: 1.0.0
Installs: 0

System Documentation

What problem does it solve?

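Fixed chars-per-token constants fail for Japanese, Chinese, and Korean text. Such constants are usually calibrated on English, where one token spans several characters, whereas a single CJK character often maps to one or more tokens. The result is an underestimated token count, which skews downstream cost estimates and can lead to unexpected rate-limit hits.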

Core Features & Use Cases

  • Detect CJK characters by Unicode range and compute a weighted token count that reflects the mix of scripts in the text.
  • Use the estimate in multilingual LLM preprocessing, chunking, and cost estimation for mixed-language documents.
  • Real-world use: process a Japanese document that contains Latin text and get token counts accurate enough for pricing and chunking.
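
The detection and weighting described above can be sketched roughly as follows. This is a minimal illustration and not the skill's actual source: the names `CJK_RANGES`, `is_cjk`, and `estimate_tokens` are hypothetical, the list of Unicode blocks is one reasonable selection, and the two weights are assumed placeholders that should be calibrated against the tokenizer you actually use.

```python
import math

# Unicode blocks treated as CJK in this sketch (an assumed selection,
# not necessarily the ranges the skill itself checks).
CJK_RANGES = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)

# Assumed weights: placeholders only, calibrate them for your tokenizer.
TOKENS_PER_CJK_CHAR = 1.5
TOKENS_PER_OTHER_CHAR = 0.25  # roughly four characters per token


def is_cjk(ch: str) -> bool:
    """Return True if the character's code point falls in one of CJK_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def estimate_tokens(text: str) -> int:
    """Weighted estimate: CJK characters and all other characters count differently."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return math.ceil(cjk * TOKENS_PER_CJK_CHAR + other * TOKENS_PER_OTHER_CHAR)
```

With these placeholder weights, a three-character Japanese string is estimated at five tokens, whereas a fixed divisor of four characters per token would round it down to zero or one.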

Quick Start

Estimate token counts accurately for multilingual text by weighting CJK and Latin characters.
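
One way to put such an estimate to work is budget-based chunking. The sketch below is an assumption-laden illustration rather than the skill's own API: `estimate_tokens` and `chunk_by_budget` are hypothetical names, the regular expression covers only kana, CJK ideographs, and Hangul syllables, and the default weights are uncalibrated placeholders.

```python
import math
import re
from typing import List

# Assumed character class: kana, CJK ideographs (incl. Extension A), Hangul syllables.
_CJK = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def estimate_tokens(text: str, cjk_weight: float = 1.5, other_weight: float = 0.25) -> int:
    """Weighted token estimate; both weights are placeholder assumptions."""
    cjk = len(_CJK.findall(text))
    return math.ceil(cjk * cjk_weight + (len(text) - cjk) * other_weight)


def chunk_by_budget(paragraphs: List[str], max_tokens: int) -> List[str]:
    """Greedily pack paragraphs into chunks whose estimate stays within max_tokens.

    A single paragraph that exceeds the budget on its own is emitted as an
    oversized chunk rather than being split.
    """
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n{para}" if current else para
        if current and estimate_tokens(candidate) > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Packing greedily on the weighted estimate means CJK-heavy paragraphs fill a chunk much faster than Latin ones of the same character length, which is the behaviour a fixed chars-per-token constant misses.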

Dependency Matrix

Required Modules

None required

Components

Standard package

💻 Claude Code Installation

Recommended: let Claude install it automatically. Copy the text below and paste it into Claude Code.

Please help me install this Skill:
Name: cjk-aware-text-metrics
Download link: https://github.com/shimo4228/claude-code-learned-skills/archive/main.zip#cjk-aware-text-metrics

Please download this .zip file, extract it, and install it in the .claude/skills/ directory.
