TokenCounter

Local token counting tools for estimating gptgirlfriend.online character sheet token usage.

This project exists so you can manually check character sheet text before pasting it into the site. It is not affiliated with gptgirlfriend.online; it is an unofficial estimator calibrated from sample text/count pairs collected from the site.

✨ Current Best Match

The current best matching tokenizer is:

microsoft/Phi-3-mini-4k-instruct
offset: 0

Against the current sample set, this tokenizer matches all recorded site token counts exactly:

16 / 16 exact matches
max error: 0
mean error: 0.00

If the site changes its tokenizer or counting rules, rerun the comparison tooling and update the executable/tokenizer choice.
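
The match figures above can be reproduced with a few lines of Python. This is a minimal sketch, not the project's comparison code: `match_stats` is a hypothetical helper, and it assumes that the error is the absolute difference per sample and that the offset is added to the tokenizer count before comparing.

```python
from statistics import mean


def match_stats(predicted, actual, offset=0):
    """Summarise how closely tokenizer counts match recorded site counts.

    Assumption: error = |tokenizer count + offset - site count| per sample.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    errors = [abs(p + offset - a) for p, a in zip(predicted, actual)]
    return {
        "exact_matches": sum(e == 0 for e in errors),
        "total": len(errors),
        "max_error": max(errors),
        "mean_error": mean(errors),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(match_stats([381, 100], [381, 102]))
```

A tokenizer that matches every sample reports `exact_matches == total`, a max error of 0, and a mean error of 0.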

🚀 Quick Use: Standalone Executable

Build the Rust executable:

cargo build --release --manifest-path rust/tokencount-file/Cargo.toml

Count tokens for a file:

./rust/tokencount-file/target/release/tc path/to/character-sheet.txt

Equivalent explicit command:

./rust/tokencount-file/target/release/tc count path/to/character-sheet.txt

Output is a single integer, the token count. Example:

./rust/tokencount-file/target/release/tc samples/sample_001.txt

Suggest token-saving WordNet synonym replacements without changing the file:

./rust/tokencount-file/target/release/tc compress path/to/character-sheet.txt

Apply the best token-saving replacements and write <name>_compressed<extension>:

./rust/tokencount-file/target/release/tc yolo path/to/character-sheet.txt
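
As a rough illustration of the <name>_compressed<extension> naming, the output path could be derived as below. This is a sketch in Python rather than the executable's Rust code; `compressed_path` is a hypothetical name, and it assumes the output is written next to the input file and that only the last extension counts as <extension>.

```python
from pathlib import Path


def compressed_path(source):
    """Build <name>_compressed<extension> beside the input file.

    Assumption: same directory as the input, last suffix only.
    """
    path = Path(source)
    return path.with_name(f"{path.stem}_compressed{path.suffix}")


if __name__ == "__main__":
    print(compressed_path("path/to/character-sheet.txt"))
```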

Print third-party notices embedded in the standalone executable:

./rust/tokencount-file/target/release/tc licenses

compress and yolo use bundled WordNet 3.1 synsets. Lookup is done on the lowercased word. Replacements preserve lowercase, Capitalized, and ALLCAPS casing on a best-effort basis, and only replacements that reduce the full-text token count are suggested or applied.
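
The two rules above (case preservation and "only keep replacements that lower the full-text count") can be sketched in Python as follows. This is an illustration under assumptions, not the Rust implementation: `match_case` and `suggest_replacements` are hypothetical names, `synonyms` stands in for the bundled synsets, and `count_tokens` is any callable that returns a token count for a string.

```python
import re


def match_case(original, replacement):
    """Copy lowercase, Capitalized, or ALLCAPS casing onto the replacement."""
    if original.islower():
        return replacement.lower()
    if original.isupper() and len(original) > 1:
        return replacement.upper()
    if original == original.capitalize():
        return replacement.capitalize()
    return replacement  # mixed case: leave the replacement untouched


def suggest_replacements(text, synonyms, count_tokens):
    """Return (word, replacement, tokens_saved) for token-saving swaps.

    Each candidate is judged by re-counting the whole text, so only
    replacements that lower the full-text count are kept.
    """
    baseline = count_tokens(text)
    suggestions = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group()
        # Lookup uses the lowercased word, as described above.
        for candidate in synonyms.get(word.lower(), ()):
            cased = match_case(word, candidate)
            trial = text[: match.start()] + cased + text[match.end():]
            saved = baseline - count_tokens(trial)
            if saved > 0:
                suggestions.append((word, cased, saved))
    return suggestions


if __name__ == "__main__":
    # len() is only a stand-in counter; a real run would use the tokenizer.
    toy_synonyms = {"automobile": ["car", "motorcar"]}
    print(suggest_replacements("My Automobile is red.", toy_synonyms, len))
```

Counting the whole text instead of the single word matters because a tokenizer can merge a word with its neighbours, so a shorter word does not always mean fewer tokens.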

🧪 Validate The Executable

Run the Rust tests:

cargo test --manifest-path rust/tokencount-file/Cargo.toml

The Rust test suite verifies that the bundled tokenizer exactly matches every sample in samples/index.jsonl.

🔬 Python Tokenizer Comparison Tooling

The Python tooling is used to compare candidate tokenizers against known site counts.

Run the main tokenizer comparison:

uv run tokencounter compare --samples samples/index.jsonl --tokenizers tokenizers.json --output results

Run the broader tokenizer sweep:

uv run tokencounter compare --samples samples/index.jsonl --tokenizers tokenizers.broad.json --output results-broad

The comparison writes:

summary.json: tokenizer ranking and aggregate error metrics.
details.csv: per-tokenizer, per-sample counts and errors.
details.jsonl: JSON Lines version of the detailed results.

Run the Python tests:

uv run pytest -v

🧠 Regenerate WordNet Synsets

The Rust executable embeds generated WordNet 3.1 synonym data at compile time. Regenerate the bundled asset with:

uv run python scripts/generate_wordnet_synsets.py

The generator downloads the Princeton WordNet 3.1 database tarball by default, keeps single lowercase alphabetic lemmas, and writes rust/tokencount-file/assets/wordnet/synsets.json.
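
The "single lowercase alphabetic lemmas" filter can be pictured as the predicate below. It is a sketch, not the code in scripts/generate_wordnet_synsets.py: `keep_lemma` is a hypothetical name, and restricting to ASCII letters is an assumption.

```python
def keep_lemma(lemma):
    """Keep single-word, all-lowercase, purely alphabetic lemmas.

    WordNet joins multi-word lemmas with underscores, so isalpha()
    also rejects entries such as "motor_vehicle".
    Assumption: only ASCII letters are accepted.
    """
    return lemma.isascii() and lemma.isalpha() and lemma.islower()


if __name__ == "__main__":
    for lemma in ["car", "motor_vehicle", "Paris", "a-bomb"]:
        print(lemma, keep_lemma(lemma))
```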

WordNet is developed at Princeton University. The generated bundled database is distributed under the WordNet license; see THIRD_PARTY_NOTICES.md, rust/tokencount-file/assets/wordnet/LICENSE.txt, or run tc licenses for the required notice, copyright statement, and disclaimer.

📁 Sample Dataset

Sample metadata lives in samples/index.jsonl.

Each line is one JSON object:

{"id":"sample_001","file":"sample_001.txt","real_token_count":381,"source":"site","counted_at":"2026-05-20"}

Fields:

id: unique sample identifier.
file: sample text file, relative to samples/index.jsonl.
real_token_count: token count reported by the site.
source: where the count came from.
counted_at: ISO date when the count was recorded.
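
For reference, an index in this format can be read with the standard library alone. The sketch below is not the project's loader: `load_samples` is a hypothetical name, and it only uses the id, file, and real_token_count fields described above, resolving file relative to the index.

```python
import json
from pathlib import Path


def load_samples(index_path):
    """Read a JSON Lines sample index into a list of dicts."""
    index_path = Path(index_path)
    samples = []
    with index_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            samples.append(
                {
                    "id": record["id"],
                    # "file" is relative to the index, not the working directory.
                    "path": index_path.parent / record["file"],
                    "real_token_count": int(record["real_token_count"]),
                }
            )
    return samples


if __name__ == "__main__":
    for sample in load_samples("samples/index.jsonl"):
        print(sample["id"], sample["real_token_count"])
```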

Add new samples when you can get reliable site counts. More samples make it easier to detect tokenizer or counting-rule changes.

🧰 Project Layout

samples/                         calibration texts and known counts
src/tokencounter/                Python tokenizer comparison tool
tokenizers.json                  main tokenizer candidates
tokenizers.broad.json            broader exploratory tokenizer candidates
rust/tokencount-file/            Rust standalone token counting executable
scripts/                         developer scripts, including WordNet generation
docs/plans/                      design and implementation notes

🏗️ Cross-Platform Builds

The Rust executable is the preferred distribution path. Rust produces native binaries more reliably than Python freezer tools and is better suited for cross-platform releases.

Local native build:

cargo build --release --manifest-path rust/tokencount-file/Cargo.toml

Future release automation can build targets such as:

x86_64-unknown-linux-gnu
aarch64-unknown-linux-gnu
x86_64-apple-darwin
aarch64-apple-darwin
x86_64-pc-windows-msvc

Cross-compilation can be set up with rustup target add ..., GitHub Actions, or cross.

📦 Forgejo Releases

Manual releases are created with the Forgejo Actions workflow named release.

Run it from Forgejo's Actions UI with workflow_dispatch. The workflow computes the next numeric release tag from existing v<number> tags, starting at v1, then creates a Forgejo release and uploads:

tokencount-file-vN-x86_64-unknown-linux-gnu.tar.gz
tokencount-file-vN-x86_64-pc-windows-gnu.zip

Each archive keeps the tokencount-file release name and contains the executable as tc or tc.exe.

The workflow uses actions/forgejo-release@v2.12.0 from the Forgejo action mirror and expects a repository secret named RELEASE_TOKEN. The token must be a Forgejo application token with write:repository permission so the workflow can create tags, create releases, and upload release assets.
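
The tag numbering described above can be sketched in Python. This is not the workflow's actual script: `next_release_tag` is a hypothetical name, and it assumes the next tag is the highest existing v<number> plus one, with tags of any other shape ignored.

```python
import re


def next_release_tag(existing_tags):
    """Return the next v<number> tag, starting at v1 when none exist.

    Assumption: highest existing number plus one; other tags are ignored.
    """
    numbers = [
        int(m.group(1))
        for tag in existing_tags
        if (m := re.fullmatch(r"v(\d+)", tag))
    ]
    return f"v{max(numbers, default=0) + 1}"


if __name__ == "__main__":
    print(next_release_tag([]))
    print(next_release_tag(["v1", "v2", "nightly"]))
```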

⚠️ Caveats

This is an unofficial estimator.
The site may change its tokenizer or counting rules.
The Rust executable currently embeds the microsoft/Phi-3-mini-4k-instruct tokenizer.
WordNet compression is synonym-based and not context-aware; review suggested replacements before using them in final text.
Distributed copies that include the bundled WordNet-derived database should include THIRD_PARTY_NOTICES.md, an equivalent WordNet license notice, or the standalone executable's licenses command output.
The current exact match is based on the samples in this repository, not a guarantee for all possible character sheets.
Re-run the comparison tooling when adding new samples.