Teich

Agent data infrastructure for generation, normalization, formatting, response masking, and training audits.

Teich turns raw agent sessions, chat datasets, local JSONL, Hugging Face datasets, and in-memory datasets.Dataset objects into auditable SFT data.

It handles the parts that usually break training runs:

normalizing traces into OpenAI-style messages and tools
preserving tool schemas, reasoning, metadata, and provenance
rendering through your target tokenizer's chat template
recording typed supervision spans before tokenization
applying response-only labels after TRL / Unsloth trainer tokenization
reporting dropped, oversized, trimmed, malformed, and fully masked rows

Use it as a trace generator, a dataset loader, a chat-template renderer, a masking layer, or the whole pipeline.

Install

pip install teich

Or run it without installing:

uvx teich --help

Agent trace generation needs Docker and an API key for the configured provider. Preparing an existing local or Hugging Face dataset does not need Docker.

Prefer a browser workflow?

teich studio

See Teich Studio.

Quickstart: Prepare Existing Data

If your dataset already has messages, Teich can usually prepare it directly.

from teich import prepare_data

train_dataset = prepare_data(
    "TeichAI/Claude-Opus-4.6-Reasoning-887x",
    tokenizer,
    max_length=32768,
    oversized_policy="trim_followups",
    tokenize=True,
    chat_template_kwargs={"enable_thinking": True, "preserve_thinking": True},
)

Then create your trainer and call mask_data():

from teich import mask_data

trainer = mask_data(
    trainer,
    tokenizer=tokenizer,
    train_on_reasoning=True,
    train_on_final_answers=True,
    train_on_tools=True,
)

More detail: Preparing Data and Training.

Quickstart: Generate New Traces

teich init my-project
cd my-project

Add prompts to prompts.jsonl:

{"prompt":"Build a simple todo list app in React"}
{"github_repo":"armand0e/perplexica-mcp","prompt":"Add a small usability improvement and update the tests"}
{"prompt":"Draft a compact project plan","follow_up_prompts":["Revise it for a solo developer","Add a risk checklist"]}

Set your provider key and run:

export OPENAI_API_KEY=sk-...
teich generate -c config.yaml

Teich writes raw traces, converted training rows, sandbox snapshots, a compact dataset card, and sometimes tools.json under output/. Use --resume to skip prompts that already completed.

More detail: Generation.

Quickstart: Extract Local Sessions

If you already have local agent sessions, Teich can stage them as an anonymized dataset in one command:

teich extract claude --model fable-5

extract supports claude, codex, cursor, pi, and hermes. It writes anonymized traces to data/ by default using provider-native or recovered session JSONL files. The generated Hugging Face dataset metadata matches **/*.jsonl, so providers such as Cursor can preserve nested project transcript paths. It generates a dataset README.md, and then asks whether to upload the folder to Hugging Face. Use --out / --output to choose another folder.

If the agent store is somewhere other than the default home-directory location, pass it explicitly. --sessions-dir accepts either the agent root, such as .claude, .codex, .pi, or .hermes, or the native store under it, such as .claude/projects, .codex/sessions, .hermes/state.db, or Cursor's workspaceStorage / globalStorage/state.vscdb:

teich extract claude --sessions-dir /path/to/.claude --out data
teich extract claude --sessions-dir /path/to/.claude/projects --out data
teich extract codex --sessions-dir /path/to/.codex --out data
teich extract codex --sessions-dir /path/to/.codex/sessions --out data
teich extract pi --sessions-dir /path/to/.pi --out data
teich extract pi --sessions-dir /path/to/.pi/agent/sessions --out data
teich extract pi --sessions-dir /path/to/.pi/sessions --out data
teich extract hermes --sessions-dir /path/to/.hermes --out data
teich extract hermes --sessions-dir /path/to/.hermes/state.db --out data
teich extract cursor --sessions-dir /path/to/Cursor/User/workspaceStorage --out data
teich extract cursor --sessions-dir /path/to/Cursor/User/globalStorage/state.vscdb --out data

Extraction anonymizes staged traces by default. To keep the raw extracted data unchanged, pass --no-anon or --no-anonymize and review the output carefully before sharing or uploading it.

To convert raw or extracted traces into standalone OpenAI-style JSONL rows that can be consumed without Teich at training time:

teich convert data --out teich-training.jsonl

This writes standalone OpenAI-style rows with prompt, messages, tools, and metadata. Use prepare_data() and mask_data() when you want Teich to handle tokenizer-specific formatting and response-only labels.

What Teich Supports

Use case	Start here
Find command examples and options	CLI Reference
Configure and steer runs in a browser	Teich Studio
Generate Codex, Pi, Claude Code, Hermes, or chat data	Generation
Load local files, folders, Hugging Face datasets, or `datasets.Dataset` objects	Preparing Data
Train with TRL / Unsloth while keeping response-only labels correct	Training
Understand `messages`, `tools`, metadata, and native trace behavior	Data Format
Use `prepare_data`, `mask_data`, `load_traces`, and validation helpers	Python API
See the full generation, preparation, and masking pipeline	Pipeline Flow

Why Teich

Most SFT pipelines flatten agent data too early. That loses tool schemas, tool results, reasoning boundaries, provenance, and the exact assistant spans you meant to train on.

Teich keeps the data structured until the last practical moment:

prompts / traces / JSONL / HF datasets / Dataset objects
        -> load_traces() or prepare_data()
        -> normalized messages + tools
        -> tokenizer chat template rendering
        -> trainer-friendly text + Teich supervision spans
        -> SFTTrainer tokenization
        -> mask_data()
        -> audited input_ids + labels

This makes multi-turn, tool-call, reasoning, and mixed-source datasets trainable without relying on brittle single-span masking.

Common Commands

# Create a generation project
teich init my-project

# Generate data from config.yaml
teich generate -c config.yaml

# Resume an interrupted batch
teich generate -c config.yaml --resume

# Extract, anonymize, and stage local Claude Code traces
teich extract claude --model fable-5 --out data

# Convert staged raw traces to standalone OpenAI-style training JSONL
teich convert data --out teich-training.jsonl

# Launch the local browser UI
teich studio

# Use a local OpenAI-compatible endpoint
TEICH_PROVIDER=LMstudio \
TEICH_MODEL=gemma-4 \
TEICH_BASE_URL=http://localhost:1234/v1 \
TEICH_API_KEY=llm \
teich generate -c config.yaml

Minimal Config

agent:
  provider: codex  # codex, pi, claude-code, hermes, or chat

model:
  model: codex-mini-latest
  approval_policy: never
  sandbox: danger-full-access

prompts_file: prompts.jsonl

output:
  traces_dir: ./output
  sandbox_dir: ./sandbox
  failures_dir: ./failures

publish:
  repo_id: username/my-dataset
  private: false

agent.provider: chat writes structured chat rows directly and does not require Docker. Agent providers preserve raw or native traces as source-of-truth artifacts.

To run Codex on your ChatGPT subscription instead of an API key, set agent.codex.use_host_auth: true (Teich shares your host codex login across containers), and enable Codex fast mode with model.service_tier: fast. See Generation.

Python Entry Points

from teich import (
    prepare_data,
    mask_data,
    load_traces,
    detect_trace_type,
    validate_tool_calls,
    row_fits_context,
    trace_is_complete,
    preview_sft_example,
)

See Python API for the full public surface.

Status

Teich is alpha. The core trace, preparation, masking, and audit workflow is usable, but APIs may evolve as more agent formats and training flows are added.

Development

uv pip install -e ".[dev]"
uv run pytest --ignore=tests/test_integration.py -q

License

Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 189 Commits
.github/workflows		.github/workflows
assets		assets
docker		docker
docs		docs
examples		examples
src/teich		src/teich
tests		tests
.gitignore		.gitignore
DOCS.md		DOCS.md
LICENSE		LICENSE
README.md		README.md
agent_prompts.jsonl		agent_prompts.jsonl
chat_prompts.jsonl		chat_prompts.jsonl
config.example.yaml		config.example.yaml
config.yaml		config.yaml
gemma-template.jinja		gemma-template.jinja
gemma4_example.py		gemma4_example.py
prompts.jsonl		prompts.jsonl
pyproject.toml		pyproject.toml
teich_example.py		teich_example.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Teich

Install

Quickstart: Prepare Existing Data

Quickstart: Generate New Traces

Quickstart: Extract Local Sessions

What Teich Supports

Why Teich

Common Commands

Minimal Config

Python Entry Points

Status

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Teich

Install

Quickstart: Prepare Existing Data

Quickstart: Generate New Traces

Quickstart: Extract Local Sessions

What Teich Supports

Why Teich

Common Commands

Minimal Config

Python Entry Points

Status

Development

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages