Docs

Two ways in: the zero-dependency Python / JS SDK for explicit logging, and the tracehouse CLI agent that records Claude Code & Codex automatically. Sections 11–12 cover A/B experiments and the CLI.

1. Install

Requires Python 3.9+. The PyPI distribution is tracehouse-sdk; the import path stays tracehouse.

shell (bash)

pip install tracehouse-sdk

2. Get an API key

Keys are minted by BetterAuth and start with ba_. They work from any machine and don't expire by default. Copy the key when it is created; the plaintext is never shown again.

3. Environment

.env (bash)

# Required:
export TRACEHOUSE_API_KEY="ba_..."

# Optional — override the default https://tracehouse.ai
export TRACEHOUSE_API_BASE="https://tracehouse.ai"
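
If application code needs to read the same two variables itself, for example to fail fast before a long job starts, a minimal sketch might look like the following. The helper name load_tracehouse_env is hypothetical and not part of the SDK; the ba_ prefix check and the default base URL come from the sections above.

```python
import os

DEFAULT_API_BASE = "https://tracehouse.ai"  # default stated in the docs above


def load_tracehouse_env(env=None):
    """Return (api_key, api_base) from an environment mapping.

    Hypothetical helper for illustration; it is not part of the SDK.
    """
    env = os.environ if env is None else env
    api_key = env.get("TRACEHOUSE_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("TRACEHOUSE_API_KEY is not set")
    if not api_key.startswith("ba_"):
        raise RuntimeError("TRACEHOUSE_API_KEY should start with 'ba_'")
    # Fall back to the default when the override is unset or empty.
    api_base = env.get("TRACEHOUSE_API_BASE", "").strip() or DEFAULT_API_BASE
    return api_key, api_base.rstrip("/")
```

Passing a plain dict instead of os.environ keeps the check easy to unit-test.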

4. Chat traces — Run / cm.init / cm.log_*

A module-level, wandb-style surface for Claude Code-style sessions. Each session_id is resumable: posting again with the same id from the same machine reuses the trace row.

trace.py (python)

import tracehouse as cm

cm.init(
    project="my-bot",
    session_id="run-001",          # resumable: same id reuses the trace
    api_key=...,                   # or set TRACEHOUSE_API_KEY env
)
cm.log_user("hello")
cm.log_assistant("hi back")
cm.log_tool_use("Read", {"file_path": "x.py"})
cm.log_tool_result("contents of x.py …")
cm.finish(outcome="good", metadata={"model": "claude-sonnet-4-6"})

5. Training runs — cm.init_run

Parallel entity to traces. Logs scalar metrics into a fast float-typed column; lists / dicts (gradient norms, lr schedules, histograms) go to JSONB and render as bar charts. Metrics are idempotent on (run_id, key, step) — safe to retry on flaky networks.

train.py (python)

import tracehouse as cm

run = cm.init_run(
    project="demo",
    name="qwen-sft-v1",
    config={
        "lr": 1e-4,
        "batch": 32,
        "base_model": "Qwen/Qwen2.5-0.5B",
    },
)

for step in range(1000):
    # loss and acc are placeholders for values from your own training step.
    run.log({"train/loss": loss, "eval/acc": acc}, step=step)

    # Lists / dicts go into a JSON column → rendered as a bar chart.
    run.log({"grad/norm_hist": [0.1, 0.2, 0.4, 0.3]}, step=step)

run.link_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
run.link_model("your-handle/my-finetune")
run.add_artifact("hparams", data={"warmup_ratio": 0.03, "weight_decay": 0.0})
run.finish(status="finished")
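
To see why idempotency on (run_id, key, step) makes retries safe, here is a toy in-memory model of the behaviour described above. It is a sketch under assumptions, not the backend's implementation: the MetricStore class and its two dicts are invented for illustration, standing in for the float column and the JSON column.

```python
import math


class MetricStore:
    """Toy in-memory model of an idempotent metric table (not the real backend)."""

    def __init__(self):
        self.scalars = {}     # (run_id, key, step) -> float
        self.structured = {}  # (run_id, key, step) -> list or dict

    def log(self, run_id, metrics, step):
        """Upsert every metric; return how many points were stored."""
        stored = 0
        for key, value in metrics.items():
            row = (run_id, key, step)
            if isinstance(value, (list, dict)):
                self.structured[row] = value
            elif isinstance(value, (int, float)) and not isinstance(value, bool):
                if not math.isfinite(value):
                    continue  # NaN / Inf points are dropped
                self.scalars[row] = float(value)
            else:
                raise TypeError(f"unsupported metric type for {key!r}")
            stored += 1
        return stored
```

Because each point is keyed by (run_id, key, step), sending the same payload twice overwrites the same rows instead of appending duplicates, which is what makes a blind retry harmless.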

The context manager finishes the run automatically and marks it crashed if an exception is raised:

ctx.py (python)

# Auto-finish on exit. Exceptions → status="crashed".
with cm.TrainingRun(project="demo", name="exp-42") as run:
    for step, batch in enumerate(loader):
        loss = train_step(batch)
        run.log({"train/loss": loss}, step=step)
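
The finish-or-crash behaviour follows the standard Python context manager protocol. The sketch below shows the general pattern with a stand-in class; SketchRun is hypothetical and says nothing about how cm.TrainingRun is written internally.

```python
class SketchRun:
    """Toy stand-in for a training run; records only its final status."""

    def __init__(self, name):
        self.name = name
        self.status = "running"

    def finish(self, status="finished"):
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # exc_type is None on a clean exit, otherwise the exception class.
        self.finish("crashed" if exc_type is not None else "finished")
        return False  # never swallow the exception


with SketchRun("exp-42") as ok_run:
    pass  # training loop would go here

try:
    with SketchRun("exp-43") as bad_run:
        raise ValueError("simulated training failure")
except ValueError:
    pass  # the error still reaches the caller
```

Returning False from __exit__ matters: the status is recorded, but the original exception keeps propagating so the failure is not hidden.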

6. Media — images & videos

Log images and videos to a run with cm.Image / cm.Video — either inside run.log({…}) next to metrics, or via run.log_image / run.log_video. They appear under the run's Media tab, grouped by key. Bytes are sent raw (no base64), capped at 25 MB per item.

media.py (python)

import tracehouse as cm

run = cm.init_run(project="demo", name="qwen-sft-v1")

# Log images / videos right next to metrics. cm.Image accepts a file path,
# raw bytes, a PIL image, or a numpy array (Pillow needed only for arrays).
run.log(
    {"loss": loss, "samples": cm.Image("out/epoch3.png", caption="epoch 3")},
    step=3,
)

# Or log media explicitly:
run.log_image("val/grid", "preview.png", caption="val grid", step=10)
run.log_video("rollout", "clip.mp4", step=10)

# cm.Video takes a file path or raw bytes (mp4 / webm / mov).
# Media shows up under the run's Media tab. Limit: 25 MB per item.
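
Since items over the limit are capped at 25 MB, it can be convenient to check the size locally before logging. This is a minimal sketch with a hypothetical helper name; whether the server counts 25 MB as 25 × 1024 × 1024 bytes or 25,000,000 bytes is not stated above, so the constant is an assumption.

```python
import os

MAX_MEDIA_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" read as 25 MiB


def media_fits(source, limit=MAX_MEDIA_BYTES):
    """Return True if a file path or a raw bytes payload is within the limit."""
    if isinstance(source, (bytes, bytearray)):
        size = len(source)
    else:
        size = os.path.getsize(source)  # raises OSError if the path is missing
    return size <= limit
```

A caller could skip or downscale an image when media_fits returns False rather than discovering the rejection after upload.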

7. Push model card to HuggingFace

The server renders a model card from the run's name, config, final metrics, and linked refs — then pushes it as README.md via the HF Hub commit RPC. Needs a write-scope HF token saved on your profile.

push.py (python)

# Requires:
#   1. run.link_model(...) earlier so we know the repo
#   2. A write-scope HF token saved in /profile
resp = run.push_model_card(commit_message="Initial card from tracehouse")
print(resp["commit_url"])  # https://huggingface.co/.../commit/<sha>
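
To make the four inputs concrete, here is a rough sketch of what rendering a card from a run's name, config, final metrics, and linked refs could look like. The function and its Markdown layout are invented for illustration; the server's actual template is not documented here and will differ.

```python
def render_model_card(name, config, final_metrics, dataset=None, base_model=None):
    """Build a small Markdown model card. Hypothetical layout, not the server's."""
    lines = [f"# {name}", ""]
    refs = []
    if base_model:
        refs.append(f"- Base model: {base_model}")
    if dataset:
        refs.append(f"- Dataset: {dataset}")
    if refs:
        lines += refs + [""]
    lines += ["## Config", ""]
    lines += [f"- `{key}`: {value}" for key, value in sorted(config.items())]
    lines += ["", "## Final metrics", ""]
    lines += [f"- `{key}`: {value:.4g}" for key, value in sorted(final_metrics.items())]
    return "\n".join(lines) + "\n"
```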

8. Logging

The SDK uses stdlib logging under the tracehouse logger. INFO covers lifecycle events; DEBUG adds every HTTP request and response with byte counts; WARNING fires on API errors and dropped (NaN/Inf) metric points.

setup.py (python)

import logging

# Library doesn't call basicConfig — set it up once in your application.
logging.basicConfig(level=logging.INFO, format="%(name)s %(message)s")

# Verbose HTTP: every → / ← request, byte counts, dropped points.
logging.getLogger("tracehouse").setLevel(logging.DEBUG)
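
Because this is plain stdlib logging, the verbose output can also be routed to its own destination without touching the root logger. The sketch below sends everything under one logger name to a separate stream; the child logger name tracehouse.http and the logged message are made up purely to show propagation, and the SDK's real logger names may differ.

```python
import io
import logging


def attach_debug_stream(stream, logger_name="tracehouse"):
    """Send DEBUG and above from one logger tree to `stream`; return the handler."""
    handler = logging.StreamHandler(stream)
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return handler


buffer = io.StringIO()
handler = attach_debug_stream(buffer)
# "tracehouse.http" is a made-up child name, used only to show propagation.
logging.getLogger("tracehouse.http").debug("POST /v1/example 123 bytes")
logging.getLogger("tracehouse").removeHandler(handler)  # clean up after the demo
```

Child loggers inherit the effective level of their parent, so setting DEBUG once on the top-level name is enough for the whole tree.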

9. Reinforcement learning — runs + rollouts

Log a run's metrics and its per-step rollout conversations together. run.rollout(step=…) opens a chat trace already linked to the run, so each rollout shows up under the run's Rollouts tab (step → trace). It returns a normal Run and inherits the run's auth: an anonymous run produces anonymous rollouts under the same identity, and one claim link covers both.

rl.py (python)

import tracehouse as cm

run = cm.init_run(project="rl", name="ppo-v1", config={"lr": 1e-5})

for step in range(1000):
    # One chat trace per rollout, tied to this run + step.
    # state, action, reward and kl are placeholders from your own RL loop.
    with run.rollout(step=step) as t:
        t.log_user(state)
        t.log_assistant(action)
        t.log_tool_result(f"reward={reward}")
    run.log({"reward": reward, "kl": kl}, step=step)   # metrics on the run

run.finish()
# The run page gets a "Rollouts" tab: step 0 → trace, step 1 → trace, …

10. Drop-in for wandb

tracehouse ships a wandb-compatible surface under tracehouse.wandb. Swap the import, or override sys.modules["wandb"] to redirect existing import wandb code with no edits. Parity covers init / log / config / summary / finish / Image / Video / Histogram / define_metric.

wandb_override.py (python)

# Option A — swap the import (new code):
from tracehouse import wandb          # or: import tracehouse.wandb as wandb

# Option B — override existing `import wandb` everywhere, zero edits.
# Put this before the first `import wandb` runs:
import sys, tracehouse.wandb
sys.modules["wandb"] = tracehouse.wandb

# Either way, the usual wandb call sites just work:
run = wandb.init(project="demo", name="qwen-sft", config={"lr": 1e-4})
for step in range(1000):
    wandb.log({"train/loss": loss}, step=step)
    wandb.log({"samples": wandb.Image("out.png")}, step=step)
wandb.config.update({"warmup_ratio": 0.03})
wandb.summary["best_loss"] = best_loss
wandb.finish()
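
Option B works because Python's import machinery consults sys.modules before searching for a package on disk. The following self-contained sketch demonstrates the mechanism with a stand-in module; the names shim_wandb and wandb_demo are invented so the example runs without tracehouse or wandb installed and leaves a real wandb install untouched.

```python
import importlib
import sys
import types

# Stand-in for tracehouse.wandb so the sketch runs without either package.
shim = types.ModuleType("shim_wandb")
shim.calls = []
shim.log = lambda data, step=None: shim.calls.append((data, step))

# Register the shim under the name that later imports will ask for.
sys.modules["wandb_demo"] = shim

# Same lookup that a plain `import wandb_demo` statement performs.
wandb_demo = importlib.import_module("wandb_demo")
wandb_demo.log({"train/loss": 0.5}, step=0)
```

This also explains the ordering requirement: a module that ran `import wandb` before the override already holds a reference to the original module object, and replacing the sys.modules entry afterwards does not rebind it.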

11. Experiments & subjects (A/B cohorts)

Tag any run or trace with three cohort dimensions: subject_id (the end user the agent acted for), experiment (an A/B test name) and variant (the arm, e.g. control vs treatment). There's nothing to set up first — the experiment row and its variants materialize automatically on the first tagged run.

Python

cohorts.py (python)

import tracehouse as cm

# Tag a run with three cohort dimensions. Nothing to pre-create:
# the experiment + variant materialize on the first tagged run.
run = cm.init_run(
    project="agent-evals",
    name="run-001",
    subject_id="customer-42",      # the end user this run acted for
    experiment="prompt-rewrite",   # the A/B test name
    variant="treatment",           # this run's arm (vs "control")
    config={"model": "claude-sonnet-4-6"},
)
run.log({"reward": reward}, step=0)
run.finish(status="finished")

# Chat traces take the same dimensions, at init or at finish:
cm.init(project="agent-evals", session_id="s-1",
        subject_id="customer-42", experiment="prompt-rewrite", variant="control")
cm.finish(outcome="good")
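
In the examples above the caller passes variant explicitly, so how an arm is chosen is up to your code. One common approach, shown here as a sketch rather than anything the SDK provides, is to hash the subject and experiment names so the same subject always lands in the same arm. The helper name assign_variant is hypothetical.

```python
import hashlib


def assign_variant(subject_id, experiment, variants=("control", "treatment")):
    """Map a subject to a stable arm by hashing (experiment, subject_id)."""
    digest = hashlib.sha256(f"{experiment}:{subject_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

The result could then be passed as variant=assign_variant("customer-42", "prompt-rewrite"). Including the experiment name in the hash keeps assignments independent across experiments.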

JavaScript / TypeScript

cohorts.ts (javascript)

import { initRun, init, finish } from "@tracehouse/sdk";

const run = await initRun({
  project: "agent-evals",
  name: "run-001",
  subjectId: "customer-42",        // end user this run acted for
  experiment: "prompt-rewrite",    // A/B test name
  variant: "treatment",            // this run's arm (vs "control")
  config: { model: "claude-sonnet-4-6" },
});
await run.log({ reward }, 0);
await run.finish({ status: "finished" });

// Chat traces accept the same dimensions (camelCase) at init or finish:
await init({ project: "agent-evals", sessionId: "s-1",
  subjectId: "customer-42", experiment: "prompt-rewrite", variant: "control" });
await finish({ outcome: "good" });

The backend reduces each variant to a cohort aggregate (flag / error / loop rates, cost, latency) and reports control-relative deltas with deterministic significance tests (Wilson intervals, z-test for proportions, Welch's t for means), with no LLM in the loop. The web UI renders this at /experiments; per-user rollups live at /subjects.

compare.sh (bash)

# Stats are computed server-side — no LLM, fully deterministic.
# GET the comparison for an experiment by name:
curl -H "Authorization: Bearer $TRACEHOUSE_API_KEY" \
  https://tracehouse.ai/v1/experiments/prompt-rewrite/compare

# -> per-variant cohort aggregates (flag/error/loop rates, cost, latency)
#    + control-relative deltas with Wilson / z / Welch significance.
# The UI renders this at /experiments; per-user rollups live at /subjects.
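
For readers who want to sanity-check the reported numbers, the three named tests have standard textbook forms. The sketch below implements them with the standard library only; it is not the backend's code, and details such as the confidence level (z = 1.96 here) or whether the z-test pools variances are assumptions.

```python
import math
import statistics


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(x_control, n_control, x_treat, n_treat):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (x_control + x_treat) / (n_control + n_treat)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treat))
    if se == 0:
        return 0.0, 1.0  # both arms all-success or all-failure
    z = (x_treat / n_treat - x_control / n_control) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


def welch_t(control, treatment):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(control) / len(control)
    vb = statistics.variance(treatment) / len(treatment)
    t = (statistics.mean(treatment) - statistics.mean(control)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(control) - 1) + vb**2 / (len(treatment) - 1))
    return t, df
```

Turning Welch's t and its degrees of freedom into a p-value needs the Student's t distribution, which the standard library does not provide, so the sketch stops at the statistic.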

12. CLI agent — record Claude Code & Codex automatically

The tracehouse CLI is a local, zero-config recorder. It installs a PreToolUse hook into ~/.claude/settings.json and runs a small background daemon that tails the JSONL transcripts Claude Code & Codex already write — parsing each prompt, thought, tool call and result into spans and shipping them to the backend as live traces. PII is redacted locally before anything leaves the machine. No SDK calls, no code changes to your agents.
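
The parse-then-redact step can be pictured with a small sketch. Everything specific here is an assumption for illustration: the {"type": ..., "text": ...} record shape is invented and is not the real Claude Code or Codex transcript schema, and an email regex covers only one narrow kind of PII, whereas the CLI's actual redaction rules are not documented in this section.

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text):
    """Replace email addresses with a placeholder (one narrow kind of PII)."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def parse_transcript(lines):
    """Turn JSONL lines into span dicts, skipping blank or malformed lines.

    The {"type": ..., "text": ...} record shape is invented for this sketch.
    """
    spans = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a partially written last line is common when tailing
        if not isinstance(record, dict):
            continue
        spans.append({
            "kind": record.get("type", "unknown"),
            "text": redact(str(record.get("text", ""))),
        })
    return spans
```

Redacting inside the parser, before a span is ever queued for upload, is what keeps the raw text on the local machine.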

Install

install.sh (bash)

# One line — detects Claude Code & Codex, installs the binary,
# patches ~/.claude/settings.json, and starts the background daemon.
curl -fsSL https://tracehouse.ai/install.sh | bash

# Equivalent manual path:
tracehouse login --api-key ba_...   # save key + backend URL (once per machine)
tracehouse install                  # add the PreToolUse hook + background service
tracehouse agent                    # start tailing transcripts (`--once` for one pass)

Commands

tracehouse --help (bash)

tracehouse              # interactive setup / status TUI (no subcommand)
tracehouse login        # save API key + backend URL locally (wandb-style)
tracehouse install      # patch settings.json hook + install the daemon
tracehouse uninstall    # remove the hook + stop/remove the service
tracehouse agent        # long-running daemon: watch JSONL roots, ship spans
tracehouse agent --once # single sync pass, then exit (good for cron / CI)
tracehouse doctor       # diagnostics: key valid? machine_id? last sync?
tracehouse hook         # internal: PreToolUse handler invoked by Claude Code

Run tracehouse with no subcommand for an interactive setup / status TUI, or tracehouse doctor to confirm the key is valid and spans are syncing. Connected machines and their last-sync status show up at /devices.