Auto Lab


Every existing AI training system is built for humans to operate. A human writes the config, watches the dashboard, decides when to stop, launches the next experiment. Everything inside the training step has been automated: forward pass, backward pass, data pipeline. The research decisions between steps are still manual. Not because they can't be automated, but because the interfaces were never designed for anything else.

A coding agent can already SSH in, tail a log, edit a YAML file, and relaunch a run. It just can't do it fast enough, reliably enough, or with persistent state. Three gaps compound; a sketch of the interface that closes them follows the list:

Latency. A coding agent driving training over SSH spends 30-90 seconds per action: parsing dashboards, writing bash, watching execution. The control plane collapses that overhead to a tool call that returns structured data in milliseconds.

Reliability. Forking a run is five fragile bash commands in sequence, any one of which can fail halfway and leave the system inconsistent. The control plane makes fork atomic: it either returns a new run ID or it fails cleanly.

State. Training campaigns span days. Agent sessions don't. The control plane externalizes the experiment tree, decision log, and per-run metadata into a store the next agent session can query from scratch.
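
A minimal sketch of the interface these three gaps call for. Everything here is hypothetical: `ControlPlane`, `RunStatus`, and the method names are illustrative stand-ins, not Auto Lab's documented API.

```python
# Hypothetical sketch only: ControlPlane, RunStatus, and these method names are
# illustrative stand-ins, not Auto Lab's documented API.
from dataclasses import dataclass, field


@dataclass
class RunStatus:
    run_id: str
    step: int
    metrics: dict[str, float]  # e.g. {"reward": 0.41, "kl": 0.02}


@dataclass
class ControlPlane:
    # State: runs, tree, and decision_log live here, outside any agent session,
    # so a fresh session can query them from scratch.
    runs: dict[str, RunStatus] = field(default_factory=dict)
    tree: dict[str, str] = field(default_factory=dict)  # child run -> parent run
    decision_log: list[str] = field(default_factory=list)

    def status(self, run_id: str) -> RunStatus:
        # Latency: one call, structured data back; no dashboards, no bash.
        return self.runs[run_id]

    def fork(self, run_id: str, checkpoint: int, overrides: dict[str, float]) -> str:
        # Reliability: validate before mutating, so the call either returns a new
        # run ID or raises cleanly. There is no half-forked state to clean up.
        if run_id not in self.runs:
            raise KeyError(f"unknown run {run_id!r}")
        child = f"{run_id}/fork@{checkpoint}"
        self.runs[child] = RunStatus(run_id=child, step=checkpoint, metrics={})
        self.tree[child] = run_id
        self.decision_log.append(f"forked {run_id} at {checkpoint}, overrides={overrides}")
        return child
```

The point is the shape, not the fields: each gap turns into a property of the interface instead of a discipline the agent has to maintain over SSH.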

You can deploy a web app with SSH and shell scripts. Or you can use Kubernetes. Both get the app running. One of them is infrastructure.

The capability this unlocks is not faster AutoML. AutoML searches a fixed grid. An agent operating the control plane writes research procedures: kill a run on a trend, fork from a specific checkpoint, modify a schedule in flight, decide which branch to continue based on live telemetry. The action space isn't larger; it's compositional. AutoML searches a space; the agent writes the search.
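
To make "writes the search" concrete, here is the kind of procedure an agent could compose out of primitives like the ones sketched above. The thresholds and override names are invented for illustration, not taken from the project.

```python
def research_step(cp: ControlPlane, run_id: str, warmup_steps: int = 500) -> str:
    """One agent-authored procedure over live telemetry. Thresholds are invented."""
    s = cp.status(run_id)
    if s.step < warmup_steps:
        return "wait"                                # don't react to warmup noise
    if s.metrics.get("kl", 0.0) > 0.1:
        cp.decision_log.append(f"{run_id}: KL diverging, kill")
        return "kill"
    if s.metrics.get("reward", 0.0) > 0.4:
        # Fork from the current checkpoint with a modified schedule; later
        # telemetry decides which branch to continue.
        return cp.fork(run_id, s.step, {"lr": 5e-6})
    return "continue"
```

Nothing in the substrate special-cases any of these branches; the composition is the agent's.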

Auto Lab is such a primitive: a control plane that lets an AI agent operate RL training runs through a structured tool API instead of shell commands. It exposes atomic operations the agent can call, a telemetry stream whose cadence the agent controls, and an experiment store that survives across agent sessions. In a documented autonomous run, the agent waited through warmup before intervening, forked aggressively when trends supported it, diagnosed an eval truncation bug by cross-referencing training samples against eval output, and reported the resulting limitation honestly when the budget expired. The substrate works. The next slices are the obvious extensions: multi-trainer concurrency, richer fork semantics, agent operators that span weeks.
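
The cadence point is worth a sketch of its own: the agent, not a dashboard, decides how often telemetry arrives. Again, `watch` is a hypothetical helper over the sketched `ControlPlane` above, not a documented Auto Lab call.

```python
import time
from typing import Iterator


def watch(cp: ControlPlane, run_id: str, every_s: float,
          until_step: int) -> Iterator[RunStatus]:
    """Poll at an agent-chosen cadence: tight during an intervention, loose overnight."""
    while True:
        s = cp.status(run_id)
        yield s
        if s.step >= until_step:
            return
        time.sleep(every_s)
```

An agent might run it with `every_s=5` around a risky schedule change and `every_s=600` while a stable run trains overnight.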

Before a system can improve its own training, it has to operate its own training. The gap is not intelligence; it's the interface. Redesign the interface for an AI operator and you unlock a qualitatively different capability. This is the infrastructure primitive that self-accelerating AI R&D needs.


https://github.com/arighosh05/auto-lab