KARPATHY AND THE NEW EMPIRICIST
Autoresearch and the shifting boundary between problem formulation and execution
Francis Bacon argued in the Novum Organum that knowledge advances not through deduction from first principles but through direct, systematic engagement with nature. The experimenter occupies a privileged position in this account. To observe, to measure, to revise one's hypothesis in light of what the data reveals constitutes a mode of inquiry that much of technical work has organized itself around for four centuries.
Andrej Karpathy's autoresearch suggests that this premise warrants revision.
Concept Space, Not Parameter Space
The instinct is to file autoresearch alongside Bayesian optimization and neural architecture search, tools that automate hyperparameter sweeps. Karpathy himself pushed back on this reading. Neural architecture search, he wrote, is "such a weak version of this that it's in its own category of totally useless by comparison."
AutoML methods search a predefined parameter space. The agent in autoresearch reads code, forms a hypothesis about why loss is high, modifies the implementation, observes the result, and uses that observation to decide what to try next. The search is not over numerical values but over conceptual possibilities: ideas about what might be wrong and how to fix it.
The five-minute wall-clock budget makes this tractable. It keeps the feedback loop tight enough for the agent to iterate at the rate of hypothesis generation rather than waiting for full training convergence. The agent gets a fast, noisy signal, decides whether to keep or discard the change, and moves on. Over two days, this rhythm produced 700 experiments. Not 700 random samples from a grid, but 700 sequential acts of interpretation and response.
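The keep-or-discard rhythm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not autoresearch's actual code: the names keep_or_discard, toy_propose, and toy_evaluate are hypothetical, the proposal step is a random perturbation standing in for an agent's reasoning, and the score is a deterministic toy function standing in for a noisy, fixed-budget training run.

```python
import random


def keep_or_discard(propose, evaluate, baseline, n_experiments):
    """Greedy sequential search: keep a change only if it lowers the score."""
    best = baseline
    best_score = evaluate(best)  # lower is better, as with bits per byte
    log = []
    for _ in range(n_experiments):
        # The next idea is conditioned on the current best and on past results.
        candidate = propose(best, log)
        # Stands in for one fixed-budget training run (assumption).
        score = evaluate(candidate)
        kept = score < best_score
        log.append((candidate, score, kept))
        if kept:
            best, best_score = candidate, score
        # A discarded candidate is simply dropped; the baseline is unchanged.
    return best, best_score, log


# Hypothetical toy stand-ins so the sketch runs end to end.
rng = random.Random(0)


def toy_propose(current, log):
    # A real agent would read code and results; here we just perturb a number.
    return current + rng.uniform(-1.0, 1.0)


def toy_evaluate(x):
    # Toy objective with its minimum at x = 3.0.
    return (x - 3.0) ** 2


best, best_score, log = keep_or_discard(toy_propose, toy_evaluate, 0.0, 700)
```

What the toy leaves out is the point of the essay: in the real system the propose step is an act of interpretation, not a random draw. The surrounding loop, though, is this simple.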
The Experiment-Running Machine
The gap between what the agent found and what Karpathy contributed is instructive. Karpathy chose validation bits per byte as the metric and set the five-minute budget. To give the agent a tractable space to explore, he designed nanochat around a single complexity dial, the depth of the transformer, from which all other hyperparameters derive. He constrained the workflow to a branch-based structure in which failures were reversible. Each of these decisions required domain understanding, taste, and strategic judgment. In short, Karpathy designed the experiment-running machine, and the agent ran the experiments.
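To make the idea of a single complexity dial concrete, here is a small Python sketch in which every other setting is computed from depth. The specific rules (64 units of width per layer, one attention head per layer, a learning rate that shrinks with the square root of depth) are invented for illustration and are not taken from nanochat; ModelConfig and config_from_depth are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    depth: int
    width: int
    n_heads: int
    learning_rate: float


def config_from_depth(depth: int) -> ModelConfig:
    """Derive every hyperparameter from one dial (illustrative rules only)."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    width = 64 * depth                    # assumption: 64 channels per layer
    n_heads = depth                       # assumption: head dimension fixed at 64
    learning_rate = 0.02 / depth ** 0.5   # assumption: decay with sqrt(depth)
    return ModelConfig(depth, width, n_heads, learning_rate)


small = config_from_depth(4)
large = config_from_depth(12)
```

The design choice is what matters: with one dial, the agent cannot wander into inconsistent configurations, and any two runs can be compared along a single axis.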
This decomposition of intellectual work is both specific and replicable. On one side sits everything that can be structured as a tight feedback loop with measurable outcomes. On the other sits the work of deciding what to optimize, how to measure it, what the search space should look like, and what constraints to impose.
Five Minutes and a Score
The pattern generalizes wherever someone can build the equivalent of a five-minute training run with an unambiguous evaluation signal.
A/B testing for marketing copy already has this structure. When a visitor lands on a page, the outcome is immediate and binary: they either convert or they don't. The metric is straightforward, and while the space of possible copy variations is vast, it remains searchable.
Drug candidate screening fits the same template. Instead of waiting for clinical results, one can run simulations or binding assays that return a score far sooner. The metric becomes predicted affinity rather than conversion rate, and the search space shifts from headline variations to molecular modifications.
In both cases, the underlying architecture is identical: rapid iteration, clear scoring, and systematic exploration. The human sets the objective, and the agent explores the space.
Bacon, Inverted
For much of the history of technical work, the experimenter's advantage was access to the trial itself. The person who could run the experiment, observe the result, and revise the hypothesis occupied a position that reasoning alone could not substitute for. What autoresearch demonstrates is that the trial-running is becoming automatable, and with it the cognitive style that empiricism has long elevated. The new empiricist does not run the experiment. The new empiricist decides what question the experiment should answer, and builds the machine that answers it.