I’ve spent most of my career deciding what to build. This was an experiment in getting back to the how: sitting down with a real dataset, a vague hypothesis, and a coding agent, and seeing how far I could push a prediction model in a weekend rather than a quarter.

The short version: the modelling was never the bottleneck. Framing the question, cleaning the inputs, and knowing which result to trust — that’s where the time went, and that’s exactly the part that doesn’t automate away.

The setup

I started with a messy, real-world dataset and a single question I actually cared about. Rather than reaching for a notebook and stalling on boilerplate, I described the shape of the problem in plain language and let the agent scaffold the pipeline — ingestion, features, a baseline, an evaluation harness — then argued with it until the baseline was honest.
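To make "honest baseline" concrete, here is a minimal sketch of the idea rather than the pipeline from that weekend. It assumes a binary classification target, and the labels and helper names (`majority_baseline`, `accuracy`) are hypothetical. The point is only that any model has to beat the dumbest possible predictor on held-out data before its score means anything.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _row: most_common_label


def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction equals the true label."""
    correct = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return correct / len(labels)


# Hypothetical toy data: an imbalanced target, as real datasets often have.
train_labels = [0, 0, 0, 1, 0, 0, 1, 0]
test_rows = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}, {"x": 5}]
test_labels = [0, 0, 1, 0, 0]

baseline = majority_baseline(train_labels)
baseline_accuracy = accuracy(baseline, test_rows, test_labels)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

On this toy split the do-nothing baseline already scores 0.80, so a model reporting 82% accuracy would be far less impressive than it sounds.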

The model was the easy 20%. The other 80% was still judgement.

What actually moved the needle

Progress came from the boring, human moves: defining the target properly, catching leakage before it flattered the numbers, and choosing a metric that matched the decision the model was meant to serve. The agent made all of that faster to try, which meant I could run ten framings in the time one used to take.
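One cheap leakage check can be sketched as follows. It is a heuristic, not a proof, and the column names, data, and 0.95 threshold are all made up for illustration: flag any feature that correlates almost perfectly with the target, then ask by hand whether that feature would really be known at prediction time.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(columns, target, threshold=0.95):
    """Names of features whose |correlation| with the target meets the threshold."""
    return sorted(
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    )


# Hypothetical churn example: "refund_issued" is only recorded after a
# customer has already churned, so it leaks the answer.
toy_target = [0, 1, 0, 1, 1, 0]
toy_columns = {
    "tenure_months": [24, 3, 6, 20, 2, 18],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}

flagged = suspicious_features(toy_columns, toy_target)
print("review before trusting:", flagged)
```

A flagged column is a prompt for a question, not an automatic drop. Some features are legitimately that predictive, and plenty of leaks (a badly chosen split, a target-derived aggregate) never show up as a single high correlation.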

The takeaway I keep coming back to: the ceiling on this kind of work isn’t typing speed or library knowledge anymore. It’s taste — knowing which question is worth asking, and which answer is worth believing.