Here’s a question nobody is answering well: should we deploy code that no human has fully read?
Not code that was reviewed by a human glancing at a diff, checking that it looks roughly right, approving it because the tests pass. I mean code where no one on your team traced the logic end to end, verified the architectural decisions, or confirmed that the new service follows the same conventions as the eleven other services that came before it. Code that was generated, tested green, and shipped.
This is already happening. Not at experimental startups performing AI stunts for social media engagement, but at “normal” companies, with “normal” engineers, who use Claude Code, Codex, Cursor, or Copilot to generate a resource, watch the tests pass, and merge. The generation problem is effectively solved. An LLM can write a CRUD service, a migration, a set of hooks, and an auth flow. It can do it in seconds. The output is usually syntactically correct, often functionally correct, and occasionally architecturally correct.
That last part is where everything collapses. “Architecturally correct” means code that doesn’t just work but works the way everything else in the project works: code that uses the same patterns, the same file structure, the same hook execution order, and the same naming convention, and that handles errors the same way. A human senior engineer enforces this through taste and context. An LLM enforces it through pattern-matching on whatever it can see. And what it can see is usually not enough.
The result is code that passes every test and is still wrong. Not wrong enough to crash. Wrong enough to create inconsistency, technical debt, and subtle bugs that surface weeks later when someone, human or agent, tries to build on top of it. This is the trust problem, and it is now the bottleneck in agent-driven development. Not generation quality. Not model capability. Trust.
Predictability is a design choice
The first instinct most people have when their agent-generated code is unreliable is to want a better model. A smarter agent would indeed produce better code. But you cannot control the model; you can only control the environment in which the model operates.
I learned this the hard way. I spent several weeks building Feathers BaaS, an open-source CLI that scaffolds production-ready Node.js backends. My original goal was speed. I wanted to give a solo developer, or a small team, a working backend in less than five minutes. But the deeper I got, the more I realized the real value wasn’t speed. It was constraint.
A few things I discovered that I didn’t expect:
Rigid directory conventions beat flexible architecture: Every service Feathers BaaS generates follows the same four-file pattern: schema, class, hooks, service. No exceptions, no variations, no optional extras. An agent looking at posts/ knows exactly what it will find in comments/ and likes/: same files, same naming, same structure. It never has to infer where the hooks go or guess how the schema relates to the class. The pattern is the documentation. I initially considered making the structure configurable. I dropped that immediately. Every optional layout is a fork where the agent might go the wrong way. The rigidity isn’t a limitation. It’s the feature.
Convention density matters more than documentation: Humans like flexibility. Agents need rigidity. The more opinionated the scaffold, the fewer decisions the agent has to infer, the fewer mistakes it makes. Every optional pattern is a fork where the agent might go the wrong way. I started treating optionality as a liability rather than a feature. This runs directly against how most frameworks are designed, and it works.
The project has to describe itself: I added a describe command that outputs the entire project structure as JSON: every service, method, hook, and migration in a machine-readable format. Before describe, the agent guessed at the project’s shape by grepping. After describe, it knew. This is the equivalent of an API having good docs, except it’s for the codebase itself.
AGENTS.md is the new README. I wrote a conventions file specifically for LLMs, not documentation for humans, but instructions for agents. Naming patterns, where generated code goes, which hooks to attach by default, what not to do. It felt strange to write a document whose audience wasn’t a person.
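To make the self-describing project idea concrete, here is a minimal TypeScript sketch of what a machine-readable description could look like and how a tool might consume it. Everything named here is an assumption for illustration: the field names, the "<name>/<name>.<kind>.ts" file-naming scheme, and the helpers expectedFiles and findService are hypothetical, not the actual output or API of Feathers BaaS.

```typescript
// Hypothetical shape of a machine-readable project description.
// Field names are illustrative assumptions, not the real `describe` schema.
interface ServiceDescription {
  name: string;         // e.g. "posts"
  methods: string[];    // service methods exposed, e.g. ["find", "create"]
  hooks: string[];      // hook names attached to the service
  migrations: string[]; // migration identifiers touching this service
}

interface ProjectDescription {
  services: ServiceDescription[];
}

// The four file kinds every generated service contains, per the convention.
const FILE_KINDS = ["schema", "class", "hooks", "service"];

// Predict where a service's files live. The "<name>/<name>.<kind>.ts"
// naming scheme is an assumption made for this sketch.
function expectedFiles(serviceName: string): string[] {
  return FILE_KINDS.map((kind) => `${serviceName}/${serviceName}.${kind}.ts`);
}

// Look a service up by name instead of grepping the source tree.
function findService(
  project: ProjectDescription,
  name: string
): ServiceDescription | undefined {
  for (const service of project.services) {
    if (service.name === name) return service;
  }
  return undefined;
}

// Example description with made-up values.
const exampleProject: ProjectDescription = {
  services: [
    {
      name: "posts",
      methods: ["find", "get", "create"],
      hooks: ["authenticate"],
      migrations: ["create_posts"],
    },
    {
      name: "comments",
      methods: ["find", "create"],
      hooks: ["authenticate"],
      migrations: ["create_comments"],
    },
  ],
};
```

Because the layout is fixed, expectedFiles needs no lookup at all: the file list for a service that does not exist yet follows from its name alone, which is the property the rigid convention is buying.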
Every one of these discoveries points to the same conclusion: you don’t make agent output trustworthy by improving the model. You make it trustworthy by making the environment so constrained that there are fewer ways to be wrong. Predictability is not a model capability. It’s a design choice you make in the codebase, the scaffold, the conventions, the tooling. The agent is only as reliable as the patterns it has to follow.
Eval is the missing layer
But even in a maximally constrained environment, the agent will sometimes produce subtly wrong code. And this is where the entire toolchain falls apart, because we have no good way to evaluate agent-generated code at the level that actually matters.
We have unit tests. Unit tests verify behavior: does the endpoint return a 200 status code? Does the query filter correctly? They do not verify architecture: is this service structured the same way as every other service? Are the hooks applied in the right order? Does the migration match the schema? A service can pass every unit test and still be architecturally inconsistent with the rest of the project.
We also have code review. Code review catches architectural inconsistency, but it requires a human who understands the conventions, and it doesn’t scale. If the whole point of agent-driven development is that an LLM writes code faster than a human, gating every output on a human reviewer defeats the purpose. You’ve just replaced “human writes code slowly” with “machine writes code fast, human reviews code slowly.” The bottleneck moved; it didn’t disappear.
What’s missing is something I’d call structural evaluation: automated verification that generated code conforms to the project’s architectural patterns, not just its behavioral requirements. Does the new service follow the same file structure? Are the hooks consistent? Does the error handling pattern match? Is the naming convention respected? Does the migration use the same column types and constraints as equivalent fields in other services?
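A structural check of this kind can be sketched in a few lines of TypeScript. This is a minimal illustration under stated assumptions, not an existing tool: the ServiceShape and Violation types and the checkStructure function are hypothetical, the service data is passed in memory rather than read from disk, and hook lists are assumed to contain no duplicates.

```typescript
// Hypothetical in-memory view of one service, for illustration only.
interface ServiceShape {
  name: string;
  files: string[];       // file kinds present, e.g. ["schema", "class"]
  beforeHooks: string[]; // hook names in execution order, no duplicates
}

interface Violation {
  service: string;
  rule: string;
  detail: string;
}

// The four-file pattern described in the article.
const REQUIRED_FILES = ["schema", "class", "hooks", "service"];

// Compare a candidate service against a known-good reference service.
function checkStructure(
  candidate: ServiceShape,
  reference: ServiceShape
): Violation[] {
  const violations: Violation[] = [];

  // Rule 1: every required file kind must be present.
  for (const kind of REQUIRED_FILES) {
    if (candidate.files.indexOf(kind) === -1) {
      violations.push({ service: candidate.name, rule: "missing-file", detail: kind });
    }
  }

  // Rule 2: no file kinds outside the convention.
  for (const kind of candidate.files) {
    if (REQUIRED_FILES.indexOf(kind) === -1) {
      violations.push({ service: candidate.name, rule: "unexpected-file", detail: kind });
    }
  }

  // Rule 3: hooks shared with the reference must run in the same relative order.
  const candidateOrder = candidate.beforeHooks.filter(
    (hook) => reference.beforeHooks.indexOf(hook) !== -1
  );
  const referenceOrder = reference.beforeHooks.filter(
    (hook) => candidate.beforeHooks.indexOf(hook) !== -1
  );
  if (candidateOrder.join(",") !== referenceOrder.join(",")) {
    violations.push({
      service: candidate.name,
      rule: "hook-order",
      detail: `expected ${referenceOrder.join(" -> ")}, found ${candidateOrder.join(" -> ")}`,
    });
  }

  return violations;
}
```

None of these rules looks at behavior. A service that runs validation before authentication could still return a 200 status code in every unit test, yet it would be flagged here because it diverges from its siblings.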
None of this is conceptually hard. It’s pattern-matching against the project’s own conventions. But almost nobody is building tooling for it, because the entire industry’s attention is on making generation better with faster models, bigger context windows, and better prompts. The eval side is orphaned. The result is that we’re generating more code than ever and trusting it less than ever, and the gap between the two is widening every week.
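As one illustration of pattern-matching against the project’s own conventions, the TypeScript sketch below infers column-type conventions from existing migrations instead of hard-coding them, then flags a new migration that deviates. It is a rough sketch under assumptions: migrations are reduced to a hypothetical column-name-to-type map, the function names are invented for this example, and ties between equally common types go to whichever type was seen first.

```typescript
// Hypothetical simplification: one migration as column name -> column type.
type Columns = Record<string, string>;

// For each column name, find the type used most often across existing services.
function inferColumnConventions(existing: Columns[]): Record<string, string> {
  // Object.create(null) avoids clashes with inherited keys like "constructor".
  const counts: Record<string, Record<string, number>> = Object.create(null);
  for (const columns of existing) {
    for (const name of Object.keys(columns)) {
      const type = columns[name];
      if (counts[name] === undefined) counts[name] = Object.create(null);
      counts[name][type] = (counts[name][type] || 0) + 1;
    }
  }

  const conventions: Record<string, string> = Object.create(null);
  for (const name of Object.keys(counts)) {
    let best = "";
    let bestCount = 0;
    for (const type of Object.keys(counts[name])) {
      // Strict ">" means the first type seen wins a tie.
      if (counts[name][type] > bestCount) {
        best = type;
        bestCount = counts[name][type];
      }
    }
    conventions[name] = best;
  }
  return conventions;
}

// Report columns whose type differs from the inferred convention.
// Columns with no precedent in the project are not flagged.
function findColumnDeviations(
  candidate: Columns,
  conventions: Record<string, string>
): string[] {
  const deviations: string[] = [];
  for (const name of Object.keys(candidate)) {
    const expected = conventions[name];
    if (expected !== undefined && expected !== candidate[name]) {
      deviations.push(`${name}: expected ${expected}, found ${candidate[name]}`);
    }
  }
  return deviations;
}
```

Inferring the convention from the majority has a cost: if most existing services are already inconsistent, the check will enforce the inconsistency. That is one more argument for a rigid scaffold upstream of the eval step.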
The uncomfortable conclusion
Here’s what I now believe that I didn’t believe six months ago: the future of agent-driven development is not better models writing more creative code. It is more boring codebases that constrain the model into producing predictable output, combined with structural eval tooling that verifies the output matches the project’s conventions without requiring a human in the loop.
This is an uncomfortable conclusion if you’re a developer who takes pride in elegant abstractions, creative architecture, and expressive code. The codebase that is most legible to an agent is also the most repetitive, most convention-heavy, and least interesting to write by hand. Boring code is trustworthy code. That’s the tradeoff, and I don’t think the industry has reckoned with it yet.
I built Feathers BaaS as my answer to the predictability side of this problem. A scaffold rigid enough that an agent’s output is consistent by construction, not by luck. But the eval side is still wide open. Whoever builds the structural evaluation layer for agent-generated code, the tool that tells you this service is architecturally consistent with the rest of your project before it reaches a human reviewer, is building something the entire industry needs and nobody has shipped yet.
The generation problem is solved. The trust problem is the real work now.

