My Story
The first time I saw software fail, it was obvious. A crash. A frozen screen. An error message nobody could understand.
That's how systems used to break. Loudly.
I spent over a decade at Microsoft learning to find failures before users did. Exploratory edge-case analysis on Movie Maker—frame drops, encoding errors, graphics card incompatibilities. Security work on Office—thousands of fuzzing tests, hunting buffer overflows in code nobody had looked at in years. Infrastructure on Xbox—making sure backward-compatible games made it through the pipeline without breaking.
The work was unglamorous. Repetitive. Often invisible.
But it mattered. Every bug found was a user who wouldn't lose their work. A system that wouldn't crash at the wrong moment. Trust, quietly earned.
The question was always the same: How could this fail?
Ask it enough times, and you start to see the world differently. You notice the seams. The assumptions. The places where confidence outpaces evidence.
Then AI arrived.
I started using ChatGPT and did what I've always done—tried to break it.
But nothing broke. No crashes. No error messages. Just fluent, confident answers. Some right. Some wrong. Some fabrications. Some reasoning errors. Some failures so subtle they slipped past entirely. All delivered with the same certainty.
I remember one moment clearly. I asked a model to summarize a technical concept I knew well. The response was articulate, structured, confident. It was also subtly wrong—not in a way that screamed "error," but in a way that would have quietly misled anyone who didn't already know the answer. That's when I realized: the danger isn't the obvious failures. It's the ones that sound right.
That's when something shifted.
Traditional systems fail loudly. They tell you when something goes wrong. You can catch the failure, trace it, fix it.
AI fails quietly.
It doesn't know it's wrong. It can't tell you. The response looks the same whether the answer is accurate or invented. The failure isn't in the system. It's in us, humans—in the moment we trust without verifying.
This is a different kind of problem. And it requires a different kind of craft.
This is the work I do today.
The old tools don't quite fit.
You can't write a test that expects the same output twice—the system is non-deterministic. You can't always define "correct" because for many tasks, there is no ground truth. Benchmarks that look impressive often fail to predict real-world behavior. Metrics capture what's easy to measure, not what matters.
But that doesn't mean evaluation is impossible. It means we evaluate differently. We can't expect the same string twice, but we can evaluate intent. Safety boundaries. Behavioral consistency. Whether the system stays within the guardrails we've defined—even when its words vary.
And we have to understand the failure modes.
Sycophancy—telling users what they want to hear. Instruction drift—losing track of what was asked. Context confusion—mixing details across turns. Confident fabrication—inventing facts that don't exist.
These aren't bugs you can trace to a line of code. They're behaviors that emerge from systems we don't fully understand.
But the instincts still apply.
Think adversarially. Ask how things could go wrong. Look where others don't. Document what you find. Stay humble—the system will always surprise you.
Evaluation isn't about catching AI being wrong. It's about understanding how it behaves. Building frameworks that reveal patterns. Earning trust through rigor, not assumption.
More recently, my work has focused on mapping governance requirements to testable criteria—turning policy into practice. Understanding when LLMs can evaluate other LLMs, and when they can't. Building systematic approaches for systems that refuse to behave the same way twice.
The craft of quality engineering doesn't disappear in the age of AI. It becomes more important. More nuanced. More human.
The field is new. The "best practices" are being written in real-time.
I'm not interested in claiming expertise. I'm interested in building the rigor that creates it.
I've spent a long time learning how to sit with a problem, staring at the screen until something useful emerges. I used to say to my co-workers, "I need to stare at this longer." Sometimes hours. Sometimes days. But funny thing—at the end, something useful always emerges. Always.
That's the work now. Pause. Stare at the answers. Document the failures. Look for patterns. Build the evaluation before you build the system.
Understand how AI fails—not to fear it, but to build systems worthy of trust.
The question hasn't changed: How could this fail?
The difference is, the system won't tell you when it does.
That's why I'm still here. Staring. Asking. Waiting for something useful to emerge.