Evaluating New AI Models

by Justin Massa
Aug 14, 2025

 

In the past few weeks, Anthropic quietly released Claude Opus 4.1 and OpenAI grabbed headlines around the world with their long-awaited release of GPT-5. Based on early reactions, the vibes with both new models feel off.

But "vibes" is not the only way to evaluate a new AI model.

The pace of releases isn't slowing down. If anything, it's accelerating. In the first half of the year, we've seen roughly three new frontier models released each month from the leading US-based labs. You need a systematic approach to evaluate whether each new release deserves your attention, or if you should stick with your current choice.

Most businesses make one of two mistakes: either they ignore new models entirely and miss genuine improvements, or they chase every shiny release and waste time on incremental updates. Both approaches cost you money and momentum.

Here's how to evaluate new AI models efficiently and make smart decisions about when to switch.

Step 1: The Screener Test

Every time a new frontier model launches from OpenAI, Anthropic, or Google, spend 30 minutes with it. Not reading about it, not watching demo videos of other people using it, but actually using it yourself for tasks you'd normally handle with AI.

Use it for 2-3 routine tasks: drafting an email, analyzing some data, or creating content. This shows you how it handles your everyday work.

Once you've gotten a feel for this new model, it's time to ask what I think of as your "personal evaluation" question. I encourage you to develop your own signature question that you ask every new model. This gives you a consistent baseline for comparison in your area of deepest expertise.

I'm a bit of a frameworks dork, so my signature question combines three complex methodologies and asks the model to synthesize them into a consulting framework. Specifically, I ask every new model: "Combine the methodologies of Playing to Win as taught by Roger Martin, design thinking as practiced by IDEO, and cybernetics as expressed by Stafford Beer in The Brain of the Firm. Tell me how this combined methodology could be applied in a generative AI era."

Before MQC, I was a partner at IDEO. I've been formally trained in the PTW strategy methodology, and I'd welcome you to join me in going down the rabbit hole of Stafford Beer, a pre-Internet management theorist whose ideas hit particularly hard these days (the excellent Santiago Boys podcast is how I first learned about him; highly recommended).

I've discussed these methodologies and their intersection points with countless humans over beers, read the books, and have asked this question so many times of generative AI that I can quickly tell when it's re-treading a well-explored area or truly breaking new ground.

Your signature question should test something you know exceptionally well. Maybe it's:

  • A technical problem in your industry that requires nuanced understanding

  • A complex customer scenario with multiple variables

  • A strategic framework that combines several business concepts

  • A creative challenge that requires both analysis and innovation

Ethan Mollick asks models to show him an otter on a plane using wifi. Simon Willison asks every model to make him an SVG of a pelican on a bicycle - he recently delivered this great talk that highlights just how rapidly these models are advancing over the first 6 months of 2025. 

The question should be difficult enough that you can immediately spot shallow or incorrect answers, but not so specialized that you're testing obscure knowledge rather than reasoning capability. (Yes, Stafford Beer is obscure, but not too obscure; I've found that nearly every model can reasonably explain his theory of cybernetics.)

This combination of routine work and expert evaluation gives you a complete picture. The everyday tasks show you how the model handles typical business needs. Your signature question reveals whether it truly understands complex concepts in your domain of expertise. Together, these two approaches tell you whether the model deserves deeper evaluation in just a few hours - and, along the way, you got some work done and hopefully had a super interesting chat about a topic you love.

It's hard to predict what you'll find. Most of the time, you'll discover the new model is a lateral move or only a slight improvement. But sometimes, maybe once every few months, you get a response so good that you need to get up and walk away from your computer for a minute.

These are the moments you're looking for. When a whole new pathway of capabilities opens up. When you realize the model understood something about your question that previous models missed entirely. When it connects dots you didn't even know existed.

These breakthrough moments are why the screening process matters. You're not just comparing features, you're hunting for step-function improvements that could transform how you work.

Step 2: The Interview Process

If a model passes your screener, it's time for a more substantial evaluation. This is where you test whether the model can handle real work that matters to your business.

Choose a multi-step project that typically takes you 2-3 hours with your current AI setup. Something complex enough to reveal the model's capabilities but contained enough that you're not risking mission-critical work.

Use your existing jig instructions as a prompt library. Don't rebuild your jigs in the new model yet, just copy your custom instructions and knowledge base content as context for individual conversations.

For example, if you've built a Pricing Analyst jig, copy those instructions into a new conversation with the model you're testing, upload relevant customer data, and see how it performs compared to your established jig.
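
If you'd rather script this side-by-side check than juggle two browser tabs, here's a minimal sketch using the OpenAI and Anthropic Python SDKs. The model IDs, file path, and pricing task are placeholders, not recommendations; swap in whatever your jig and current setup actually use.

```python
# pip install openai anthropic
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set in your environment.
from openai import OpenAI
import anthropic

# Your existing jig instructions, copied verbatim (placeholder path).
with open("pricing_analyst_instructions.txt") as f:
    jig_instructions = f.read()

task = (
    "Here are three SKUs with cost and list price: A ($40/$60), "
    "B ($25/$35), C ($80/$150). Flag any priced below a 40% margin floor."
)

# Current model (example ID; use whatever you run today).
current = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": jig_instructions},
        {"role": "user", "content": task},
    ],
)

# Candidate model (example ID; check the provider's current model list).
candidate = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2000,
    system=jig_instructions,
    messages=[{"role": "user", "content": task}],
)

print("=== CURRENT ===\n" + current.choices[0].message.content)
print("=== CANDIDATE ===\n" + candidate.content[0].text)
```

Reading both outputs against the same instructions is the fastest way to spot the subtle differences described below.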

Test across different types of work:

  • Analysis: Can it identify patterns in your data that your current model misses?

  • Content Creation: Does it maintain your brand voice as effectively as your current setup?

  • Problem Solving: How does it handle complex, multi-variable business scenarios?

  • Integration: How well does it work with your existing tools and workflows?

Pay attention to subtle differences. Maybe the new model generates more creative solutions but makes more factual errors. Perhaps it's faster but less thorough. Or it might excel at analysis but struggle with content creation.

Document these observations. You're building a personal database of model capabilities that will inform future decisions.
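
One low-tech way to build that database is a running log with one row per test. A spreadsheet works fine; here's a minimal Python sketch of the same idea (the file name, columns, and example values are just suggestions):

```python
import csv
from datetime import date
from pathlib import Path

LOG = Path("model_evaluations.csv")
COLUMNS = ["date", "model", "task", "strengths", "weaknesses", "verdict"]

def log_observation(model, task, strengths, weaknesses, verdict):
    """Append one evaluation row, writing the header on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(COLUMNS)
        writer.writerow([date.today().isoformat(), model, task,
                         strengths, weaknesses, verdict])

# Example entry after an interview-stage test (all values illustrative):
log_observation(
    model="candidate-model-aug-2025",
    task="Pricing analysis on Q2 SKU list",
    strengths="Caught a margin issue my current setup missed",
    weaknesses="Two small arithmetic slips; needed fact-checking",
    verdict="Promote to stress test",
)
```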

Step 3: The Stress Test

If the model passes your interview, now it's time for the stress test. If it can handle your core use cases, how will it handle the edge cases?

Choose a high-stakes but recoverable project. For me, this means writing a client proposal from a transcript or creating an AI playbook, work where I have clear standards for quality and can immediately assess whether the output meets my requirements.

For your business, this might be:

  • Generating quotes for a real customer inquiry

  • Creating marketing content for an active campaign

  • Analyzing financial data for strategic planning

  • Developing training materials for your team

  • Building a client presentation for an upcoming meeting

The key is choosing something where you can clearly judge quality and where success or failure has real business impact. This isn't about perfection, it's about whether the model's capabilities justify switching from your current approach.

Set clear success criteria before starting:

  • Quality: Does the output meet your professional standards?

  • Efficiency: Does it save time compared to your current approach?

  • Accuracy: Are the results reliable enough to use without extensive fact-checking?

  • Integration: Does it work smoothly with your existing workflows and tools?

If the model fails this stress test, stick with your current setup. Better to use a proven tool than chase marginal improvements.

If it passes, you've found a genuine upgrade worth implementing.

Note: For many businesses, designing and deploying these "tests" is a great task for an intern. If you're building an AI-powered product or service, I'd recommend you investigate programmatic evaluation tools such as Galileo.
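
If you do go the programmatic route, the core idea is straightforward: a fixed set of prompts, each paired with a check you can run automatically against the model's output. The sketch below is a homegrown stand-in rather than any particular tool's API; ask_model is a placeholder for however you call the model under test, and the two checks are deliberately crude examples.

```python
# A minimal homegrown eval harness; dedicated tools add scoring, dashboards, etc.

def ask_model(prompt: str) -> str:
    # Placeholder: replace with a real call to the model you're testing.
    return "stub response"

# Each case pairs a prompt with a check that returns True if the output is acceptable.
EVAL_CASES = [
    {
        "name": "quote_includes_total",
        "prompt": "Draft a quote for 120 units at $14.50 each with a 5% volume discount.",
        # 120 * 14.50 * 0.95 = 1,653.00
        "check": lambda out: "1,653" in out or "1653" in out,
    },
    {
        "name": "refund_reply_is_concise",
        "prompt": "Reply to a customer asking for a refund two days past our 30-day window.",
        "check": lambda out: "refund" in out.lower() and len(out.split()) < 250,
    },
]

def run_evals() -> None:
    passed = 0
    for case in EVAL_CASES:
        output = ask_model(case["prompt"])
        ok = case["check"](output)
        passed += ok
        print(f"{case['name']}: {'PASS' if ok else 'FAIL'}")
    print(f"Pass rate: {passed / len(EVAL_CASES):.0%}")

if __name__ == "__main__":
    run_evals()
```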

Step 4: See Other Models

Don't fall in love with any model. I see businesses (and individuals) get emotionally attached to specific AI tools, defending their choices like sports team loyalties. My perspective is that this dynamic is driving much of the frustration with GPT-5; folks fell in love with the patterns and style of 4o and are now struggling with a new collaborator's style and approach.

However, the technology is evolving too rapidly for emotional attachments. Your favorite model today might be obsolete in six months. The model you dismissed last quarter might have addressed its weaknesses in the latest update.

Force yourself to regularly test alternatives. Set a quarterly reminder to evaluate your primary model against new releases. Ask your current AI to help identify its own limitations, then test whether newer models address those gaps.

Switch models deliberately when you have evidence of meaningful improvement, not just because something new launches. But also don't stick with familiar tools when better options exist. I compel myself to use all new frontier models for at least one project as they come out, and resist the (extreme) temptation to fall in love with a model. While I still miss that October '24 Claude Sonnet 3.5 update (still the best gen AI writer ever, IMHO), I've found things to love in (almost) every model I meet.

The goal isn't to find the perfect AI model, it's to continuously optimize your AI toolkit based on current capabilities and your evolving needs.

Beyond the Hype Cycle

Every new model release triggers a wave of breathless coverage about "revolutionary capabilities" and "game-changing improvements." Most of it is noise. What matters for your business isn't theoretical benchmarks or demo videos, it's whether the model helps you generate more revenue, reduce costs, or serve customers better in the specific context of how your business works.

The systematic evaluation approach outlined here cuts through marketing hype to focus on practical business impact. You're not trying to stay current with every AI development; you're identifying genuine improvements that justify the switching costs.

Always evaluate models in areas where you have deep expertise. This lets you immediately spot errors, identify sophisticated insights, and assess whether the model truly understands the domain or is just generating plausible-sounding responses. You can't evaluate a model's financial analysis if you're not strong with financial analysis yourself.

Your current AI setup probably works pretty well for most tasks. The question isn't whether new models are objectively better, it's whether they're better enough to warrant changing your established workflows.

In a world where AI capabilities are advancing weekly, the competitive advantage goes to businesses that can quickly separate meaningful improvements from incremental updates. The approach here gives you a framework for making those decisions systematically rather than reactively.

Your time is finite. New models are infinite. Choose wisely.

✨ ✌🏻 ✨


