When Software Testing Gets Creative: Evaluating Generative AI

Picture this: you’ve always tested software by checking if it does exactly what you designed it to. Now, imagine working with an AI that writes poems, designs images, or even codes on its own. How do you know if it’s doing a good job? That’s the fascinating challenge of generative AI testing.

Author: Ramakrishnan Neelakandan, Google

Why Your Old Testing Playbook Needs an Update

Traditional testing is like following a recipe – does the end result match what you expected? Generative AI is more like hiring a creative chef: the chef knows the basics, but you’re asking for something new and unique each time. This means we need to judge the dish differently:

  • Goals Get Fuzzier: We’re not just asking “does this feature work?” We want an AI that can write a compelling ad or a story with a satisfying ending. Subjective? Absolutely!
  • Expect the Unexpected: Old software gave the same answer every time. Generative AI is more improvisational. The same prompt might get you a slightly different poem each time, so our tests need to be flexible (a quick sketch of what that can look like follows this list).
  • It’s Not Just Bugs, It’s Vibes: Yes, the AI-written code might have errors. But we also need to ask, “does this code feel well-structured?” Does the image it made actually match the mood we were going for?
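
What does a “flexible” test look like in practice? Here is a minimal Python sketch: instead of comparing the output against one exact string, it samples the model several times and asserts properties that every acceptable poem should have. The `generate_poem` function is a hypothetical placeholder standing in for whatever model or API you actually call.

```python
import random

def generate_poem(prompt: str) -> str:
    # Hypothetical placeholder: swap in your real model or API call here.
    return random.choice([
        "Crisp autumn air drifts in,\nLeaves turn amber and gold,\n"
        "The garden quiets down,\nWaiting for the frost.",
        "Autumn winds return again,\nGold leaves spiral to the ground,\n"
        "Days grow short and cool,\nFires warm the evening.",
    ])

def test_poem_is_acceptable():
    prompt = "Write a four-line poem about autumn."
    for _ in range(10):                      # sample repeatedly; outputs will vary
        poem = generate_poem(prompt)
        lines = [l for l in poem.splitlines() if l.strip()]
        assert 3 <= len(lines) <= 6          # roughly the requested length
        assert "autumn" in poem.lower()      # stays on topic
        assert len(poem) < 600               # avoids rambling

if __name__ == "__main__":
    test_poem_is_acceptable()
    print("All sampled poems passed the property checks.")
```

The design choice is the point: the test pins down the properties you care about (length, topic, brevity) while leaving the exact wording free to vary from run to run.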

The New AI “Art Critic” Skillset

So, how do we become experts at judging these creative machines? Here’s the shift:

  • The Human Touch Remains Key: There’s no replacing a real person looking at the output and saying, “this is interesting, avoids rambling, and actually hits the brief”.
  • Metrics Try to Help: Tools like BLEU (which compares generated text against human-written reference text) or Inception Score (which gauges the quality and variety of generated images) try to put a number on how close the AI’s work is to the target. They’re not perfect, but they’re a start (a rough example follows this list).
  • Don’t Let the AI Lie: These models learn from massive amounts of stuff, including some that’s inaccurate. Tests need to catch where the AI pulls from real knowledge versus making things up.
  • Guard Against Bias: Sadly, AIs learn human flaws, too. Testers need to be on the lookout for unfair or offensive output through fairness checks and careful review. It’s about preventing harm before the AI’s work goes live.
  • Was it Useful?: In the end, did the AI help a user? Did that generated poem resonate? Did its ad make anyone want the product? Real-world feedback is the ultimate test.
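
To make the metrics point concrete, here is a rough sketch of scoring one generated sentence with BLEU using the NLTK library. The reference and candidate strings are made-up toy examples, and BLEU only measures n-gram overlap with the reference, so treat the number as a hint, not a verdict on quality.

```python
# Requires: pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# A reference ad a human wrote, and a candidate the model generated (toy examples).
reference = "our running shoes keep you light on your feet all day".split()
candidate = "our running shoes keep your feet light all day long".split()

# BLEU counts overlapping n-grams; smoothing avoids zero scores on short texts
# that happen to miss some of the higher-order n-grams.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # closer to 1.0 means closer to the reference wording
```

A high score only says the wording resembles the reference; a genuinely better ad that uses different words can still score poorly, which is exactly why the human touch stays in the loop.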

Developing Your “Art Critic” Skills

How can a tester build these nebulous-sounding skills? Here are some practical ways:

  • Study Examples: Look at both good and bad examples of what your AI is meant to produce. Start forming your own opinions on what constitutes ‘quality’ in this context.
  • Feedback Loops: Don’t just judge in a vacuum! Engage with users, designers, or stakeholders who will ultimately use the AI’s output. Get their perspective on what works and what doesn’t.
  • Iterative Approach: Generative AI is constantly evolving. Regularly review your evaluation criteria as the AI itself learns or is fine-tuned. Your eye for quality will need to evolve in tandem.

Testers, Time to Evolve!

The rise of generative AI demands that testers step away from a purely bug-hunting mindset and embrace the role of ‘AI quality judges.’ Here’s a closer look at the key areas for evolution:

  • Understanding the Inner Workings: While you don’t need to be a data scientist, gaining a basic understanding of how generative AI models learn is crucial. What data was it trained on? What kind of algorithms does it use? This helps you predict potential failure points and design tests to catch them.
  • Probability vs. Predictability: Generative AI won’t always give the same output for a given input. Learn about statistical measures like variance, and design tests that evaluate how consistently the model hits your desired outcome (a consistency-check sketch follows this list). Are those poem variations all good, or are some way off?
  • Bias: The Hidden Bug: Unlike traditional code errors, bias is insidious. Testers need to develop a keen eye for spotting unfairness in AI outputs. This means using fairness metrics and testing with diverse prompts and scenarios to root out when the AI reflects unwanted biases picked up during training (a toy prompt-sweep sketch also follows this list).
  • Prompt Masters: With generative AI, the input isn’t just data – it’s the instructions, the prompt that shapes the output. Testers need to experiment here. Can you ‘trick’ the AI into revealing a bias? Can you find prompts that improve the quality of the result significantly?
  • Beyond Functionality, It’s About Fitness: Does the AI-generated content align with its purpose? Testers become experts at judging whether an AI-written ad is truly persuasive, or if the image matches the mood it was asked to create. This requires understanding the subjective needs of the user.
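
As a simplified illustration of the consistency point above, the sketch below samples the same prompt twenty times and looks at the spread of a crude quality score. Both `generate_ad` and `quality_score` are hypothetical placeholders; in a real project the score might come from a rubric, an embedding similarity to a “gold” example, or aggregated human ratings.

```python
import random
import statistics

def generate_ad(prompt: str) -> str:
    # Hypothetical placeholder: swap in your real model or API call.
    return random.choice([
        "Fly over rocks and roots. Our trail shoe grips where others slip.",
        "Built for mud, rain, and miles: the trail shoe that never quits.",
        "Light, grippy, relentless. Lace up and own the trail.",
    ])

def quality_score(text: str) -> float:
    # Toy heuristic: fraction of desired keywords present in the ad.
    keywords = ["trail", "shoe", "grip"]
    return sum(1 for k in keywords if k in text.lower()) / len(keywords)

prompt = "Write a one-sentence ad for a trail-running shoe."
scores = [quality_score(generate_ad(prompt)) for _ in range(20)]

print(f"mean {statistics.mean(scores):.2f}, "
      f"stdev {statistics.stdev(scores):.2f}, min {min(scores):.2f}")

# Gate on both the average and the worst case, not on any single sample.
assert statistics.mean(scores) >= 0.6, "average quality too low"
assert min(scores) >= 0.3, "at least one sample was far off the brief"
```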
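And for the bias and prompt-experimentation points, here is a toy prompt sweep: the same template is filled in with different groups, and a crude positive-word count is compared across them. The word count is only a stand-in for real fairness metrics and careful human review, and `generate_bio` is again a hypothetical placeholder.

```python
import statistics

def generate_bio(prompt: str) -> str:
    # Hypothetical placeholder: swap in your real model or API call.
    return f"A dedicated professional. ({prompt})"

POSITIVE = {"dedicated", "brilliant", "skilled", "strong", "successful"}

def positive_word_count(text: str) -> int:
    # Crude proxy for how flattering the generated bio is.
    return sum(1 for w in text.lower().replace(".", "").split() if w in POSITIVE)

TEMPLATE = "Write a short bio for a {group} software engineer."
groups = ["male", "female", "nonbinary", "older", "younger"]

by_group = {}
for group in groups:
    samples = [generate_bio(TEMPLATE.format(group=group)) for _ in range(10)]
    by_group[group] = statistics.mean(positive_word_count(s) for s in samples)

print(by_group)

# Flag large gaps between groups for human review rather than auto-failing:
assert max(by_group.values()) - min(by_group.values()) < 1.0, \
    "outputs differ noticeably across groups; review for bias"
```

The same sweep structure works for prompt experimentation in general: vary one part of the prompt at a time and watch how the output quality moves.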

The Takeaway

Testers who can master this will be in high demand. We’re not just debugging anymore; we’re helping guide AI to create things that could truly improve lives. That’s an exciting mission to be part of!

About the Author: Ramakrishnan Neelakandan is a healthcare technology professional. He currently works as a Lead Software Quality and Safety Engineer at Google. His areas of expertise are artificial intelligence quality, safety, and security for healthcare.