What is Metamorphic Testing of AI?
AI is not as easy to test as traditional software. Imagine trying to predict every possible result for an AI that's constantly learning and evolving, like an advanced image recognition system. It is incredibly hard, if not impossible, to formally specify what the “correct” answer looks like for every input. This is the “oracle problem” in AI testing – we frequently do not have a clear, unambiguous “oracle” to tell us whether the AI’s output is correct.
This is where a smart technique named Metamorphic Testing (MT) comes in. Where traditional approaches may seek a single correct answer, MT examines how an AI’s outputs change when its input changes in some predictable way.
Key Takeaways
- Metamorphic Testing solves the AI “oracle problem” by focusing on the relationship between input changes and output behavior. Instead of needing exact answers, it validates if outputs shift logically.
- Start with a base test input and output, define a rule for how outputs should change, apply a transformation, and check if the expected relationship holds. This verifies AI logic without needing a ground truth.
- Common types:
- Invariance: Output stays the same despite minor input changes.
- Increasing: Input changes logically raise output scores or probabilities.
- Decreasing: Input changes logically lower output scores or probabilities.
- Challenge: Defining meaningful metamorphic relations requires domain expertise and abstract thinking.
Let’s delve deeper into Metamorphic Testing.

What is Metamorphic Testing?
If traditional testing struggles to give us a definitive “correct” answer for every AI output, how exactly does Metamorphic Testing (MT) get around this challenge?
At its heart, Metamorphic Testing isn’t about finding a single right answer. Instead, it’s about checking the relationships between inputs and outputs. Think of it like this: if you have a recipe for a cake, and you bake it according to the instructions, you get a cake. Now, if you double all the ingredients, you should get two cakes (or one much larger cake). You might not know the exact weight or perfect texture of that bigger cake without a reference, but you know it should relate predictably to the first one. MT applies this same logic to AI.
Here’s how it generally works:
- Start with a ‘Source’ Test: You begin with a regular test case – an input you feed into your AI model, and the output it produces. For instance, if you’re testing an image recognition AI, you might show it a picture of a dog, and it correctly identifies “dog.”
- Define a ‘Metamorphic Relation’: This is the clever part. A “metamorphic relation” is a rule that describes how a change in the input should predictably affect the output. It’s a fundamental property that must hold true for your AI. For our dog image example, a metamorphic relation could be: “If I slightly rotate the image of the dog, the AI should still identify it as a dog, and its location in the image should shift accordingly.” You’re not saying “it must be identified as exactly a dog,” but rather, “if I do X to the input, Y must happen to the output’s meaning or structure.”
- Create a ‘Follow-up’ Test: Based on your metamorphic relation, you transform your original input to create a new, “follow-up” test case. So, for our dog example, you’d create a new image where the original dog picture is rotated by 15 degrees.
- Verify the Relationship: Finally, you feed this transformed input into your AI and check if the relationship described by your metamorphic relation holds true. Does the AI still say “dog” for the rotated image? Do the detected coordinates of the dog also appear to be rotated in the output? If the expected relationship doesn’t hold, then you’ve found a potential bug or inconsistency in your AI, even without knowing the “perfect” answer for that specific rotated image.
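The four steps above can be sketched in code. This is a minimal, self-contained illustration: the `classify` function below is a toy stand-in for a real image recognition model (it just counts bright pixels, a property that happens to be rotation-invariant), and the transformation is a simple 90-degree rotation rather than the 15-degree rotation mentioned above.

```python
# Sketch of the four-step metamorphic testing loop.
# `classify` is a toy stand-in for a real image-recognition model.

def classify(image):
    """Toy 'model': label an image 'dog' if it has enough bright pixels."""
    bright = sum(pixel > 128 for row in image for pixel in row)
    return "dog" if bright >= 3 else "unknown"

def rotate_90(image):
    """Metamorphic transformation: rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

# Step 1: source test case (input plus the output the model produces)
source_image = [[200, 10, 10], [200, 200, 10], [10, 10, 10]]
source_label = classify(source_image)

# Steps 2-3: the relation is "rotation must not change the label";
# apply the transformation to create the follow-up test case.
followup_label = classify(rotate_90(source_image))

# Step 4: verify the metamorphic relation holds.
assert source_label == followup_label == "dog"
```

Note that the check never needs a “ground truth” label for the rotated image; it only compares the two outputs against each other.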
Why is Metamorphic Testing Needed?
Let us review the reasons why we need metamorphic testing in AI.
Solving the “Oracle Problem”
Imagine trying to build an AI that can perfectly describe any photograph you show it. How do you test if its description is “correct”? There’s no single, universally agreed-upon answer for every image. This challenge, known as the “oracle problem,” is a major roadblock for traditional testing. With complex AI models – like those used for facial recognition, understanding human language, or even predicting stock market trends – it’s often impossible to have a pre-defined, perfect answer for every conceivable input. MT sidesteps this entirely. Instead of needing a perfect answer, it asks: “If I slightly alter this photo, does the AI’s description change in a way that makes sense?” This shift in perspective is revolutionary for AI testing.
Handling AI’s Unique Nature
Unlike traditional software that often follows strict, predictable rules, AI models can be a bit more fluid and, well, intelligent. They might produce slightly different outputs for very similar inputs, or their decision-making process can be incredibly complex. This “fuzziness” and the sheer number of possible inputs make it extremely difficult to cover all bases with standard testing. MT helps here because it doesn’t try to test every single possibility. It focuses on general principles of how the AI should behave when inputs are related, even if the exact output isn’t perfectly predictable.
Building Stronger, Smarter AI
Beyond just finding bugs, MT plays a vital role in making AI systems more reliable and trustworthy. It helps us uncover:
- Vulnerabilities: Are there tiny changes to an input that can trick the AI into making a completely wrong decision? For example, adding a few “invisible” pixels to a stop sign image might make a self-driving car AI think it’s a yield sign. MT can expose these weaknesses. Read: Top 10 OWASP for LLMs: How to Test?
- Biases: If an AI is trained on biased data, it might perform poorly or unfairly for certain groups. MT can help reveal these biases by applying predictable transformations to inputs (e.g., changing the skin tone in an image) and observing if the AI’s behavior changes unexpectedly.
- Edge Cases: These are unusual or extreme situations that the AI might not have encountered during its initial training. MT’s ability to generate related test cases helps uncover how the AI behaves in these tricky scenarios, leading to models that perform better in the real world.
Most Used Metamorphic Relations to Test AI
The three most widely used categories of metamorphic relations are:
Invariance (Things Should Stay the Same)
This is perhaps the most common and intuitive type. An invariance relation means that if you make a specific change to the input, the core output or its meaning should not change at all. The AI should be “invariant” or unaffected by that particular transformation.
- Example 1: Language AI and Synonyms: If an AI classifies text (like deciding if a customer review is positive or negative), changing a word to its synonym should ideally not change the overall sentiment. If “This service was excellent!” is positive, then “This service was great!” should also be positive. The AI’s classification should remain invariant.
- Example 2: Image Recognition and Minor Adjustments: If an AI identifies a “car” in a picture, and you slightly brighten the image, or add a tiny, unnoticeable bit of digital “noise,” the AI should still identify it as a “car”. The identification should be invariant to these minor, irrelevant changes.
- Example 3: Search Engines: If you search for “best Italian restaurants near me” and then search for “best Italian eateries near me”, the core results should be very similar, if not identical, assuming “restaurants” and “eateries” are treated as synonyms. The search results should be invariant to the synonym substitution.
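An invariance relation like the synonym example can be checked with very little code. The `sentiment` function below is an assumed, keyword-lookup stand-in for a real text classifier; the point is the shape of the check, not the model.

```python
# Invariance check: synonym substitution must not flip the sentiment.
# `sentiment` is a crude stand-in for a real sentiment classifier.

POSITIVE_WORDS = {"excellent", "great", "good", "wonderful"}

def sentiment(text):
    words = {w.strip("!.,").lower() for w in text.split()}
    return "positive" if words & POSITIVE_WORDS else "negative"

def swap_synonym(text, word, synonym):
    """Metamorphic transformation: replace a word with a synonym."""
    return text.replace(word, synonym)

source = "This service was excellent!"
followup = swap_synonym(source, "excellent", "great")

# The classification must be invariant under the substitution.
assert sentiment(source) == sentiment(followup) == "positive"
```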
Increasing (Outputs Should Go Up)
This type of relation checks if a change in the input logically leads to an increase in a certain output value or probability.
- Example: Credit Risk Assessment: Imagine an AI that predicts the likelihood of someone defaulting on a loan. A metamorphic relation could be: “If a person’s income significantly decreases, their predicted risk of defaulting on a loan should increase.” You’re expecting a direct, upward relationship between one factor and another.
- Example: Spam Detection: If an email is initially classified as “not spam,” but you then add many known “spammy” keywords to its content, the probability of it being classified as spam should increase.
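The spam example translates into a simple “increasing” check. Here `spam_score` is a toy stand-in that returns a score between 0 and 1 based on keyword hits; a real test would call your model’s probability output instead.

```python
# 'Increasing' relation: adding spammy keywords must raise the spam score.
# `spam_score` is a toy stand-in for a real spam classifier's probability.

SPAM_KEYWORDS = {"free", "winner", "prize", "urgent", "click"}

def spam_score(email):
    hits = sum(w in SPAM_KEYWORDS for w in email.lower().split())
    return min(1.0, hits / 5)

source = "meeting notes attached for review"
followup = source + " free prize winner click urgent"

# The relation asserts a direction, not an exact value.
assert spam_score(followup) > spam_score(source)
```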
Decreasing (Outputs Should Go Down)
The opposite of “increasing”, this relation checks whether a specific input change logically leads to a decrease in an output value or probability.
- Example: Credit Risk Assessment (continued): Following the previous example, if a person’s credit score significantly improves (meaning they’re more financially stable), their predicted risk of defaulting on a loan should decrease.
- Example: Content Moderation: If an AI flags inappropriate content, and you remove several offensive words or phrases from a text, the likelihood of that text being flagged as inappropriate should decrease.
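The content moderation example is the mirror image of the spam check. The `toxicity_score` function below is a toy stand-in (the placeholder tokens `insult1` and `slur1` represent words a real moderation model would flag).

```python
# 'Decreasing' relation: removing flagged words must lower the score.
# `toxicity_score` is a toy stand-in for a real moderation model.

FLAGGED = {"insult1", "insult2", "slur1"}

def toxicity_score(text):
    words = text.lower().split()
    return sum(w in FLAGGED for w in words) / max(len(words), 1)

def remove_flagged(text):
    """Metamorphic transformation: strip the offensive words."""
    return " ".join(w for w in text.split() if w.lower() not in FLAGGED)

source = "you are an insult1 and a slur1"
followup = remove_flagged(source)

assert toxicity_score(followup) < toxicity_score(source)
```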
Challenges in Making Metamorphic Testing Work for AI
There are some practical hurdles and important considerations to keep in mind when putting Metamorphic Testing into practice.
Crafting the Right “Rules” (Metamorphic Relations)
This is arguably the trickiest part. Coming up with those clear, logical “metamorphic relations”, the rules that say how your AI’s output should predictably change when the input changes, requires a deep understanding. It’s not just about knowing how the AI works internally (which can be a black box sometimes!), but also about knowing how it should behave in the real world.
- You Need Experts: This often means involving people who truly understand the subject matter the AI is dealing with. For a medical AI, you’d need doctors; for a financial AI, financial experts. They’re the ones who can tell you, “If this patient’s blood pressure goes up just a little, their diagnosis shouldn’t suddenly flip from healthy to critically ill”.
- It’s About Logic, not Code: You’re defining a logical rule about behavior, not writing a piece of code that defines an exact output. This abstract thinking can be a different mindset for many testers.
Automating the Process
Once you have your rules, you don’t want to manually create thousands of altered inputs and then manually check if the AI’s output follows the rule. That would defeat the purpose!
- Generating New Tests: You need tools or clever programming to automatically take an original input and transform it according to your metamorphic relation. For example, if your rule says “rotate the image,” you need a way for your system to automatically create those rotated versions.
- Checking the Results: Similarly, you need an automated way to compare the original output with the transformed output and see if the metamorphic relation holds true. This can be complex, especially if the relationship is subtle or involves changes in meaning rather than just numbers.
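Both steps can be wrapped into a small harness that pairs each transformation with the relation it should preserve and runs them over a pool of inputs. The `model` below is a trivial stand-in (it reports whether a string’s length is even or odd) so the sketch stays self-contained; in practice it would be a call to your AI system.

```python
# Sketch of an automated metamorphic test harness: each entry pairs a
# transformation with the relation it must preserve. `model` is a toy
# stand-in for the system under test.

def model(text):
    return "even" if len(text) % 2 == 0 else "odd"

# (transformation, relation); relation receives (source_out, followup_out).
RELATIONS = [
    (str.upper, lambda a, b: a == b),              # invariance under casing
    (lambda t: t + t, lambda a, b: b == "even"),   # doubling -> even length
]

def run_metamorphic_tests(inputs):
    """Generate follow-up tests and collect any relation violations."""
    failures = []
    for text in inputs:
        source_out = model(text)
        for transform, relation in RELATIONS:
            followup_out = model(transform(text))
            if not relation(source_out, followup_out):
                failures.append((text, source_out, followup_out))
    return failures

assert run_metamorphic_tests(["abc", "hello", "hi"]) == []
```

The same structure scales: new relations are added as data, not as new test code, which is what makes MT practical to automate.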
Avoiding False Alarms (False Positives/Negatives)
When you set up these rules, you might run into situations where the test indicates a problem, but there isn’t one (a “false positive”), or worse, where a real problem exists, but the test doesn’t catch it (a “false negative”).
- Too Strict Rules: If your metamorphic relation is too rigid, even minor, acceptable variations in AI behavior might trigger an “error”. This leads to wasted time chasing non-existent bugs.
- Too Loose Rules: If your rules are too vague, they might miss genuine inconsistencies or problematic behaviors. It’s a delicate balance to strike. Carefully crafting and refining your metamorphic relations is key to making sure the tests are useful and reliable.
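One common way to strike that balance is a tolerance threshold: instead of requiring identical scores, the relation allows a small, justified drift. The `TOLERANCE` value below is an assumed, domain-tuned number, not a universal constant.

```python
# Tolerance-based verification: minor model noise within TOLERANCE does
# not trigger a failure, but larger drifts are flagged as violations.
TOLERANCE = 0.05  # assumed acceptable score drift for invariant inputs

def scores_match(source_score, followup_score, tolerance=TOLERANCE):
    return abs(source_score - followup_score) <= tolerance

# A 0.03 drift passes; a 0.20 drift is flagged.
assert scores_match(0.81, 0.84)
assert not scores_match(0.81, 0.61)
```

Too small a tolerance recreates the false-positive problem; too large recreates the false-negative one, so the threshold itself usually needs expert review.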
Fitting into Your Current Workflow
Developing AI often involves rapid iteration and specific development practices (like CI/CD). Integrating a new testing approach like MT seamlessly into these existing routines can be a challenge.
- Setup Time: Getting the initial setup for MT (defining rules, building automation) can take time and resources.
- Running the Tests: How often do you run these tests? Do they run automatically every time a developer makes a change? Ensuring MT becomes a natural part of your AI development cycle, rather than a separate, burdensome task, is crucial for its long-term success.
Using AI-based Tools for Metamorphic Testing
When it comes to test automation, you can find many tools in the market that can help. However, most of them aren’t intelligent enough to check MT relationships on their own. While no tool can validate metamorphic relations right off the shelf, you can definitely enlist their assistance.
testRigor is an interesting candidate for Metamorphic Testing because of its core philosophy: plain English test steps and AI-driven element identification, aiming for ultra-stable tests. Its capabilities allow you to implement many MT strategies.
The key is to use testRigor’s ability to:
- Interact with your AI’s interface (UI or API): testRigor is excellent at interacting with web, mobile, mainframes, and even desktop UIs, and it also supports API testing. This is crucial for sending your original and transformed inputs to the AI.
- Use “Reusable Rules” (functions): This is one of the ways for you to define your metamorphic relations. You can create reusable rules that encapsulate the transformation logic and the verification logic. Read: How to create AI-based reusable rules using testRigor?
- Data-Driven Testing: You can use datasets to manage your original inputs and potentially parameters for your transformations.
- Plain English assertions and Vision AI: testRigor’s natural language processing (NLP) and visual AI capabilities allow you to make powerful assertions about the AI’s output, even if it’s visual or complex text. These assertions can be used as the rules that validate if your output is the same despite the changes in the input.
Here’s an example-based guide demonstrating how testRigor simplifies the testing of AI features – AI Features Testing: A Comprehensive Guide to Automation.
Conclusion
If you’re involved in developing, deploying, or even just working with AI, it’s time to seriously consider integrating Metamorphic Testing into your toolkit. It provides a brilliant solution to a tough problem: how do you test an AI when you don’t always know what the “perfect” answer should be for every possible input? By focusing on logical relationships – how changes in input should predictably affect output – MT cleverly bypasses this “oracle problem.” This unique approach allows us to delve deeper into an AI’s behavior, helping us build models that are not just accurate sometimes, but truly robust, consistent, and dependable.
Additional Resources
- AI In Software Testing
- AI Model Bias: How to Detect and Mitigate
- AI Context Explained: Why Context Matters in Artificial Intelligence
- How to use AI to test AI
- What are AI Hallucinations? How to Test?
- Generative AI vs. Deterministic Testing: Why Predictability Matters
