In this blog, we dive into three approaches we used to improve UI testing, making it faster, smarter, and more reliable:
- We combined Computer Vision and Multimodal LLMs to allow AI to interpret both images and text, making tests more intuitive and accurate.
- Integrated Gemini API to utilise its vision models, which allowed us to generate quicker and more precise outputs.
- We experimented with Llama 3.2-Vision, a new vision model, to boost the accuracy of our tests and minimise errors.
Why did we take this approach?
Traditional UI testing methods can be slow, and they often miss the finer details, especially when dealing with complex interfaces. We needed something more dynamic, something that could scale with today’s apps. That’s where these approaches come in. They allowed us to automate and improve testing, reducing human error while adding more depth to our analysis.
In this post, we’ll walk you through how we set up each of these methods, what worked, what didn’t, and how you can apply them to your own projects. We’ll also share real-world examples and include the code so you can jump right into it.
The complete source code for all the approaches is available on GitHub (Approach 1 & 3) and Hugging Face Spaces (Approach 2).
Why UI Testing Matters More Than Ever
Pause for a moment and take a good look at the two screenshots below:


Which one would you prefer? At first glance, it might seem obvious, but ask yourself what exactly makes the newer one so much more appealing? Is it the clean design or perhaps the subtle way it guides your eye?
This is the power of great UI. 
It’s not just about looking moder, it’s about feeling right. If a simple search engine can evolve so dramatically, what does that say about the complex apps we use every day? 
You build an app, test the UI, and hope everything works perfectly. But the reality? It’s not always that simple. UI testing can be time-consuming, error-prone, and difficult to scale, especially as apps get more complex. From the most basic mobile app to a sophisticated web platform, the way users interact with your product can make or break its success. UI testing ensures that your interface not only works but also offers a seamless experience, making it essential for every software product.
The Fundamentals of UI Testing
UI testing ensures every element, from buttons to menus to text fields works as intended. But it’s not just about catching bugs, it's about creating an experience that is intuitive and frustration free for users.
What UI Testing Involves?
Here’s what typical UI testing looks at,
- Making sure buttons, menus, and forms work properly
- Checking that text is easy to read and looks nice
- Verifying that the app responds well to user actions
- Ensuring the layout adjusts correctly on different screen sizes
Without thorough UI testing, even the most innovative app can frustrate users to the point of abandonment.
There are different types of UI testing, each designed to make sure the user interface works smoothly and feels right for the user. Major types are,
- Manual UI Testing: This is when testers explore the UI themselves, checking for bugs, usability, and how responsive everything feels.
- Automated UI Testing: Makes use of automated tools to simulate user actions.
- Visual Testing: As the name speaks for itself, it focuses on how the UI looks making sure layouts, colours, and fonts all appear correctly.
- Functional UI Testing: Checks that every interactive element, like buttons and forms, works properly and takes you where you are supposed to go.
The Challenges of Traditional UI Testing
While traditional UI testing is crucial, it comes with its own set of challenges:
- Manual testing often takes significant time and resources.
- Even the most diligent testers can overlook issues.
- As applications grow, maintaining effective UI tests becomes increasingly difficult.
- Modern interfaces are increasingly complex, making comprehensive testing a significant challenge
- Ensuring cross-device compatibility adds another layer of complexity
The Role of Multimodal AI in UI Testing
To address the challenge of automating UI testing through multimodal approaches, two distinct solution strategies were implemented.
- Computer Vision + Multi-modal LLMs:The first strategy was a hybrid solution leveraging a combination of computer vision and large language models (LLMs), but it was specifically designed for images.”
- Gemini API Integration:The second strategy was a direct integration with the Gemini API, which works well with both videos and images.
- Inference with Llama 3.2-Vision:
- The third approach makes use of the latest Llama 3.2 vision models which are excellent at vision tasks.
All three methods make testing easier for developers. They just need to upload a screenshot and mention the UI element they want to test. The system then automatically creates the relevant test cases.
Implementing Multimodal AI for UI Testing
Approach 1: Computer Vision + Multi-modal LLMs

The first approach uses a general-purpose, open-source Multimodal Large Language Model (MLLM), OpenGVLab/InternVL2-8B to analyse the screenshot and generate valid testcases. However, to significantly improve the accuracy and efficiency of the solution, I incorporated computer vision (CV) techniques for annotating the UI elements within the images.Here's how it works:
- A computer vision model detects and highlights UI components in the screenshot by drawing bounding boxes around them. This preprocessing step helps the language model better understand the specific interface context.
- The annotated image, along with optional user-provided text, is passed to a multimodal LLM called OpenGVLab/InternVL2-8B. By guiding the model with both visual cues and textual information, we observed a remarkable 49% improvement in the accuracy and quality of the generated test cases.
Computer Vision for UI Detection
The computer vision model used for detecting UI components was trained on 578 interface images and can recognize the following element classes:
- Checkbox
- ComboBox
- Radio
- TextArea
- TextField
- Toggle
- Button
The key performance metrics of the CV model are as follows:
- Mean Average Precision (mAP): 99.5% – This metric reflects the average precision across all element categories.
- Precision: 95.1% – The accuracy of the positive detections.
- Recall: 87.7% – The ability of the model to correctly detect all relevant UI elements.
You can view the model details and access it here.
Mind you, now we are not just passing any screenshot, but instead we're feeding the model an image with its UI elements annotated. This provides the language model with a detailed visual context, enhancing its understanding of the interface and leading to more accurate and relevant test case generation.
To illustrate this, let's take a look at a real-world example. Consider the Apple Music Dashboard screenshot below,

After processing this image with the computer vision model, the output looks like this,

The model expertly identifies and annotates all the UI elements, providing a detailed visual context that enhances the language model's understanding of the interface.
For example, if we ask the model to test the 'Home' button, the generated output would look something like this,
With this approach, we observed a 49% improvement in the overall accuracy and quality of the output. This process ensures that the model is guided by both visual cues and textual context, leading to more precise and relevant test case generation.
Among the numerous open-source multimodal LLMs available, the one that stood out was OpenGVLab/InternVL2-8B. This model integrates visual and linguistic capabilities, enabling it to generate rich test cases based on the input UI data and annotated image components. It effectively handles the dual input (image and text), making it ideal for UI testing scenarios.
This approach represents a significant step forward in automating and enhancing the UI testing process for developers.
- Vision part - **InternViT-300M-448px** is an optimized vision foundation model, designed for tasks like feature extraction, especially from complex images. It is a smaller, more efficient version of the InternViT-6B-448px, known for its robust OCR capabilities and dynamic handling of high-resolution images.
- Language part - **internlm2_5-7b-chat** is an advanced 7-billion parameter model designed to excel in practical applications, particularly in reasoning tasks and tool utilization. It surpasses competitors like Llama3 and Gemma2-9B in reasoning benchmarks, such as Math and General Knowledge tests.
The code was executed on Google Colab, with Gradio integrated to provide a user-friendly interface for processing the screenshots. You can find it on GitHub here.
Challenges with this Approach
- A key limitation of this method was that it only processed images, requiring developers to pass a screenshot for every UI element. This made the process quite tedious.
- The multimodal LLM occasionally misidentified or hallucinated non-existent UI elements despite the computer vision preprocessing, leading to inaccurate test case generation.
- Preprocessing images with computer vision and then passing them to the LLM significantly increased inference times, hindering efficiency when dealing with numerous UI screenshots.
- Employing 4-bit quantization to optimize the 8B model's performance reduced its size and computational requirements but noticeably decreased accuracy, compromising the reliability of generated test cases.
Recognising these limitations, we decided to implement a more streamlined and effective approach, which led us to the Gemini API integration.
Approach 2: With Gemini API

The second approach we explored involved a direct integration with the Gemini API. One of the key advantages of this approach is its ability to accept videos, unlike the first approach, which only accepts images.
The Gemini API is capable of running inference on images, series of images, or videos. When provided with such content, Gemini can perform a variety of tasks, including:
- Describing or answering questions about the content
- Summarising the content
- Extrapolating from the content
You can learn more about the Gemini API and how to get started with your own API keys here.
For our project, we utilised the gemini-1.5-pro-latest model. This model, along with Gemini Flash, supports up to approximately an hour of video data. The video must be in one of the following formats:
mp4, mpeg, mov, avi, x-flv, mpg, webm, wmv, or 3gpp.
Google offers a free tier for the Gemini API, which is perfect for testing purposes. This free tier offers lower rate limits, with 15 requests per minute. You can find more information about this here.
In our demo, we processed the user-uploaded videos at 30 frames per second (FPS) for faster inference. However, you can also choose to send a limited number of extracted frames from the uploaded video or pass every nth frame. For example, you can modify the following code to pass every 10th frame to the model:
The response from the model is then mapped with a well-structured prompt that defines the task to be performed. This ensures a quicker response.
Using the Gradio interface to provide a user-friendly UI for easy operation, this approach,offers a powerful and flexible solution for our project.
To illustrate the output of the Gemini API, let's consider the same Apple Music Dashboard screenshot used in the first approach(can pass images as well as videos here).When we pass this image to the Gemini API and ask it to generate test cases for the 'Home' button, the output would look something like this,

To showcase our work, we have created a Hugging Face Space for this project. You can find it running here,
Hugging face Spaces - https://huggingface.co/spaces/mavihsrr/UI-Test_Case_Generator-MLLM
Feel free to test it out on your own dataset and let us know how it goes.
Approach 3 : Inference with Llama 3.2-Vision

The third approach we explored makes use of the latest Llama 3.2-Vision models, which excel at various vision tasks. Developed by Meta, the Llama 3.2-Vision collection consists of pre-trained and instruction-tuned image reasoning generative models in 11B and 90B sizes, capable of processing both text and image inputs to generate text outputs.
Model Architecture and Training
Llama 3.2-Vision is built upon the Llama 3.1 text-only model, an auto-regressive language model that utilizes an optimized transformer architecture. The tuned versions employ supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
To support image recognition tasks, Llama 3.2-Vision incorporates a separately trained vision adapter that integrates with the pre-trained Llama 3.1 language model. This adapter consists of a series of cross-attention layers that feed image encoder representations into the core LLM.The models were trained on an extensive dataset of 6 billion (image, text) pairs, with a knowledge cutoff date of December 2023. Both the 11B and 90B parameter models support a context length of 128k tokens and utilize Grouped-Query Attention (GQA) for improved inference scalability.
Capabilities and Use Cases
Llama 3.2-Vision models are optimized for a wide range of tasks, including:
- Visual recognition
- Image reasoning
- Captioning
- Answering general questions about an image
These models outperform many available open-source and closed multimodal models on common industry benchmarks. Their ability to process both images and text as inputs enables additional use cases such as:
- Visual Question Answering (VQA) and Visual Reasoning
- Document Visual Question Answering (DocVQA)
- Image Captioning
- Image-Text Retrieval
- Visual Grounding
Output Example
To demonstrate the capabilities of Llama 3.2-Vision, let's consider the same Apple Music Dashboard screenshot used in the previous approaches.

When we pass this image to the Llama 3.2-Vision model and ask it to generate test cases for the 'Home' button, the output would look something like this,
	
The Llama 3.2-Vision model accurately identifies the 'Home' button and generates comprehensive test cases, considering various scenarios and edge cases. The output is well-structured and easy to understand, making it a valuable tool for developers and testers.
I have got to say, the new Llama 3.2 collection has truly impressed me, especially the lightweight models. Their performance is exceptional!
Comparative Analysis
Let’s take a closer look at how these approaches stack up against each other.
AspectComputer Vision + Multi-modal LLMsGemini API IntegrationInference with Llama 3.2-VisionInput TypesImages onlyImages and videosImages, videos and textPreprocessingComputer vision model for UI element detection and annotation (mAP: 99.5%, Precision: 95.1%, Recall: 87.7%)NoneSeparately trained vision adapter for image recognitionModel ArchitectureOpenGVLab/InternVL2-8B: Combination of InternViT-300M-448px (vision) and internlm2_5-7b-chat (language)gemini-1.5-pro-latestLlama 3.1 text-only model + vision adapter with cross-attention layersModel Sizes8Bmid-size multimodal model11BPerformance Metrics49% improvement in accuracy and quality of generated test cases compared to using language model aloneProvides improved accuracy and efficiency in test case generation compared to the first approach.Outperforms both the two approaches.Inference TimeIncreased due to preprocessing with computer vision model and passing to LLM
3mins | 15-30secs (Ran it locally) | <1min (Ran it on 1xA40 GPU) || Quantisation Impact | 4-bit quantisation of 8B model reduced size and computational requirements but decreased accuracy | Not applicable | Not applicable || Key Strengths | Detailed visual context from annotated UI elements enhances language model's understanding | Ability to process videos in addition to images, customisable API requests | Advanced architecture,strong performance on visual recognition, image reasoning, captioning, and answering questions about images || Limitations | Only processes images; occasional misidentification of UI elements; increased inference times; decreased accuracy with quantisation | May misidentify other elements or become confused with them | Recent development, may require more integration effort due to advanced architecture || Scalability | Challenges due to preprocessing and increased inference times when dealing with numerous UI screenshots | Scalable due to API infrastructure, supports customising API requests and utilising AI capabilities | Promising scalability prospects due to ability to handle both image and text inputs, strong performance on visual understanding tasks |
In my experience, when it comes to UI testing, prioritising the quality and relevance of generated test cases is crucial. And that is why I find the Llama 3.2-Vision model to be the standout choice. Its sophisticated architecture and impressive performance in visual understanding make it incredibly effective for generating relevant test cases with just a straightforward prompt.
On the flip side, if you're working under tight time constraints and can afford slightly less comprehensive test cases, the Gemini API Integration approach could be a quicker option.
Ultimately, the best choice really lies on the specific needs of your UI testing project. Balancing accuracy, speed, and resource utilisation is key, and taking the time to weigh these trade offs will help guide your decision.
Future Trends
Following are some key trends and technologies that are set to redefine UI testing,
- AI-Powered Test Automation: AI streamlines automation by dynamically identifying UI elements, adapting to interface changes, and performing visual validations, resulting in more accurate testing.
- Predictive Analytics in Testing: Making use of historical data, AI can predict potential defects, allowing testing teams to focus on high-risk areas. This proactive approach enhances test coverage and improves overall efficiency
- Enhanced Test Case Generation: AI algorithms automatically generate comprehensive test cases based on application requirements and past data, minimising manual effort.
- Multimodal AI Integration: This approach combines text, images, and video analysis, enhancing testing capabilities to validate graphical interfaces and diverse inputs.
Conclusion
As we wrap things up, it's clear that the rise of multimodal large language models is set to transform how we approach UI testing and validation.
The journey doesn’t end here, it’s just the beginning.
If you're looking to build custom AI solutions for your organization or want to transform your AI product ideas into reality, we can help. At Ionio, we specialise in taking your concepts from idea to product. Contact us to start your AI journey today.







.png)

