Audio & Video quality testing | February 17, 2023

Data Validation as Part of Audio and Video Testing

Kristīna Arāja, QA Engineer

According to the ISTQB Foundation Level Syllabus, one of the typical objectives of testing is “to check whether the test object is complete and validate if it works as the users and other stakeholders expect.”

Validation is the verification of data for compliance with specified conditions and restrictions. Data validation answers the question of whether the product is being tested correctly in terms of customer expectations.

Data validation in audio and video testing

At TestDevLab, we have a dedicated data validation team that specializes in audio and video quality tests. During test execution, the testing team records media files such as network traces and audio and video recordings. From captured test files we can get all the necessary test data that we can subsequently process, analyze and provide to the client.

Metrics that are gathered and validated:

Audio: POLQA, Audio delay
Video: FPS, BRISQUE, VQTDL, VMAF, SSIM, PSNR, Video delay
Network: Bitrate, Packet loss
Performance: Battery usage, CPU, RAM, GPU

From recorded audio files we can understand how the sound changes passing through the application on certain devices under different test conditions. After processing the audio files using the POLQA algorithm, we get an estimate of sound quality and a calculated audio latency. According to this data, we can determine the audio quality, which may impact the user experience.

Based on the recorded test video files, we can measure the quality of the video and get metrics such as frame rate, video delay and freeze data. A test file can be a screen recording of the device itself or a video recording of the device with a camera, depending on the conditions and requirements for the testing process.

Audio and video quality can be affected by bitrate changes and the amount of packet loss during data transfer. Network consumption data can be gathered by analyzing captured network traces. A poor internet connection can be caused by a variety of factors, but in our specialized audio and video quality testing laboratories, we can also simulate limited network conditions.
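As a rough illustration of how bitrate and packet loss could be derived from a captured trace, here is a minimal Python sketch. The packet record format (timestamp in seconds, size in bytes) and both function names are assumptions made for this example, not the tooling used in the lab.

```python
def average_bitrate_kbps(packets):
    """Average bitrate in kbit/s for a list of (timestamp_s, size_bytes) tuples.

    The list is assumed to be sorted by timestamp.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets to measure a duration")
    duration_s = packets[-1][0] - packets[0][0]
    if duration_s <= 0:
        raise ValueError("trace duration must be positive")
    total_bits = sum(size for _, size in packets) * 8
    return total_bits / duration_s / 1000


def packet_loss_percent(sent_count, received_count):
    """Share of sent packets that never reached the receiver, in percent."""
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")
    return 100.0 * (sent_count - received_count) / sent_count


# Hypothetical trace: three 1250-byte packets over one second.
trace = [(0.0, 1250), (0.5, 1250), (1.0, 1250)]
print(average_bitrate_kbps(trace))
print(packet_loss_percent(1000, 950))
```

In a real pipeline the packet list would come from a trace parser, and the sent and received counts from comparing sender-side and receiver-side captures.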

Before providing the data to the client, it is thoroughly checked by the data validation team.

Data validation provides:

Accurate data
Clean and correctly formatted data
Transparent results to clients/stakeholders

What are the requirements for a data validation team member?

To do their job well, a data validation team member must have a thorough understanding of testing processes, metrics, and factors that can influence test results. They also need to know what problem their work should help to solve. Perseverance and attention to detail are essential in data validation. Additionally, well-developed analytical and critical thinking skills allow data validation team members to identify problems and relationships between individual facts, and make decisions based on the data received.

Data validation team members performing data validation

Here are just a few of the tasks that may be part of the responsibilities of a data validation team member:

Make sure that the data is consistent, logical and of the highest quality.
Analyze test data and compare it with competitors’ data.
Identify inconsistent test results and decide whether they require retesting or further investigation.
Find unacceptable cases in test data and request a retest.

How does data validation work?

Data validation is useful for all projects, but particularly for automation or manual projects that have a larger testing scope. At TestDevLab, our validation team uses a special tool, called a Validation bot, that allows us to cover a larger scope of data. This Validation bot is useful not only for validation tasks but also for traceability. The bot tracks the test count and test metric count in each scenario for traceability of executed tests. For validation purposes, the bot analyzes the results and returns failed tests each day, for example, if any necessary files are missing or the video/audio stream in the test is not present. With the help of the Validation bot, a much larger scope of tests can be validated automatically and we don’t need to perform repetitive tasks. This leaves us with more time to investigate, analyze and validate the dynamic parts of those tests.
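The kind of checks described above can be sketched in a few lines of Python. This is not the actual Validation bot: the test record layout, the required file names, and the function names are all hypothetical and only show the idea of returning failed tests and counting tests per scenario.

```python
from collections import Counter

# Hypothetical set of files every test is expected to produce.
REQUIRED_FILES = {"audio.wav", "video.mp4", "trace.pcap"}


def find_failed_tests(tests):
    """Return {test_id: [reasons]} for tests that fail the basic checks."""
    failed = {}
    for test in tests:
        reasons = []
        missing = REQUIRED_FILES - set(test["files"])
        if missing:
            reasons.append("missing files: " + ", ".join(sorted(missing)))
        if not test.get("has_audio_stream", False):
            reasons.append("no audio stream")
        if not test.get("has_video_stream", False):
            reasons.append("no video stream")
        if reasons:
            failed[test["id"]] = reasons
    return failed


def count_tests_per_scenario(tests):
    """Count executed tests per scenario, for traceability."""
    return Counter(test["scenario"] for test in tests)


# Illustrative records only; real data would come from the test storage.
tests = [
    {"id": "t1", "scenario": "unlimited",
     "files": ["audio.wav", "video.mp4", "trace.pcap"],
     "has_audio_stream": True, "has_video_stream": True},
    {"id": "t2", "scenario": "unlimited",
     "files": ["audio.wav", "trace.pcap"],
     "has_audio_stream": True, "has_video_stream": False},
    {"id": "t3", "scenario": "packet_loss_10",
     "files": ["audio.wav", "video.mp4", "trace.pcap"],
     "has_audio_stream": True, "has_video_stream": True},
]

failed = find_failed_tests(tests)
counts = count_tests_per_scenario(tests)
print(failed)
print(dict(counts))
```

Only the mechanical checks are automated here; anything the script flags still goes to a person for investigation.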

The validation process takes place throughout the testing process. When implementing manual tests, we need to make sure that the setup is designed correctly and meets the requirements, and check that the correct OS and app versions are being tested. After the test is completed, the test files are ready for processing. At this stage, we make sure that all files are present and convert the captured test files into test data and graphs. When the data is ready for analysis, we can move on to data validation.

Before proceeding with the validation of the test data itself, we must match file IDs, and check the names and duration of the test files. These steps are an integral part of the validation process.
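A minimal sketch of these pre-validation checks might look like the following. The naming convention (`T042_audio.wav`), the duration tolerance, and the function name are assumptions for illustration; a real project would use its own conventions.

```python
import re

# Hypothetical naming convention: <test_id>_<kind>.<ext>, e.g. "T042_audio.wav".
NAME_PATTERN = re.compile(r"^(?P<test_id>T\d+)_(?P<kind>audio|video|trace)\.\w+$")


def check_test_file(name, duration_s, expected_id, expected_duration_s,
                    tolerance_s=2.0):
    """Return a list of problems found for one captured file (empty if none)."""
    problems = []
    match = NAME_PATTERN.match(name)
    if match is None:
        problems.append("unexpected file name: " + name)
    elif match.group("test_id") != expected_id:
        problems.append(
            "file ID %s does not match %s" % (match.group("test_id"), expected_id)
        )
    if abs(duration_s - expected_duration_s) > tolerance_s:
        problems.append(
            "duration %.1fs is outside %.1fs +/- %.1fs"
            % (duration_s, expected_duration_s, tolerance_s)
        )
    return problems


print(check_test_file("T042_audio.wav", 60.5, "T042", 60.0))
print(check_test_file("T041_video.mp4", 45.0, "T042", 60.0))
```

An empty list means the file passed all three checks and its data can move on to validation.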

After confirming that the test files match the test data, we need to make sure that the test data meets the testing requirements.

For audio validation, we need to listen to audio files recorded during the test and subjectively make sure that the audio quality corresponds to the POLQA score.

POLQA is a full-reference algorithm that analyzes and evaluates the degraded or processed speech signal with respect to the original signal. It compares each sample of the reference signal (talker) to each corresponding sample of the degraded signal (listener). The difference between both signals affects the POLQA score. Specifically, the more the degraded audio differs from the reference audio, the lower the POLQA score is. POLQA results principally model mean opinion scores (MOS) that cover a scale from 1 (bad) to 5 (excellent).

We also use tools to measure audio latency to make sure that the calculated results are correct. Audio latency is a part of the POLQA algorithm. This metric shows how long it takes for the end user to receive the audio stream. Each sample that is sent during the call is compared to the one that is received. That way the algorithm generates the average audio delay value for every sample. It is represented in milliseconds.

A graphical representation of the data helps us to determine at what point in the test there was a change in the audio quality. Here you can see an example of POLQA score data for three tests. The average result is about 4.7, which is close to the maximum. However, in the 19th sample of the third test, the audio quality drops to a score of 2.7. In this case, we need to determine the reason for the deterioration in quality.
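Besides reading the graph, a drop like this can also be flagged programmatically. The sketch below marks samples that fall well below the test average; the `max_drop` threshold of 1.0 is an arbitrary assumption for illustration, not a value defined by POLQA, and the score list is made up to mirror the case described above.

```python
from statistics import mean


def find_score_drops(scores, max_drop=1.0):
    """Return (sample_number, score) pairs that fall well below the average.

    sample_number is 1-based to match how samples are numbered in a report.
    """
    average = mean(scores)
    return [
        (number, score)
        for number, score in enumerate(scores, start=1)
        if average - score > max_drop
    ]


# Illustrative data: 20 samples around 4.7 with one drop at the 19th sample.
scores = [4.7] * 18 + [2.7] + [4.7]
print(find_score_drops(scores))
```

The flagged sample numbers tell the validator which recordings to listen to first.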

When listening to this sample (pictured below), it turned out that part of the third phrase is missing. After passing through the application, the phrase did not reach the receiver. This is what affected the audio quality in this particular sample.

Audio sample showing reference and degraded data

In our experience, we have noticed different kinds of distortion in audio samples, and some cases are considered unacceptable and need to be retested or sent for further investigation. For example, we noticed a POLQA score drop in one of the test audio samples under unlimited network conditions. This is an extremely rare and unexpected case that should be carefully investigated. A ticking sound in the sender and receiver audio, disturbing background noise, or a notification sound are further examples of distortions that should be sent for further investigation.

How do we measure video quality?

To measure video quality, we use two approaches: objective (calculation of metrics) and subjective (poll of experts and calculation of the average result). Objective measurement of quality has many advantages but we need to find out to what extent the results are comparable to subjective quality. To find this out we perform video data validation.

Objective quality assessment is carried out using the full-reference quality metrics VMAF, PSNR and SSIM. These metrics work by comparing the reference video to the degraded video and expressing the difference as a score.

Let’s take a closer look at the full-reference quality metrics mentioned above:

VMAF

Video Multi-Method Assessment Fusion (VMAF) is a perceptual video quality assessment algorithm developed by Netflix. It predicts subjective video quality based on a reference and distorted video sequence. The metric can be used to evaluate the quality of different video codecs, encoders, encoding settings, or transmission variants. VMAF scores range from 0 to 100, with 0 indicating the lowest quality and 100 the highest.

PSNR

Peak signal-to-noise ratio (PSNR) is an engineering term for the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. PSNR is expressed in decibels, and higher values indicate better quality. In practice, scores typically fall on a scale from about 20 to 60 dB.
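For reference, PSNR is commonly computed as 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between the reference and degraded frames. A minimal pure-Python sketch follows; it assumes 8-bit pixel values (MAX = 255) passed as flat sequences, which is a simplification of how a real video tool would handle frames.

```python
import math


def psnr(reference, degraded, max_value=255):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: there is no noise to measure
    return 10 * math.log10(max_value ** 2 / mse)


# Every pixel is off by exactly 1, so the mean squared error is 1.
reference = [100, 100, 100, 100]
degraded = [101, 99, 101, 99]
print(round(psnr(reference, degraded), 2))
```

For video, a value like this would be computed per frame and then averaged or plotted over time.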

SSIM

The Structural Similarity Index Measure (SSIM) is a perceptual metric that quantifies image quality degradation caused by processing, such as data compression, or by losses in data transmission. It is a full-reference metric that requires two images from the same image capture: a reference image and a processed image.

SSIM scores go from 0 to 1, where scores that are closer to 1 indicate that the degraded image is more similar to the reference image. Results from 0.97 to 1 show minimal degradation, results from 0.95 to 0.97 represent low degradation, and results below these ranges indicate medium and heavy degradation.
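The ranges above can be turned into a small lookup, sketched here in Python. The function name is hypothetical, and since 0.97 appears in both ranges in the text, this sketch assumes that the boundary value counts as minimal degradation.

```python
def ssim_degradation(score):
    """Map an SSIM score in [0, 1] to the degradation bands described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SSIM score is expected to be between 0 and 1")
    if score >= 0.97:  # assumption: the shared boundary counts as minimal
        return "minimal"
    if score >= 0.95:
        return "low"
    return "medium or heavy"


print(ssim_degradation(0.98))
print(ssim_degradation(0.96))
print(ssim_degradation(0.80))
```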

Example of full-reference evaluation

Here are the scores we got when evaluating this example using these full-reference quality metrics:

VMAF – 76
PSNR – 35.8
SSIM – 0.98

For a subjective assessment of the quality of the video, we examine BRISQUE and VQTDL metrics. The advantage of this method is the ease of interpretation of the obtained estimates, since they are directly related to human perception.

The BRISQUE score is produced by a trained algorithm, the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE), which calculates a no-reference image quality score. For this metric, there is no reference, which means that there is no “one best version” to which we are comparing all other videos. The algorithm was trained using a large number of frames/images from video.

Video Quality Testing with Deep Learning—or VQTDL—is a no-reference algorithm for video quality assessment developed by TestDevLab. This solution produces image quality predictions that correlate well with human perception and offers good performance under diverse circumstances, such as various network conditions, platforms, and applications.

To confirm the qualitative assessment of subjective metrics, we analyze the test video for artifacts that can affect the assessment.

Examples of image artifacts that can affect and reduce image quality:

Examples of no-reference evaluation

Visible problems but image quality is still acceptable and understandable

Bigger objects can be seen, but more detailed parts blend together

Hard to tell apart bigger details in the pictures

Objects blend together with the background

During the video validation process, we determine whether the received scores correspond to the quality of the test video, and we also compare the two quality evaluation methods to check the correlation between them.

In this example, all test videos have the same data trend, and the deterioration in video quality in the third test is reflected in both the objective and subjective video ratings.

Test 3, 10th second of test: VMAF ~77, BRISQUE ~4.2

Test 3, 38th second of test: VMAF ~22, BRISQUE ~1.89
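One way to check how closely the two evaluation methods follow each other is a Pearson correlation over the per-second scores. The sketch below is an assumption about how such a check could be done, not a description of the team's tooling, and the two series are illustrative values loosely based on the figures above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Illustrative per-second scores for one test (not measured data).
vmaf = [77, 75, 60, 22, 30]
brisque = [4.2, 4.1, 3.5, 1.89, 2.3]
print(round(pearson(vmaf, brisque), 3))
```

A coefficient close to 1 suggests both metrics react to the same quality changes; a low value is a signal to review the video manually.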

After a full objective and subjective evaluation of the video and additional video analysis performed by a validation team member, we can provide qualitative test data for reporting and detailed information about the application behavior.

For testing, TestDevLab adapts existing algorithms and methods to suit different types of content and creates metrics that are more flexible and reliable based on project needs. Having a data validation team is an essential part of testing. Thanks to the validation team, the following data analysis capabilities are made possible:

Time series graphs per metric

Time series graphs showing VMAF and BRISQUE scores

Average comparison bar charts per metric

Average comparison bar chart for audio delay

Competitive benchmarking heatmaps

Competitive benchmarking heatmap showing POLQA score

Audio spectrogram analysis

Image data histograms

After audio and video testing is complete, we provide the client with fully analyzed test data, compare it with competitors’ results, identify and investigate all issues encountered during testing, and accurately describe the behavior of the application. In turn, this helps clients gain valuable and actionable insights about their application and whether it lives up to their expectations as well as the expectations of their users.

How confident are you that your current tests are effective in covering a wide range of scenarios and detecting key issues? Is your team using a variety of metrics to measure audio and video quality? Our team of audio and video quality experts can provide you with useful data that will help you ensure that your solution is at the top of its game. Contact us and let’s discuss your project.