Video Quality Metrics: Temporal and Spatial Features for Video Quality Assessment

There are many different factors that contribute to a good video call experience: how fluid the content is, what the video quality is like, and what influences both. At TestDevLab, we have developed various ways to analyze these factors and gather metrics that we later use to assess what makes a good video call.

Temporal features for video quality assessment

Fluidity, or frame rate, is one of the most important aspects of a video: it is the rate at which consecutive images are displayed. A call with many stutters and freezes can result in a very negative experience, so it is important to understand how video fluidity is affected in your app under different circumstances. To measure video fluidity, or frame rate, we encode the test video with special ArUco markers that can be set to change at a desired frame rate. This gives us a part of the video that we know should always change, and we can use it to determine the video frame rate under different conditions.
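
As a rough illustration of the idea, the sketch below stamps an ArUco marker that changes at a target rate onto each frame of a source video using OpenCV. The file names, marker dictionary, placement, and 30 FPS marker rate are all assumptions for this example; our actual tooling embeds a row of markers and differs in detail.

```python
import cv2
from cv2 import aruco

# Hypothetical paths and marker rate -- assumptions for this sketch.
SOURCE = "source.mp4"
OUTPUT = "test_video.mp4"
MARKER_RATE = 30  # how many times per second the marker value should change
DICTIONARY = aruco.getPredefinedDictionary(aruco.DICT_4X4_50)

cap = cv2.VideoCapture(SOURCE)
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter(OUTPUT, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # The marker ID advances MARKER_RATE times per second, so this region of
    # the video is guaranteed to change at exactly that rate.
    marker_id = int(frame_idx / fps * MARKER_RATE) % 50
    marker = aruco.generateImageMarker(DICTIONARY, marker_id, 100)  # drawMarker() on older OpenCV
    frame[10:130, 10:130] = 255                                     # white padding around the marker
    frame[20:120, 20:120] = cv2.cvtColor(marker, cv2.COLOR_GRAY2BGR)
    out.write(frame)
    frame_idx += 1

cap.release()
out.release()
```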

The first implementation of how we got frame rate data read the content of the ArUco markers and decoded their values. While error-free in most cases, this method proved very difficult to use in low-quality, heavily degraded scenarios, where the markers could be degraded so much that their values were hard to read correctly. Those cases required a lot of manual work to reach the precision needed to deliver correct results. Processing longer or higher-resolution videos could also take quite a while, as the video first had to be converted into frame images written to the storage drive and only then read back one by one.
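
The core of that approach can be sketched as follows: decode the marker IDs inside a fixed crop of each frame using OpenCV's aruco module. The crop coordinates and dictionary are placeholders, and older OpenCV versions expose aruco.detectMarkers() instead of the ArucoDetector class.

```python
import cv2
from cv2 import aruco

# Hypothetical static crop where the markers are expected (rows, cols).
CROP = (slice(0, 140), slice(0, 560))

dictionary = aruco.getPredefinedDictionary(aruco.DICT_4X4_50)
detector = aruco.ArucoDetector(dictionary, aruco.DetectorParameters())

def read_marker_ids(frame):
    """Return the decoded ArUco IDs found in the static crop, or None.

    Heavy image degradation can make decoding fail, which is exactly the
    weakness of this approach described above.
    """
    region = cv2.cvtColor(frame[CROP], cv2.COLOR_BGR2GRAY)
    _corners, ids, _rejected = detector.detectMarkers(region)
    return None if ids is None else sorted(int(i) for i in ids.flatten())
```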

Since then, we have both developed a new method for reading frame rate and improved the existing one. One of the first big improvements was automatic ArUco marker detection. The older method relied on cropping a static part of the video where the ArUco markers were located and reading that, which meant the cropping region had to be changed manually whenever the markers were not in their usual place on the left side of the video. Automatic ArUco marker detection eliminates this by searching the video for four regular squares in a line. The squares are detected using edge detection between the white padding around each square and the black border of the square itself. The markers won't be found if there are any obstructions, if the white padding is not visible or is too small, or if the video is stretched or distorted to the point where the squares are no longer regular.
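
A rough sketch of this kind of search is shown below: find square-like contours along the edge between the white padding and the black marker border, then look for four of them with matching size and vertical position. The thresholds are illustrative, not the production values.

```python
import cv2

def find_marker_row(gray):
    """Look for four similar, roughly square contours lying on one horizontal
    line and return their combined bounding box (x, y, w, h), or None."""
    edges = cv2.Canny(gray, 50, 150)  # edge between white padding and black border
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    squares = []
    for contour in contours:
        approx = cv2.approxPolyDP(contour, 0.03 * cv2.arcLength(contour, True), True)
        x, y, w, h = cv2.boundingRect(approx)
        if len(approx) == 4 and w > 20 and 0.9 < w / h < 1.1:  # a regular square
            squares.append((x, y, w, h))

    for x, y, w, h in squares:
        # Squares belonging to the marker row share a vertical position and size.
        row = [s for s in squares
               if abs(s[1] - y) < 0.2 * h and abs(s[2] - w) < 0.2 * w]
        if len(row) >= 4:
            left = min(s[0] for s in row)
            top = min(s[1] for s in row)
            right = max(s[0] + s[2] for s in row)
            bottom = max(s[1] + s[3] for s in row)
            return left, top, right - left, bottom - top
    return None
```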

The older frame rate reading method has also been improved. One of its main problems was processing time, as the video had to be converted into frames that were written to the storage drive. The improved method decodes the video and the ArUco markers straight to memory, greatly decreasing processing time: a 3-minute test that previously took about a minute to process now takes about 10 seconds.
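
A minimal sketch of such an in-memory pipeline, assuming a helper like the read_marker_ids() shown earlier: frames are decoded one at a time with cv2.VideoCapture and never written to disk, and the frame rate for each second is the number of times the decoded marker values change.

```python
import cv2
from collections import defaultdict

def frame_rate_from_markers(path, read_marker_ids):
    """Decode the video frame by frame in memory and count, for every second
    of the recording, how many times the decoded marker IDs change."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    changes = defaultdict(int)
    prev_ids, frame_idx = None, 0
    while True:
        ok, frame = cap.read()        # the frame stays in memory, nothing hits the disk
        if not ok:
            break
        ids = read_marker_ids(frame)  # e.g. the helper sketched above
        if ids is not None and ids != prev_ids:
            changes[int(frame_idx / fps)] += 1
            prev_ids = ids
        frame_idx += 1
    cap.release()
    return dict(changes)              # {second: distinct frames observed}
```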

On top of all this, we have also developed a completely different method for gathering frame rate data. The previous method relied on reading the ArUco marker content, which could become problematic under high image degradation: the marker content could be so distorted that it was very difficult to read correctly, and those cases had to be checked manually to keep the results accurate. The new method works by counting the number of distinct frames in each second. It first finds the area of the video where the ArUco markers are located, then splits the video into frames and compares each frame with the next one for differences, including the SSIM (structural similarity index) value between them. To ensure the correctness of the frame rate data, a confidence metric is also calculated: it compares consecutive frames and decreases whenever the difference exceeds a certain threshold. With this method, frame rate is read correctly even under high image degradation, where the previous method could have produced incorrect results.
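
The sketch below illustrates the idea, assuming the marker region has already been located (for example with find_marker_row() above). Consecutive frames are compared with SSIM from scikit-image; the two thresholds and the confidence penalty are illustrative values, not the ones we use in production.

```python
import cv2
from skimage.metrics import structural_similarity as ssim

DIFF_THRESHOLD = 0.995  # SSIM below this counts as a "new" frame -- illustrative
CONF_THRESHOLD = 0.5    # suspiciously large changes lower the confidence -- illustrative

def frame_rate_by_difference(path, marker_box):
    """Count distinct frames per second inside the marker region, without
    reading the marker content, and track a rough confidence score."""
    x, y, w, h = marker_box
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    per_second, confidence = {}, 1.0
    prev, frame_idx = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
        if prev is not None:
            score = ssim(prev, roi)
            if score < DIFF_THRESHOLD:  # the marker region changed -> a new frame
                second = int(frame_idx / fps)
                per_second[second] = per_second.get(second, 0) + 1
            if score < CONF_THRESHOLD:  # change too large to be a normal frame step
                confidence = max(0.0, confidence - 0.01)
        prev, frame_idx = roi, frame_idx + 1
    cap.release()
    return per_second, confidence
```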

ArUco markers with high image degradation

Even a frame like the example above would be counted correctly, since this method does not need to read the content of the ArUco markers, only compare the marker region with the next frame and check for differences.

Another metric that contributes to video fluidity is video freezes. A video call with frequent and long video freezes can lead to a very degraded experience, especially when visual content such as a presentation is being shared.

Our previous implementations of freeze detection used the ffmpeg freezedetect filter; however, this proved quite unreliable. The noise threshold had to be changed often because of changes in video layout, size, and similar factors, and there were cases where noise in a static image was detected as movement when there was none.
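
For reference, a typical freezedetect run looks something like the following; the noise and duration thresholds here are illustrative, which is exactly the tuning problem described above.

```python
import re
import subprocess

# Illustrative thresholds: -60 dB noise tolerance, freezes of at least 0.5 s.
cmd = [
    "ffmpeg", "-hide_banner", "-i", "call_recording.mp4",
    "-vf", "freezedetect=n=-60dB:d=0.5",
    "-an", "-f", "null", "-",
]
result = subprocess.run(cmd, capture_output=True, text=True)

# freezedetect reports freeze_start / freeze_duration / freeze_end on stderr.
for line in result.stderr.splitlines():
    match = re.search(r"lavfi\.freezedetect\.(freeze_\w+): ([\d.]+)", line)
    if match:
        print(match.group(1), match.group(2))
```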

Our custom method of freeze detection works quite similarly to the newer frame rate calculation method that compares consecutive frames. The ArUco markers are detected just as for frame rate, and the video is decoded to 60 frames per second. Freezes are then found by checking for differences between consecutive frames, which also gives us the start time and duration of each freeze.
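
A minimal sketch of that approach, assuming the frames have already been decoded to a fixed 60 FPS timeline; the difference and minimum-duration thresholds are again only illustrative.

```python
import cv2

FPS = 60                # the timeline the video is decoded to
DIFF_THRESHOLD = 1.0    # mean absolute pixel difference treated as "no change" -- illustrative
MIN_FREEZE = 0.5        # ignore freezes shorter than this, in seconds -- illustrative

def detect_freezes(frames):
    """Given frames sampled at FPS, return (start_time, duration) tuples for
    stretches where consecutive frames are practically identical."""
    freezes, start = [], None
    for i in range(1, len(frames)):
        diff = cv2.absdiff(frames[i - 1], frames[i]).mean()
        if diff < DIFF_THRESHOLD:
            if start is None:
                start = (i - 1) / FPS
        elif start is not None:
            duration = i / FPS - start
            if duration >= MIN_FREEZE:
                freezes.append((start, duration))
            start = None
    if start is not None and len(frames) / FPS - start >= MIN_FREEZE:
        freezes.append((start, len(frames) / FPS - start))
    return freezes
```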

Spatial features for video quality assessment

Another part of the overall video call experience is video quality. Here at TestDevLab we have methods to assess the quality of videos with both full-reference and no-reference metrics. Full-reference algorithms need both the degraded and the original video for comparison, whereas machine-learning-based no-reference metrics need a previously created training dataset that is then used to evaluate the degraded video.

The main full-reference algorithm we use is VMAF, an open-source full-reference video quality assessment algorithm developed by Netflix. It evaluates the degraded video content against the original source video. To achieve correct results, we read the content of the ArUco markers to time align the reference and the degraded video. Our first implementation decoded the degraded video into RGB .png frames and then time aligned and evaluated them. VMAF, however, requires the content to be in YUV format, so the conversion from YUV video to RGB and back to YUV for evaluation was redundant. One of the bigger improvements to our VMAF implementation was getting rid of this unnecessary YUV -> RGB -> YUV conversion by decoding the degraded video straight to YUV format.
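
For illustration, once two videos are time aligned and share a resolution, a VMAF score can be obtained with FFmpeg's libvmaf filter, as in the sketch below; the file names are placeholders and the exact JSON layout depends on the libvmaf version. Our production pipeline, described next, does not shell out like this.

```python
import json
import subprocess

# Assumes degraded.mp4 and reference.mp4 are already time aligned and have
# the same resolution; the distorted video is the first input.
cmd = [
    "ffmpeg", "-hide_banner",
    "-i", "degraded.mp4", "-i", "reference.mp4",
    "-lavfi", "libvmaf=log_fmt=json:log_path=vmaf.json",
    "-f", "null", "-",
]
subprocess.run(cmd, check=True)

with open("vmaf.json") as f:
    report = json.load(f)
print("Mean VMAF:", report["pooled_metrics"]["vmaf"]["mean"])
```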

Nevertheless, the biggest improvement we have made so far is switching from Python to C++. This implementation of the full-reference algorithms decodes the videos straight to memory instead of decoding them to storage, which greatly reduces processing times and overall storage requirements. Switching to C++ has also improved the inter-process communication of the algorithm. The first Python implementations were built on top of the CLI tools of FFMPEG and VMAF, and the evaluated frames were passed between separate processes using the disk or pipes. With C++ we can use the underlying C libraries of FFMPEG and VMAF, which means that the decoded frames are used within the same process without the need to copy them between separate processes, making the whole pipeline faster and more efficient.

The first no-reference quality metric we used was BRISQUE, a no-reference image quality assessment metric that measures the naturalness of an image by extracting natural scene statistics, calculating feature vectors, and predicting the score using machine learning algorithms such as SVM (support vector machine). For our purposes, we train BRISQUE and other no-reference image quality assessment algorithms on datasets that have previously been evaluated subjectively. This unfortunately takes a lot of time and manual work, as the datasets have to be quite comprehensive to achieve acceptable results. Also, since BRISQUE mainly measures the naturalness of an image by fitting the normalized image to a generalized Gaussian distribution curve, content that does not contain a natural scene, for example a presentation slide, will not fit this curve, and BRISQUE will have difficulty evaluating it correctly. BRISQUE also suffers from changes in ROI (region of interest): even slight changes can impact the final results. It also does not adapt well to sudden changes in quality, so it is often necessary to manually check different scale parameters to get accurate results. Because of these problems with BRISQUE, we have developed our own no-reference metric: VQTDL.
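
As an aside, a BRISQUE score for a single frame can be computed with the quality module from opencv-contrib, given a trained SVM model and a feature range file; the file names below are the sample files distributed with OpenCV and stand in for models trained on our own datasets.

```python
import cv2

# Requires opencv-contrib-python. The two .yml files are a trained SVM model
# and the matching feature range file; OpenCV ships sample versions of both.
frame = cv2.imread("frame.png")
score = cv2.quality.QualityBRISQUE_compute(
    frame, "brisque_model_live.yml", "brisque_range_live.yml"
)
print("BRISQUE score:", score[0])  # lower scores generally mean a more natural-looking image
```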

VQTDL is a no-reference image quality assessment metric developed by TestDevLab. Based on TRIQ and the TensorFlow framework, with ResNet-50 as a backbone, VQTDL correlates well with subjective human perception and works well across different network conditions, platforms, and applications. It was developed especially for real-time streaming and WebRTC products, with the main content focus on web conferencing and talking-head scenarios. VQTDL was tested under many diverse conditions and compared against BRISQUE. One of the scenarios we tested was changes in position and UI, where VQTDL showed much more stable results than BRISQUE, whose quality scores were affected by even slight changes in ROI position.

VQTDL and BRISQUE MOS scores against changes in cropped area

Another area where VQTDL showed very promising results was the correlation between different device platforms and user counts. Under different network conditions, VQTDL results maintained a high correlation across all device platforms, higher than 96% in all cases.

The Pearson linear correlation coefficient divided into three test groups: Android to Android, Android to iOS and iOS to iOS

VQTDL also showed much lower error margins than BRISQUE in multi-user scenarios, especially in 8-user calls, where the average error for VQTDL was 75% lower than the average error for BRISQUE.

VQTDL and BRISQUE PLCC and average error scores

VQTDL is also built on much more modern technology than BRISQUE and is able to detect many more underlying features, whereas BRISQUE mostly relies on statistics of locally normalized luminance coefficients to measure the naturalness of a given image.

Overall, our testing has shown that VQTDL provides results that correlate much better with human perception than BRISQUE and can handle many diverse circumstances.

What’s next and how to keep improving video quality assessment

To better understand the causes of low image quality, we have also implemented image blurriness as a standalone metric. It is calculated using OpenCV's Laplacian function. The Laplacian operator highlights regions of an image that contain rapid changes in intensity and is often used for edge detection. For blurriness, these rapid changes can be used as a measure: if the Laplacian of an image is high, the image contains clear edge and non-edge regions and is not blurry; if it is low, there are few detectable edges and the image is most likely blurry. There is, however, a drawback: the results are relative, so the same score can correspond to a different subjective quality for different videos. To solve this, it is necessary to gather a small set of media and calculate a baseline for it.
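
The measurement itself boils down to the variance of the Laplacian, as in the short sketch below; the baseline handling described above is not shown.

```python
import cv2

def blurriness_score(frame_bgr):
    """Variance of the Laplacian: low values mean few detectable edges, i.e. a
    likely blurry frame. Scores are relative, so they should be compared
    against a baseline gathered for the same kind of content."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()
```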

By analyzing these metrics in combination, we are able to determine the usability of the tested scenarios and the points where they can be improved. A video call with a very high frame rate but extremely blurry, degraded content and a call with only minor degradation that is constantly frozen will both result in a negative experience. Having these different video metrics allows us to see exactly where the points of failure and improvement are.

We are constantly developing our existing metrics and researching new ones, so that we can test even more new diverse scenarios and further understand the different aspects of video calls.

Do you have an application with audio and video capabilities? We can help you test it using a variety of video quality metrics. Contact us and let’s discuss your project.

Jānis Iesalnieks