No-Reference Image Quality Assessment Algorithm Training


When testing video applications we use video quality assessment algorithms, and image quality assessment is one of the main components of video quality assessment. The goal of image quality assessment algorithms is to provide insights about the quality of an image. There are two main types of image quality assessment algorithms: no-reference and full-reference. The difference between them is that full-reference algorithms compare the original and degraded versions of an image, whereas no-reference algorithms only have the degraded version available.

There are two main types of no-reference image quality assessment algorithms:

  1. Subjective quality assessment no-reference algorithms. These algorithms try to predict human perceived visual quality and are usually constructed using a training approach.
  2. Technical quality assessment no-reference algorithms. These algorithms try to predict image quality features like level of blur or blockiness. They can be constructed using only a mathematical approach and it might not be necessary to use a training approach.

When working with pre-trained no-reference image quality assessment algorithms you might find that the model is not optimal for your application. The main indicators of a poorly performing model are low correlation and high deviation between predicted scores and subjective scores. In general, there is no universal model that will work for every application. For example, you might get poor results when rating video calls with relatively low image quality if the model was trained on high-quality content, such as video streaming. Bad results alone do not let you conclude that an algorithm is bad; first you need to understand what data was used to train it.

Training process

Workflow

No-reference image quality assessment algorithm training can be divided into four main steps:

  1. Dataset creation and image labeling.
  2. Feature extraction by image quality algorithm.
  3. Model creation using machine learning algorithms.
  4. Model validation.

[Figure: the no-reference image quality assessment algorithm training process]

The sequence of these steps cannot be changed because each step depends on the output of the previous one.

Dataset

There are two options to choose from when collecting datasets:

  1. Use a ready-made dataset. There are various publicly available datasets; a few popular examples are the TID2013 and LIVE Image Quality datasets. When choosing a dataset you need to research what types of images are used, what types of degradations are present, how the images are labeled, and whether the dataset suits your application.
  2. Create a custom dataset using your own data.

The main advantage of using an already available dataset is time: dataset creation is the most time- and resource-consuming step in training a model. If you can find a suitable training dataset, you should use it. On the other hand, the main advantage of creating a custom dataset is that you control the data used for training; the cost is the additional time and resources.

There are two main steps to create a new training dataset:

  1. Dataset collection
  2. Dataset labeling

When creating a new dataset, the first step is dataset collection. If you have decided that you need a custom model for your application, then most probably the dataset will be created from images generated by the system whose image quality you intend to assess. For example, in a video conferencing app you could record received videos and extract images for the training dataset from them.

Another approach is to use synthetic data. Here, images of varying quality are generated by degradation algorithms that apply distortions such as blurriness and blockiness; a minimal code sketch follows the list below. The main advantage of this approach is that you can create a larger dataset than with manual image selection; however, you might not capture all types of real-life degradations. There are a couple of things you should consider when creating a training dataset:

  • The dataset should cover the whole spectrum of possible image quality. It is not possible to create a reliable model if you use only good or bad quality images.
  • Score distribution should be even between the different categories. The image below shows good and bad examples of dataset distribution. Aim for differences between groups of no more than 10%.
[Figure: good and bad examples of dataset score distribution]
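
As an illustration of the synthetic approach, here is a minimal sketch using Pillow. The folder names, degradation levels, and the choice of Gaussian blur plus JPEG compression (as a stand-in for blockiness) are all assumptions for the example, not a prescribed recipe:

```python
# Minimal sketch of synthetic dataset generation with Pillow.
# Folder names, degradation levels, and the blur + JPEG combination
# are illustrative assumptions, not a prescribed recipe.
from pathlib import Path
from PIL import Image, ImageFilter

SRC_DIR = Path("pristine_images")   # hypothetical folder of clean sources
OUT_DIR = Path("degraded_images")
OUT_DIR.mkdir(exist_ok=True)

def degrade(img, blur_radius, jpeg_quality, out_path):
    """Apply Gaussian blur, then JPEG compression (introduces blockiness)."""
    blurred = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    blurred.save(out_path, format="JPEG", quality=jpeg_quality)

for src in SRC_DIR.glob("*.png"):
    img = Image.open(src).convert("RGB")
    # Sweep several degradation levels per source image so the dataset
    # covers the whole quality spectrum, from near-pristine to very bad.
    for level, (radius, quality) in enumerate([(0.0, 95), (1.0, 60), (2.5, 35), (5.0, 15)]):
        degrade(img, radius, quality, OUT_DIR / f"{src.stem}_level{level}.jpg")
```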

The second step is dataset labeling. Labeling can be considered the most critical part of the whole training workflow because it has a direct impact on model accuracy.

There are two main approaches to labeling data:

  1. Subjective labeling
  2. Objective labeling

Subjective labeling involves gathering subjective scores for each image in the dataset. The main aspects to consider when doing subjective image evaluation are the rating scale and the preparation of evaluation events. The most commonly used rating scale is the Absolute Category Rating (ACR) scale, which is used to measure the Mean Opinion Score (MOS). The MOS rates the overall quality of the image.

Usually the range is 1–5, where 1 is the worst quality and 5 is the best. However, the range could be different, for example 1–100. The table below maps image quality categories to MOS scores:

MOS | Image quality
----|--------------
1   | Bad
2   | Poor
3   | Medium
4   | Good
5   | High

There are two options if MOS is used to measure quality. The first is a continuous scale, where the person doing the evaluation can assign any value in the range. The second is to allow a limited set of values, for example in steps of 0.25. At this stage you need to take into account the properties of the no-reference image quality assessment algorithm that will be trained on the data: is it doing classification, where the output is a class label, or solving a regression problem, where the output is a continuous value? The type of the subjective labels has to match the type of the output.

Another subjective rating approach is to compare pairs of images. Some possible answers could be:

  1. Quality is the same.
  2. Quality is slightly better.
  3. Quality is much better.

If pairwise comparison has been used, it is then necessary to apply an algorithm that assigns a quality label to each image from the nonnumerical comparison data. In most cases, such algorithms convert the comparison data to a continuous numerical scale. It is also possible to combine categorical and comparison ratings, where part of the dataset is also rated using MOS; the output then carries more information about absolute ratings, and numerical values can be assigned using both inputs.
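
The text above does not name a specific conversion algorithm; one common choice for turning pairwise preferences into continuous scores is the Bradley-Terry model. Below is a minimal sketch, assuming the comparisons have already been reduced to win counts (ties and the "slightly better" gradation are ignored for simplicity, and the data is illustrative):

```python
# Minimal Bradley-Terry fit: converts pairwise "image i was preferred
# over image j" counts into continuous quality scores. The win matrix
# below is illustrative.
import numpy as np

def bradley_terry(wins, iters=200):
    n = wins.shape[0]
    scores = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (scores[i] + scores[j])
                        for j in range(n) if j != i)
            if denom > 0:
                scores[i] = total_wins / denom
        scores /= scores.sum()  # fix the scale; only score ratios are identified
    return scores

wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])    # wins[i][j]: times image i beat image j
print(bradley_terry(wins))      # higher score = higher perceived quality
```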

To achieve the best results, subjective rating events should be designed following these guidelines:

  • People doing subjective evaluation should follow the same guidelines. To achieve this goal, you need to create a document where you describe the properties of image quality and provide visual examples. This has to be done because different people might have different assumptions about what is good and what is bad image quality. It is possible to check if all people doing subjective evaluations are making the same assessments by asking them to assess a limited dataset and comparing results between different persons for the same images.
  • Rating events should be organized in a controlled environment where you can control the lighting, size of monitors, and viewing distance, as these factors can affect the subjective impression of images.
  • Images should be rated in a random order to avoid a situation where, for example, 100 good-quality images are rated continuously and then 100 bad images are rated continuously. This will produce less reliable results because people tend to adapt to the images that they are rating and when an image of medium quality is presented after a long sequence of good-quality images, it will be rated relatively lower than the actual image quality.

Another option is to do dataset labeling using objective quality metrics like VMAF, PSNR, SSIM, blurriness, and others. If a full-reference metric like SSIM is used, then images in the dataset must have undistorted versions to compare against and they must be alignable.
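
As a sketch of what objective labeling can look like, here is SSIM-based labeling with scikit-image. The directory layout and the assumption that each degraded image has an aligned, undistorted counterpart with the same filename are illustrative:

```python
# Objective labeling with SSIM (scikit-image >= 0.19 for channel_axis).
# The directory layout and same-filename pairing are illustrative.
import csv
from pathlib import Path

from skimage.io import imread
from skimage.metrics import structural_similarity

REF_DIR = Path("reference_images")   # undistorted originals
DEG_DIR = Path("degraded_images")    # aligned degraded counterparts

with open("labels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["image", "ssim"])
    for deg_path in sorted(DEG_DIR.glob("*.png")):
        ref = imread(REF_DIR / deg_path.name)
        deg = imread(deg_path)
        score = structural_similarity(ref, deg, channel_axis=-1)  # color images
        writer.writerow([deg_path.name, round(float(score), 4)])
```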

The main advantages of objective dataset labeling are:

  1. Easy labeling process.
  2. Larger datasets can be evaluated.
  3. More consistent data.

One possible disadvantage is that objective ratings might not fully correlate with subjective ratings. This matters if you intend to train no-reference algorithms to predict subjective quality, and it can be mitigated by mapping objective ratings against subjective ratings.

After creation, the dataset is split into training and testing datasets. Usually, 70% of the data is used to train the model and 30% is used to validate its accuracy.
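
A minimal sketch of that split using scikit-learn (the feature and label arrays here are placeholders):

```python
# 70/30 split into training and validation data with scikit-learn.
# X and y are placeholders standing in for real features and labels.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 5))        # placeholder: 100 images, 5 features each
y = rng.uniform(1, 5, 100)      # placeholder MOS labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)  # fixed seed for reproducibility
print(len(X_train), len(X_test))           # 70 30
```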

Feature extraction

[Figure: feature extraction]

Feature extraction is the first step after dataset creation and labeling. The input is the dataset of images. Each image is passed to a feature extractor, and the output is a list of features for the whole dataset. The features are defined by the no-reference image quality prediction algorithm being trained: they can be levels of artifacts, like blurriness and blockiness, or purely mathematical features, like distribution functions.
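
To make this concrete, here is a sketch that extracts two simple hand-crafted features with OpenCV. The specific features are illustrative; a real no-reference algorithm defines its own feature set:

```python
# Extract two simple quality features per image with OpenCV.
# The feature choice is illustrative; a real no-reference algorithm
# defines its own feature set.
import cv2

def extract_features(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)
    # Variance of the Laplacian: a common blur proxy (low = blurry).
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    # Global contrast: standard deviation of pixel intensities.
    contrast = float(gray.std())
    return [float(sharpness), contrast]

features = [extract_features(p) for p in ["img_001.png", "img_002.png"]]  # paths illustrative
```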

Prediction model

[Figure: prediction model]

When creating the prediction model, the input is the list of features and the corresponding labels. Both are used by the training algorithm to produce the prediction model. The training algorithm is determined by the type of labels: for categorical data, classification algorithms are used, and for continuous data, regression algorithms are used. The choice of a specific classification or regression algorithm is made by the architects of the no-reference image quality prediction algorithm being trained and depends on the characteristics of the quality features used. The output is a prediction model that can predict the image quality of unseen images.
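
As a sketch of the regression case, here is a model trained with scikit-learn's support vector regression. SVR is one common choice in no-reference quality prediction, but the original text does not prescribe an algorithm, and the data below is placeholder:

```python
# Train a regression model mapping feature vectors to continuous quality
# scores. SVR is one common choice in no-reference quality prediction;
# the data here is placeholder.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.random((70, 5))       # placeholder feature vectors
y_train = rng.uniform(1, 5, 70)     # placeholder continuous MOS labels

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
predictions = model.predict(X_train[:3])  # unseen images are scored the same way
```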

Model validation

Model validation is done to ensure that the prediction model makes correct predictions and to measure its accuracy.

[Figure: model validation]

The first step in validating the model is to predict scores for the images in the validation dataset using the trained model; this returns the predicted values. For each image there is also the actual score, and in the validation step the two are compared. Two popular metrics characterize the accuracy of the model: correlation and deviation. Correlation indicates the relationship between two datasets, and deviation indicates the absolute difference between them. The correlation range is -1 to 1, where -1 is total negative correlation and 1 is total positive correlation; a correlation greater than 0.7 is considered high. The range of deviation is the range of the model output: if the model predicts in the range of 1 to 10, then the highest possible deviation is 9 and the lowest is 0. A deviation lower than 10% of the prediction range is acceptable. The aim is high correlation and low deviation.
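
A minimal sketch of computing both metrics with SciPy and NumPy; the exact deviation statistic is a choice, so mean absolute deviation is used here, and the score arrays are illustrative:

```python
# Compare predicted scores against actual subjective scores.
# The exact deviation statistic is a choice; mean absolute deviation
# is used here, and the score arrays are illustrative.
import numpy as np
from scipy.stats import pearsonr

actual    = np.array([1.5, 2.0, 3.2, 4.1, 4.8])
predicted = np.array([1.7, 2.4, 3.0, 4.3, 4.6])

correlation, _ = pearsonr(predicted, actual)
deviation = np.mean(np.abs(predicted - actual))

print(f"correlation: {correlation:.3f}")  # aim for > 0.7
print(f"deviation:   {deviation:.3f}")    # aim for < 10% of the prediction range
```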

To ensure the accuracy of the model, both metrics should be taken into account, because it is possible to have high correlation together with high deviation, which means poor prediction accuracy. The graphs below show correlation and deviation for three different predictions. The red line shows a dataset where correlation is high and deviation is low; here the model is performing accurately. The yellow line represents a dataset with no correlation to the actual scores, where the predictions are completely wrong. The green line represents a dataset where correlation is high and deviation is also high.

[Figure: model validation dataset]
[Figure: correlation and deviation]

Key takeaways

When performing no-reference image quality assessment algorithm training, it’s important that you think about the dataset you will be using. Will you use a dataset that is already available or will you create a custom dataset that better suits the needs of the application? Keep in mind that if you do decide to create a custom dataset, it will take more of your time and resources. Therefore, make sure to allocate enough time for these activities. 

In addition, make sure evaluation events are designed following the main guidelines so you can be sure the data is reliable; unreliable evaluation data leads to low prediction model accuracy. Finally, define the target accuracy level for the prediction model, namely the minimum acceptable correlation and the maximum acceptable deviation.

Do you have an application with video capabilities and want to check the video quality? We can help. Get in touch and let’s discuss your project.
