In an age where it seems that almost everything can be measured, calculated, and evaluated as objectively as possible, subjective listening tests still hold their ground and remain a crucial part of audio quality testing. A plain listening test with a small batch of random participants does not guarantee that the results fairly represent how the general population would judge the audio quality. However, a consistent test environment and standardized evaluation methods greatly improve objectivity and deliver results that even the toughest critics can agree upon. One such method, defined in the recommendation BS.1534-3 by the ITU-R (International Telecommunication Union, Radiocommunication Sector), is the MUSHRA methodology.
What is MUSHRA?
MUSHRA is an acronym that stands for Multiple Stimuli with Hidden Reference and Anchor. It is a subjective listening test methodology that is most commonly used to evaluate the perceived sound quality of systems in various technology domains. Examples of such systems are audio codecs, headphones, speakers, apps, and other software capable of media playback.
MUSHRA was first introduced to the world in recommendation ITU-R BS.1534-0 in late June of 2001 and has been one of the most revised and well-described subjective audio quality evaluation methodologies. The version currently in force is ITU-R BS.1534-3, the third revision (fourth edition) of this recommendation, approved in October 2015.
When is it appropriate to use this testing methodology?
Now that we have accepted the importance of subjective listening tests, and know that there is a methodology to follow, we are ready to crack into conducting the tests, right? Unfortunately, no…
The often disappointing reality is that MUSHRA is not the only subjective audio evaluation method specified by ITU-R, or by other organizations. In fact, readers who have looked into the vast number of recommendations issued by ITU-R will know that there are at least a handful of subjective audio evaluation methodologies that seem the same at first glance but differ in nuances, depending on various factors. Some of these methodologies, and their differences, are given a quick look further on.
So, when is it appropriate to use MUSHRA tests?
As stated in the recommendation itself: “MUSHRA is to be used for assessment of intermediate quality audio systems” (ITU-R BS.1534-3, p.3). To better understand what is considered an “intermediate audio quality system”, yet another quote from the recommendation can be used: “…that new kinds of delivery services such as streaming audio on the Internet or solid-state players, digital satellite services, digital short and medium wave systems or mobile multimedia applications may operate at intermediate audio quality;” (ITU-R BS.1534-3, p.1).
Now, how can we determine if the system produces small, intermediate, or significant audio quality impairments?
Generally, the rule of thumb is – if the audio quality issues do not affect the listening experience of the user and are mostly perceivable only by trained experts – it is a small impairment. If the audio quality issues completely ruin the listening experience, and make the audible content hard to understand, even to an untrained person – it is a significant audio quality impairment. If the system and the audio quality behavior are positioned somewhere in between the aforementioned range – it is an intermediate audio quality impairment.
To add more context to this – in most cases, audio issues that are considered small impairments are caused by the acoustic properties or the physical limitations of the system under test, and generally do not impact the listening experience.
An illustrative example of such systems might be a speaker that produces a certain undertone or slight distortion when it is playing at a high volume level; or a noise-canceling headphone that lets through the smallest amount of background noise during playback but doesn’t affect the quality of the playback itself. In such cases, the ITU-R BS.1116-3 recommendation could be followed instead.
However, if it is expected that the system under test might produce audible defects and a noticeable drop in quality that affects the listening experience, the ITU-R BS.1534-3 recommendation should be followed and MUSHRA tests applied.
An illustrative example of systems with intermediate audio impairments could be a video conferencing service that produces audio artifacts such as pops, glitches, and occasional dropouts as the conversations are held; or audio streaming services that limit the frequency bandwidth during playback according to the network conditions.
It’s important to note that these examples are for illustrative purposes, and the categorization of impairments may vary depending on the specific context and the degree of impact they have on audio quality – each case should be approached individually.
Process of a MUSHRA Test
Firstly, a group of participants needs to be established – ITU-R indicates that a listening panel of 15 to 20 experienced listeners should be sufficient. Listeners should go through a pre-screening process according to ITU-R BS.2300-0. Essentially, this document is used as a checklist to make sure that listeners are experienced, reliable, and competent enough to identify audio quality issues and take part in the evaluation process. The listening panel takes part in a pilot test with the goal of getting acquainted with the range of test objects and audio artifacts. Only the scores of experienced listeners (according to the aforementioned document) are included in the final data analysis.
A single MUSHRA listening test trial consists of:
- A reference signal – the original test object in its purest form, usually marked and separated from the other stimuli (stimuli are whatever causes a reaction in participants; in this case, the sounds intended for evaluation).
- A hidden reference signal – the same reference signal, but anonymized and randomly mixed within the rest of the stimuli.
- At least two “anchor” signals – the first (low-range) anchor is a low-pass filtered version of the original signal with a cut-off frequency of 3.5 kHz, and the second (mid-range) anchor is a low-pass filtered version with a cut-off frequency of 7 kHz. Anchor signals must also be anonymized and mixed within the other stimuli. For scenario-specific cases, more anchor signals can be used. Refer to recommendations ITU-T G.711, G.712, G.722, and J.21 for more details.
- Up to 9 signals under test – i.e. test objects. If more than 9 signals are intended for evaluation, a blocked test design, where similar test objects are grouped in multiple blocks, may be required.
To avoid fatiguing the listeners, ITU-R recommends that the test material be 10 to 12 seconds long. The material must be neutral: it shouldn’t distract the listener from the test, and it should be deliberately chosen so that it puts the sound reproduction equipment under stress. The best approach is to use something similar to what listeners are accustomed to hearing every day – music, speech, homogeneous nature sounds, etc.
Test material is presented in trials. The number of trials depends on the number of test materials to be evaluated.
Let’s say you want to analyze and compare the quality of 5 different audio encodings. To follow the best practices, you would probably have more than 1 test sample – for this example, let’s say you want to evaluate them with 4 different songs. The first test trial would have song #1 in 5 conditions (the encodings) alongside the reference and anchors. The second test trial would have song #2 in 5 conditions, and so on and so forth until all 4 songs in all the different encodings are evaluated.
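To make the example above more concrete, here is a minimal Python sketch of how such a trial plan could be assembled. All names in it (the song and encoding labels, `build_trials`, the anchor labels) are made up for illustration and are not taken from the recommendation. The limit of 12 graded stimuli per trial is simply derived from the list above: one hidden reference, two anchors, and up to 9 signals under test.

```python
import random

# Hypothetical labels for illustration only; nothing here is defined by BS.1534-3.
SONGS = ["song_1", "song_2", "song_3", "song_4"]
ENCODINGS = ["enc_A", "enc_B", "enc_C", "enc_D", "enc_E"]
ANCHORS = ["anchor_3.5kHz", "anchor_7kHz"]

# Assumption derived from the list above: hidden reference + 2 anchors + up to 9 test signals.
MAX_GRADED_STIMULI = 12


def build_trials(songs, conditions, anchors, seed=0):
    """Return one trial per song, with the graded stimuli in anonymized, random order."""
    rng = random.Random(seed)  # fixed seed so the plan is reproducible
    trials = []
    for song in songs:
        graded = ["hidden_reference", *anchors, *conditions]
        if len(graded) > MAX_GRADED_STIMULI:
            raise ValueError("too many stimuli for one trial; consider a blocked design")
        rng.shuffle(graded)
        # The open reference is kept apart from the graded stimuli.
        trials.append({"song": song, "reference": "reference", "stimuli": graded})
    return trials


trials = build_trials(SONGS, ENCODINGS, ANCHORS)
for trial in trials:
    print(trial["song"], trial["stimuli"])
```

With 5 encodings and 4 songs this gives 4 trials, each with 8 graded stimuli next to the open reference. In a real test the presentation order would typically be randomized per listener as well, which this sketch leaves out.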
Keep in mind that, although variation in sound reproduction states (mono, stereo, etc.) is encouraged, each reproduction state should be presented separately.
Grading Logic + Tool
The listeners evaluate all stimuli according to the continuous quality scale (CQS) – a vertical scale, ranging from 0 to 100, that is divided into five equal intervals (starting from the bottom: Bad, Poor, Fair, Good, Excellent). The objective is to listen to the stimuli and assign the MUSHRA points score to them. The samples can be played multiple times and A/B listening (comparing to the reference sample or between each other) is allowed.
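As a small illustration of the scale, the sketch below maps a MUSHRA score to its interval label. The function name `cqs_label` is hypothetical, and the way boundary values are assigned (for example, a score of exactly 20 counting as "Poor") is an assumption made for this example rather than something taken from the recommendation.

```python
CQS_LABELS = ["Bad", "Poor", "Fair", "Good", "Excellent"]


def cqs_label(score):
    """Map a 0-100 score to one of the five equal CQS intervals."""
    if not 0 <= score <= 100:
        raise ValueError("MUSHRA scores must lie between 0 and 100")
    # Five equal 20-point intervals; a score of exactly 100 is kept in the top interval.
    # Assumption: a boundary value such as 20 falls into the upper interval.
    return CQS_LABELS[min(int(score // 20), 4)]


for score in (12, 35, 55, 79, 100):
    print(score, cqs_label(score))
```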
Since this is mostly a visual process, grading software is commonly used. Ideally, the grading software should have a graphical user interface that consists of an array of stimuli, where each sample has a button to select the sample and a vertical slider used for grading, with the score in MUSHRA points visible above the slider. The GUI also has play/pause buttons and, occasionally, a loop area where a specific segment from the sample can be highlighted and played.
The listener rates the stimuli based on various sound attributes and characteristics. Each sound reproduction state has different sound attributes, but all of them share one sound attribute which should always be taken into account by default – the basic audio quality. According to ITU-R BS.1284: “It includes, but is not restricted to, such things as timbre, transparency, stereophonic imaging, spatial presentation, reverberance, echoes, harmonic distortions, quantization noise, pops, clicks, and background noise.” (ITU-R BS.1284, p.4)
Aside from the basic audio quality attribute, here is a detailed list of the grading criteria for different sound reproduction state systems (see ITU-R BS.1534-3):
- Monophonic system – Basic audio quality: This single, global attribute is used to judge any and all detected differences between the reference and the test object.
- Stereophonic system – Basic audio quality and, additionally,
  - Stereophonic image quality: This attribute is related to differences between the reference and the test object in terms of sound image locations and sensations of depth and reality of the audio event.
- Multichannel system – Basic audio quality and, additionally,
  - Front image quality: This attribute is related to the localization of the frontal sound sources. It includes stereophonic image quality and losses of definition.
  - Impression of surround quality: This attribute is related to spatial impression, ambiance, or special directional surround effects.
- Advanced sound system – Basic audio quality and, additionally,
  - Timbral quality:
    - The first set of timbral properties is related to the sound color, e.g. brightness, tone color, coloration, clarity, hardness, equalization, or richness.
    - The second set of timbral properties is related to sound homogeneity, e.g. stability, sharpness, realism, fidelity, and dynamics.
  - Localization quality: This attribute is related to the localization of all directional sound sources. It includes stereophonic image quality and losses of definition. This attribute can be separated into horizontal localization quality, vertical localization quality, and distant localization quality. In the case of the test with accompanying picture, these attributes can be also separated into localization quality on the display and localization quality around the listener.
  - Environment quality: This extends the attribute of surround quality. This attribute is related to spatial impression, envelopment, ambiance, diffusivity, or spatial directional surround effects. This attribute can be separated into horizontal environment quality, vertical environment quality, and distant environment quality.
Although engaging in physical listening tests is exciting, arguably, the more important part of MUSHRA tests is the analysis phase. With all the gathered data from the listening trials, it is necessary to apply methods of statistical analysis to get the best sense of tendencies and form valid conclusions.
While it is generally advised to adopt specific analysis methods for specific scenarios, without going into too much detail about each method, here is a list of the most commonly used analysis methods:
- Assessor post-screening – This is less an analysis method than a mandatory step of a MUSHRA test that contributes to data analysis. It essentially limits the number of data outliers by excluding from the aggregated responses:
  - Assessors who rated the hidden reference with a score lower than 90 for more than 15% of the test items.
  - Assessors who rated the mid-range anchor with a score higher than 90 for more than 15% of the test items.
- Assessor analysis with eGauge – This is an analysis model that evaluates the reliability and discrimination of an assessor as well as the agreement rating between a listener and the rest of the panel.
- Raw data visualization – This is the initial step to the test data analysis and could be categorized as an exploratory analysis method. It may incorporate the use of Histograms with a fitting curve for a normal distribution, Box Plots, or Quantile-Quantile plots (Q-Q plots). The goal of this method is to ensure that there are no obvious data outliers and to ensure overall data quality.
- Analysis of Variance (ANOVA) – In essence, the purpose of ANOVA is to compare the mean values of test conditions and determine the significant differences. Although the steps may vary when choosing different variations of the ANOVA model, it usually involves calculating mean opinion scores (MOS), formulating the null and alternative hypotheses, determining the threshold for statistical significance, and calculating the F-statistic and p-value. For a detailed step checklist, please refer to chapter 9.3, page 16 of ITU-R BS.1534-3 recommendation.
- Post-hoc tests – These tests are a follow-up activity to the ANOVA analysis in cases where ANOVA shows that a factor has a significant effect. Generally, these tests are used to gain additional insight into specific pairwise differences between the test conditions. There may also be cases where there is no need for post-hoc tests.
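The post-screening rule from the list above can be sketched in a few lines of Python. The data layout (one dictionary per test item with `hidden_reference` and `mid_anchor` scores) and the function name `should_exclude` are assumptions made for this example; only the thresholds of 90 points and 15% come from the rule itself.

```python
def should_exclude(item_ratings, threshold=90, max_fraction=0.15):
    """Return True if an assessor fails the post-screening rule.

    item_ratings: one dict per test item, holding the assessor's scores for
    the hidden reference and the mid-range anchor (hypothetical layout).
    """
    n_items = len(item_ratings)
    if n_items == 0:
        raise ValueError("no ratings to screen")
    # Items where the hidden reference was scored below the threshold.
    low_reference = sum(1 for r in item_ratings if r["hidden_reference"] < threshold)
    # Items where the mid-range anchor was scored above the threshold.
    high_anchor = sum(1 for r in item_ratings if r["mid_anchor"] > threshold)
    return (low_reference / n_items > max_fraction
            or high_anchor / n_items > max_fraction)


# Made-up ratings: 10 test items per assessor.
reliable = [{"hidden_reference": 100, "mid_anchor": 45}] * 9 + [
    {"hidden_reference": 85, "mid_anchor": 50}
]
unreliable = [{"hidden_reference": 100, "mid_anchor": 40}] * 8 + [
    {"hidden_reference": 70, "mid_anchor": 95}
] * 2

print(should_exclude(reliable))    # one slip in ten items is 10%, within the limit
print(should_exclude(unreliable))  # two slips in ten items is 20%, above the limit
```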
Keep in mind that certain aspects of statistical analysis might appear daunting to someone with little experience in that field. Furthermore, performing manual calculations might not always be practical, even for people with experience. Given how many software solutions for statistical analysis are available (across all price ranges), it is a good idea to delegate as much of the analysis process as possible to such tools.
Conducting tests is pointless if nothing tangible is gained from them. For that reason, recommendation ITU-R BS.1534-3 has quite a comprehensive description of what to include in the report, how to interpret the results, and what should be emphasized.
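To give a feel for what such software computes, here is a deliberately simplified sketch of the F-statistic of a one-way ANOVA, written in plain Python. It is an illustration only: it treats all scores as independent and ignores the fact that in a MUSHRA test every assessor rates every condition, which a proper analysis would account for. It also stops at the F-statistic, since the p-value requires the F-distribution from a statistics package. The function name and the scores are made up.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA over lists of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the condition means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own condition mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("no within-condition variation; F is undefined")
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up scores for three conditions, three ratings each.
scores = {
    "enc_A": [80, 85, 90],
    "enc_B": [60, 65, 70],
    "enc_C": [40, 45, 50],
}
f_stat, df_b, df_w = one_way_anova_f(list(scores.values()))
print(f_stat, df_b, df_w)
```

For these made-up scores the condition means are 85, 65, and 45, and the F-statistic works out to 48 with 2 and 6 degrees of freedom. A statistics package would then turn that into a p-value and compare it against the chosen significance threshold.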
First and foremost, the report must convey the rationale for the study, the methods used, and the conclusions drawn. As such, it is highly recommended to prepare the descriptions of the test objective, test materials, and test design prior to the evaluation process – the level of quality in the definition of the test ultimately dictates the quality of the test results and reproducibility.
The report must also include any deviations from the standard procedure, along with an explanation of all countermeasures taken to compensate for them.
Secondly, all applicable analysis methods must be followed up with the correct visualization of data. Each analysis method has its own visual representation logic, but most often, the data will be visualized with column charts and various plots. It is important to present the visualizations in a user-friendly way so that, regardless of the reader’s experience, it’s easy to get the relevant information.
After that, it is good to compose an interpretation of the data, noting down all the observations, affirming or denying the hypotheses, forming an opinion, or giving a recommendation – essentially, concluding the test. It is best to present this section from as many perspectives as possible, for example from a naive or expert listener’s perspective, or from a cost-saving or practicality standpoint, while also making sure to inform the reader of any risks associated with the outcomes of the test.
Lastly, a summary of the test environment is given. It covers the equipment and software used for the tests, as well as the specification of the listening premises: whether the room complies with any standards that are relevant to the test, and attributes such as area, window, door, and furniture count, and wall or isolation materials. It is also advised to describe the acoustic properties of the environment, such as background noise level and reverberation time.
Similar Test Methods
Now that we are, hopefully, well-informed about the MUSHRA test methodology, for the sake of curiosity, a small glimpse into similar test methods can be taken! The following is a small list of analogous testing methods:
- ITU-R BS.1116-3 – The methodology outlined in this recommendation addresses the evaluation of systems with small audio impairments. The overall process is similar to the one in the MUSHRA methodology, but the evaluation logic differs: the tests utilize a different grading scale and the listeners are only presented with 3 stimuli. This recommendation also has stricter guidelines on the permissible test environment and other factors, which is why it is used as a base for other methodologies.
- ITU-R BS.2132-0 – The methodology outlined in this recommendation mirrors many aspects of the MUSHRA method but excludes the presentation of reference and anchor signals. This method can be used in cases where it is not important or not possible to compare stimuli to the reference. In addition to overall subjective sound quality, the listeners also rate the severity of pre-determined sound attributes.
- ITU-R BS.2126-1 – The methodology outlined in this recommendation addresses the evaluation of sound systems with an accompanying picture. This method goes hand in hand with the ITU-R BS.1116-3 recommendation, yet the main emphasis is put on assessing the spatial and time properties of a sound when it is combined with a video.
Taking everything into account, the ITU-R BS.1534-3 recommendation outlines a robust method for approaching subjective audio evaluations.
Notice how some steps in this blog post were described rather loosely – the truth is that the MUSHRA test methodology is relatively flexible. Everything depends on the objective of the tests.
- Do you wish to have a larger listening panel than recommended? – You are more than welcome!
- Do you wish to tailor the listening conditions for a specific context? – Certainly doable!
- Do you wish to apply non-parametric or any other analysis method to the data? – You can!
As long as the rest of the base principles are followed and all the deviations from the standard procedure are documented and presented in a way where the test results could, hypothetically, be reproduced by someone on the other end of the globe, the approach would be absolutely valid.
There is no doubt that objective audio tests are important and, moreover, are proven to be efficient and effective, but at the end of the day, the thing we are trying so hard to perfect is still going to be perceived by humans. Knowing that, it only makes sense that the desired object goes through a couple of pairs of trained ears before it reaches a wider audience. That said, both subjective and objective audio tests have their place and should be used in synergy to achieve the best audio quality.