Audiovisual quality of live music streaming over mobile networks using MPEG-DASH

The MPEG-DASH protocol has been rapidly adopted by most major network content providers and enables clients to make informed decisions in the context of HTTP streaming, based on network and device conditions and the available media representations. A review of the literature on adaptive streaming over mobile networks shows that most emphasis has been on adapting the video quality, whereas this work examines the trade-off between video and audio quality. In particular, subjective tests were undertaken for live music streaming over emulated mobile networks with MPEG-DASH. A group of audio/video sequences was designed to emulate varying bandwidth arising from network congestion, with varying trade-offs between audio and video bit rates. Absolute Category Rating was used to evaluate the relative impact of both audio and video quality on the overall Quality of Experience (QoE). One key finding from the statistical analysis of Mean Opinion Score (MOS) results using Analysis of Variance is that reducing audio quality has a much lower impact on QoE than reducing video quality under similar total bandwidth conditions. This paper also describes an objective model for audiovisual quality estimation that combines the outcomes from audio and video metrics into a joint parametric model. The correlation between predicted and subjective MOS was computed using several performance measures (Pearson and Spearman correlation coefficients, Root Mean Square Error (RMSE) and epsilon-insensitive RMSE). The obtained results indicate that the proposed approach is a viable solution for objective audiovisual quality assessment in the context of live music streaming over mobile networks.


Introduction
Although a relatively recent development, the evolution and penetration of HTTP Adaptive Streaming (HAS) has been rapid in recent years. This has been driven by a very strong commercial case, as evidenced by the proprietary solutions initially developed by Apple, Adobe and Microsoft. The common objective across these solutions was to provide a media consumption platform that piggy-backed on existing web infrastructure, and that was client-driven and adaptive. This allowed the client to make informed decisions based on a range of factors, principally real-time network characteristic estimates, user device type/capabilities, and client preferences. The backend server provides the media for consumption, divided into short chunks of a few seconds and encoded into multiple representations. The server also provides metadata on its stored media, both at a semantic level (e.g. genre) and physical level (e.g. media structure/formats/bit rates/video frame rates, etc.), in the form of a Media Presentation Description file. The client first pulls this file and makes decisions based on it and the other variables listed above. Such a model fits very well with best-effort Internet infrastructure and maps well to user demands to consume media on a wide variety of devices under differing scenarios. The proliferation of these proprietary solutions to meet user needs, and the resulting interoperability challenges, necessitated work on standardization, culminating in the release of the MPEG-DASH (Dynamic Adaptive Streaming over HTTP) standard in 2011 [37,42]. With YouTube and Netflix as key adopters, it has received huge support and adoption rates. Consequently, HAS has been the subject of very significant research that has examined the many variables that make up the full system, and their interaction. A key objective of much of this research is the need to maximize the end-user Quality of Experience (QoE).
In this research, the QoE in a mobile network scenario is studied, considering the particular case of live music streaming. The limiting situation of streaming a concert to mobile devices was chosen, specifically because it represents a very special case where bandwidth limitations can easily appear. The specific influence of both audio and video content quality on the global audiovisual quality perceived by the end user is explored. One of the main goals is to pinpoint possible trade-off strategies, which might provide an alternative to typical MPEG-DASH behaviour, where video holds the dominant role in bandwidth management. In this scenario, the effects of stalling, delay or latency are not considered, as they have already been extensively studied in the past [4,27,31,43]. This paper establishes a methodology for audiovisual quality evaluation of live music concert streaming based on the subjective evaluation initially presented in [35]. Trade-off strategies for bandwidth allocation under congested network conditions are derived using MPEG-DASH technology, by providing an effective single-valued measure of overall content quality.
The remainder of the paper is organized as follows. In the following section, the related work and motivations for this research are discussed. In Section 3, the proposed framework for QoE evaluation and estimation for live music concert streaming using MPEG-DASH, over mobile networks, is described. Both subjective and objective quality assessment methodologies are described within this section, as well as the proposed audiovisual quality estimation models. Section 4 covers results analysis and discussion. Subjective test scores are presented and analysed in detail using both one-way and two-way Analysis of Variance (ANOVA) tests, whereas the performance of the quality estimation models is assessed by their correlation with the obtained subjective QoE. Finally, Section 5 provides the final conclusions, as well as future work considerations.

Background & related work
With the huge growth in multimedia traffic over best-effort IP networks in recent years, significant research has been undertaken in both subjective and objective assessment of multimedia quality as perceived by the end user. However, most studies to date have focused on individual modalities, i.e. audio and video separately. This has resulted in relatively mature and well researched subjective approaches and objective metrics. The subjective approaches include those defined in ITU-T Rec. P.910 [15] and ITU-R Rec. BT.500-13 [17] for video quality, those defined in ITU-R Rec. BS.1116-3 [20] and BS.1534-3 [21] for audio quality and, covering both modalities, those defined in ITU-T Rec. P.911 [13] and ITU-T Rec. P.913 [22] for audiovisual quality. The latter is primarily focused on audiovisual device performance in multiple environments, as well as the quality impact of multiple devices. The objective quality metrics for audio include PEAQ (Perceptual Evaluation of Audio Quality) [14], POLQA Music (Perceptual Objective Listening Quality Assessment) [19,34], and ViSQOL Audio (Virtual Speech Quality Objective Listener) [10]. For video, a whole range of quality metrics exists, such as PEVQ (Perceptual Evaluation of Video Quality) [16], VQM (Video Quality Metric) [32], ST-MAD (Spatiotemporal Most-Apparent Distortion model) [46], MOVIE (Motion-based Video Integrity Evaluation) [36], ST-RRED (Spatiotemporal Reduced Reference Entropic Differences) [38], and FLOSIM [28], among others. It is also common to adapt image quality metrics, such as PSNR (Peak Signal-to-Noise Ratio) and MS-SSIM (Multi-scale Structural Similarity index) [48], using the average of frame-wise measurements.
A recent survey on HAS QoE estimation models may be found in [1]. Many of these approaches rely only on video quality measures or take into account only video-related impairments. Tran et al. [43] studied a multi-factor model for quality prediction in HAS over mobile networks. The proposed QoE model relied on three different video-related factors, based on the quality switches, interruptions/stalling and initial delay. In [44], the authors proposed a cumulative quality model, based on quality variation histograms computed within a sliding window of video segments. Duanmu et al. [6] proposed an approach based on the Expectation-Confirmation Theory, in which the instantaneous QoE is evaluated by comparing the intrinsic quality of the current segment with that of the previously viewed segments. The intrinsic quality of a given segment considers both spatial (coming from video quality metrics) and temporal (frame rate) information. More recently, machine learning approaches are also being considered. The authors in [45] used a Long Short-Term Memory (LSTM) network to estimate the overall QoE in the context of adaptive streaming, using input features such as the content-specific characteristics, occurrence and duration of stalling events, and segment quality measure.
Subjective tests have clearly shown that there is a strong inter-relationship between audio and video quality [2], and thus research has progressively focused on developing combined audiovisual models. The authors in [33] focused on the relative importance of audio and video quality in audiovisual quality assessment and questioned whether a generally applicable regression model predicting audiovisual quality can be devised. On the basis of a comprehensive analysis of the available experimental data, covering application areas ranging from television and UDP-based video streaming to video-teleconferencing, they concluded that audio quality and video quality are equally important in the overall audiovisual quality. Moreover, the application dictates the relative range of the audio and video quality examined, which can result in findings that suggest that one factor has greater influence than the other. This research aims to add to that knowledge base by designing such a joint model for the particular scenario of live music streaming deploying MPEG-DASH.
In [52], a review of audio and video quality metrics is presented, as well as a study of the key issues in developing joint audiovisual quality metrics. In particular, it outlines the common approach to deriving audiovisual quality (AV_Q) from the audio quality (A_Q) and visual quality (V_Q) as follows:

AV_Q = a_0 + a_1 A_Q + a_2 V_Q + a_3 A_Q V_Q    (1)

where parameters (a_1, a_2, a_3) denote the weights of the audio quality, the video quality and the multiplicative factor (A_Q V_Q), respectively, with a_0 as a residual term. Despite this seemingly simple formulation, deriving such a model is a significant challenge with many influences and contextual factors. For example, in [8], two experiments were carried out in order to develop a basic multimedia (audiovisual) predictive quality metric. The first used two 'head-and-torso/shoulder' audiovisual sequences, while the second deployed one of the 'head-and-torso/shoulder' sequences from the first experiment together with a different high-motion sequence as test material. While the overall result of the studies confirmed that human subjects integrate audio and video quality using a multiplicative rule, the specific results differed. A regression analysis using the subjective quality test data from each experiment found that: 1. For 'head-and-torso/shoulder' content, both modalities contribute significantly to the predictive power of the resultant model, although audio quality is weighted slightly higher than video quality; 2. For high-motion content, video quality is weighted significantly higher than audio quality.
Finally, two different parametric audiovisual quality estimation models were designed using the subjective quality test data acquired within that research: a final 'head-and-torso' regression model and a high-motion regression model. It is worth noting that this study considered neither impairments introduced by UDP-based video streaming nor those introduced by TCP-based video streaming such as MPEG-DASH in the subjective tests, and thus these are not reflected in the model development.
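The weighted-sum audiovisual model with a multiplicative audio-video term discussed above can be fitted by ordinary least squares. Below is a minimal sketch on synthetic quality scores; all data and coefficient values are illustrative, not taken from the cited experiments:

```python
import numpy as np

# Synthetic, illustrative data (not from the cited experiments):
# normalized audio quality (A_Q), video quality (V_Q) and audiovisual MOS.
rng = np.random.default_rng(0)
a_q = rng.uniform(0.0, 1.0, 50)
v_q = rng.uniform(0.0, 1.0, 50)
mos = 0.1 + 0.2 * a_q + 0.5 * v_q + 0.2 * a_q * v_q  # noiseless ground truth

# Design matrix for AV_Q = a0 + a1*A_Q + a2*V_Q + a3*A_Q*V_Q.
X = np.column_stack([np.ones_like(a_q), a_q, v_q, a_q * v_q])
coeffs, *_ = np.linalg.lstsq(X, mos, rcond=None)
print(coeffs)  # recovers [0.1, 0.2, 0.5, 0.2] on this noiseless data
```

The fitted coefficients then directly indicate the relative weight given to each modality and to their interaction.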
Recently, ITU-T SG12 has completed its work on the work item entitled P.NATS (Parametric non-intrusive assessment of TCP-based multimedia streaming quality), considering adaptive streaming, resulting in a set of recommendations, i.e. ITU-T Rec. P.1203 [23], ITU-T Rec. P.1203.1 [26], ITU-T Rec. P.1203.2 [24] and ITU-T Rec. P.1203.3 [25]. The aim was to develop a collection of objective parametric quality assessment modules that predict the impact of observed IP network impairments on the quality experienced by the end user in multimedia mobile streaming and fixed network applications using progressive download, also including adaptive streaming methods.
In [51], the authors have applied a parametric model based on the approach proposed in the P.NATS for HAS end-user quality estimation. Quality assessment took into account both audio and video bitrate, as well as content length information. Video resolution and stalling events were also input factors. Martinez and Farias proposed in [29] a parametric approach for audiovisual quality estimation, which focuses on RTP-based streaming. In this paper, subjective and objective quality was assessed for different quality levels of audio and video, with constant bitrates. A QoE estimation model was proposed, considering different combinations of audio and video quality metrics.
Considering the previously proposed QoE estimation methods mentioned in this section, it should be noted here that none of them was particularly designed for live music streaming applications.
For subjective quality assessment of HAS and QoE impact factors, comprehensive reviews may be found in [7] and [1]. Most of the published HAS-related subjective studies follow a strong tendency to focus only on the influence of video impairments on the perceived audiovisual quality. Some examples are covered below. In [27], the authors describe a subjective study, which relies on both spatial and temporal quality factors to derive a QoE measure for video HAS. An extensive set of test conditions was created considering temporal factors such as the initial delay and stalling (total duration vs. frequency) and spatial quality of the video content (quality level variations). In this particular case, the average quality level, the number of switches and the average magnitude of the switches were taken into account. The predicted Mean Opinion Score (MOS) provided by the developed user experience model shows a high linear correlation with subjective test results (0.91). A study of the correlation between QoE and Quality of Service (QoS) for an HTTP video streaming scenario is presented in [31]. A set of performance metrics was used, considering both buffering-related parameters (initial delay, stalling duration and frequency) and video quality switches. Among the drawn conclusions, it is stated that the temporal structure has a prominent impact on the QoE, with the rebuffering frequency being identified as the main factor affecting MOS. The influence of several factors on the QoE of video streaming over HTTP was studied in [11], through crowdsourcing subjective tests. Besides other relevant conclusions, the results clearly identify stalling events as dominant in the quality perceived by the end user. Vriendt et al. [5] evaluated the performance of a number of parametric quality prediction models for adaptive video streaming to mobile devices.
Subjective tests were carried out using 2 different clips, from which 90 different test conditions were obtained, considering video quality switches between 6 different quality levels. The model parameters were derived using different characteristics of the test samples: nominal bitrate, quality level, PSNR and SSIM (Structural Similarity index) average and standard deviation, and chunk MOS (MOS per quality level). The obtained results indicate that the chunk MOS approach provides the best correlation with the subjective MOS, followed by the averaged SSIM. The authors in [41] studied the perceptual impact of quality adaptation strategies and content on the perceived quality of video streaming. A wide range of study cases was created by combining different temporal video bitrate dynamics, initial bitrate conditions, chunk sizes and visual content. The reported results indicate a significant preference for gradual quality changes over long chunks (10 seconds). In [12], the Absolute Category Rating (ACR) methodology was used to evaluate QoE in relation to the scaling dimensions of High Efficiency Video Coding (HEVC/H.265), by varying the frame rate, spatial resolution and codec quantization parameter. Takahashi et al. [39] analysed the impact of the average video bitrate, stalling and the initial loading delay on the cumulative quit rate of users on smartphones with full HD resolution, who were allowed to freely search and change between videos under varying network conditions. The influence of audio presence was investigated by Tavakoli et al. [40], following an evaluation of the video-related impairments previously studied in [41]. This study shows that audio has only a minor impact (a Pearson correlation coefficient of 0.93 between Audio and No audio tests was reported) on the overall quality perceived by the end user, assessed according to the methodology defined in ITU-T Rec. P.910 [15].
Moreover, when it comes to quality adaptation strategies, the correlation between the MOS obtained for a whole sequence and the MOS for the processed sub-sequences was consistently lower when audio was involved in the test.
When it comes to optimizing bandwidth utilization, [50] describes the EnvDASH system, an environment-aware adaptive streaming client based on MPEG-DASH that adapts the quality of audiovisual content according to the viewing and listening conditions, each sensed separately, as well as the user's interest in the content. This is done in order to reduce network traffic generated by the corresponding streaming service or application in situations where the user is not able to fully enjoy high quality video and audio, e.g. while travelling over rough terrain. According to the experiment presented in the paper, a 5.3% bandwidth saving was achieved with the proposed system over all the subjects/users involved in the experiment.
In the available literature, no study exists that explicitly deals with the impact of audio quality, and more specifically the trade-off in relative bandwidth utilization, on the audiovisual quality experienced by the end user in the context of HAS. Such insights may be very useful for TV broadcasters and video content delivery providers, such as Netflix, YouTube, Amazon, and Hulu, that are interested in optimizing their client-side quality adaptation strategies. Such insights can inform decisions about the range of both audio and video content quality rendered, so as to provide the end user with the best quality possible considering the mix of corresponding network conditions, user device capabilities, and user preferences. It is worth noting here that, with very few exceptions, quality adaptation strategies have so far focused only on adapting the quality of the video content.
Thus, this research deals with the combined effect of varying the quality level of both audio and video content on the audiovisual quality experienced by the end user, in the context of HAS, while considering the particular case of live music concert streaming. To do so, a subjective test has been run according to ITU-T Rec. P.911 [13], simulating a live music concert transmitted over a mobile network with varying congestion levels. In terms of content, recorded live music performances were deployed, as this constitutes a very common use-case scenario. Moreover, this scenario represents a good example of a situation whereby the quality of audio should play a crucial role, i.e. a music concert. Insights arising from this study will allow HAS content providers to optimize the use of limited bandwidth in terms of the trade-off between video and audio. Moreover, on the basis of the subjective quality test data presented in this paper, a parametric model was designed to estimate the audiovisual quality experienced by the end user in the context of recorded live music streaming deploying MPEG-DASH. A conceptual diagram of the proposed approach is depicted in Fig. 1.

Source videos and impairment design
Live music performances from two different bands, U2 and Pink Floyd, were ripped from DVD to provide source content. There is a clear differentiation between the content from the two bands. In U2 videos, there is constant movement involving fast camera and light changes. On the other hand, Pink Floyd videos have less on-stage movement and both camera and light changes are, generally, slower. In terms of audio, U2 videos have a lot more interference from the audience. The source videos were resized to 480p (854x480), a standard definition commonly deployed in mobile streaming [3]. It is important to note that the initial source content had a spatial resolution of 720x576 and was resized using the FFmpeg software to match a 16:9 aspect ratio screen, which is the case of the mobile device used for testing. Video resizing and upscaling would also be done automatically by the mobile device in a real-life situation.
The video frame rate was 25 frames per second. Four 1-minute-long sequences were selected (Fig. 2) and cut into 10-second chunks, according to the results reported in [40], representing a typical DASH chunk size deployed by popular streaming services, e.g. Apple HTTP Streaming. FFmpeg software was used to encode the demuxed video and audio at different compression rates. Audio chunks were encoded using the High Efficiency Advanced Audio Coding v2 (HE-AAC v2) scheme [9] at two different quality levels, i.e. 128 kbps and 24 kbps. 128 kbps is a common bitrate in audio experiments, extensively used in audiovisual content streaming, which delivers high quality audio. To introduce distortions that could affect audio quality, the low end of the HE-AAC v2 range of operation was chosen as the low quality level (24 kbps). It should be noted here that stereo audio signals were used in this experiment. Video chunks were encoded with the H.264/AVC video coding standard [49] at three quality levels (H: 512 kbps, M: 256 kbps and L: 128 kbps), which are within the range deployed in [5], with a spatial resolution of 854x480. These bitrates represent multiples of the high quality audio bitrate, i.e. 128 kbps, in order to create a balance in terms of the audio and video quality perceived by the end user, allowing us to study a trade-off between video and audio bitrates in the selected video streaming context. It should be noted that audio and video were synchronous in all experiments, as synchronization was not the aim of this study. Fig. 3 depicts the different impairment cases created through the concatenation of the diverse encoded streams into 1-minute-long mp4 files (6 x 10 sec). These impairments involve different trade-offs between audio and video quality levels, simulating diverse network congestion situations. Audio-only degradation is simulated in case 2, while video-only degradation is represented by cases 3 and 4.
Simultaneous degradation of audio and video is simulated in the remaining cases (1, 5 and 6). As is evident, cases 1 and 5 are similar to cases 3 and 4, respectively, in terms of requested bit rate. Case 6 includes the sequences with the lowest total bandwidth level.
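The encoding ladder described above (audio at 128/24 kbps HE-AAC v2, video at 512/256/128 kbps H.264/AVC at 854x480, 25 fps) could be produced with FFmpeg along the following lines. The sketch below only builds the command strings; the encoder names and flags (e.g. libfdk_aac for HE-AAC v2) are illustrative assumptions depending on the FFmpeg build, not the authors' actual command lines:

```python
# Sketch of the encoding ladder as FFmpeg command lines.
# Encoder names/flags are illustrative assumptions, not the paper's commands.
AUDIO_KBPS = [128, 24]        # HE-AAC v2 audio quality levels
VIDEO_KBPS = [512, 256, 128]  # H.264/AVC quality levels (H, M, L)

def audio_cmd(src: str, kbps: int) -> str:
    return (f"ffmpeg -i {src} -vn -c:a libfdk_aac -profile:a aac_he_v2 "
            f"-b:a {kbps}k audio_{kbps}k.mp4")

def video_cmd(src: str, kbps: int) -> str:
    return (f"ffmpeg -i {src} -an -c:v libx264 -b:v {kbps}k "
            f"-s 854x480 -r 25 video_{kbps}k.mp4")

commands = ([audio_cmd("chunk.mp4", k) for k in AUDIO_KBPS]
            + [video_cmd("chunk.mp4", k) for k in VIDEO_KBPS])
for c in commands:
    print(c)
```

The impairment cases of Fig. 3 would then be obtained by concatenating different combinations of these per-chunk representations into the 1-minute test files.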
Most of the previous research in the context of HAS considers larger datasets of streaming sessions or impairments in their design, e.g. [5,27], which is necessary to draw general conclusions that cover a broad scope of scenarios. In our research, a more specific use-case of HAS, i.e. live music concert streaming to mobile devices, is considered, with a focus on both audio and video quality. Regarding the experiment design, the number of test conditions involved in the test had to be limited in order to avoid participant fatigue in the subjective test (described in the next section), while maintaining reasonable content diversity. Taking into account the extensive studies considering different encoding scenarios and content, we believe that the final range of test conditions leads to representative, plausible and interesting combinations, both in terms of spatial and temporal information.

Test methodology
A single-stimulus study was conducted at the Image and Video Technology Laboratory, Universidade da Beira Interior (UBI) [35]. The study followed the ITU-T standard on subjective audiovisual quality assessment methods for multimedia applications (ITU-T Rec. P.911) [13], which recommends a minimum of 15 participants in order to obtain statistically reliable data. A total of 32 subjects participated in this study, consisting mostly of students from UBI, of whom 21 were male, with ages ranging between 18 and 35 years (mean 24 years), and 11 were female, with ages ranging between 18 and 22 (mean 20 years). Subjects were selected to best represent the target end-user group of live music streaming services.
Test sessions, with an average duration of 20 minutes, were carried out in a controlled environment. Subjects were given LG Nexus 5 smartphones (quad-core, 4.95" screen with a resolution of 1920x1080) and stereo headphones (Philips SL3060). The experiment was run using an Android app designed specifically for this purpose. The app provided full-screen visualization of the test sequences, as well as a rating screen (Fig. 4) presented after each visualization, which included a calibrated bar for a nine-level ACR. As in many real applications, the test content resolution was smaller than the display resolution, and thus automatic resizing was applied to allow full-screen visualization.
Hidden, or non-explicit, references were included in all sessions of the subjective test. The experimental setup did not support the original quality references, due to memory and processing power limitations of the mobile devices deployed in the test. Hence, the in-test references consisted of non-distorted sequences, in the sense that both audio and video were kept at the maximum quality levels among the available representations (A: 128 kbps, V: 512 kbps). It should be noted here that both audio and video distortions carried by the in-test reference sequences were unnoticeable to the expert test persons involved in preliminary tests, using the mobile device test setup.
There were a total of 24 different impaired sequences involved in the test set (4 different sequences with 6 impairments per sequence) plus the 4 reference versions (hidden references). Considering the relatively long duration of each sequence, a given session included only half of the entire test set to prevent user fatigue and avoid the consequent bias in the results. The test design ensured that each impaired sequence was viewed the same number of times, i.e. 16. Each subject thus attended a single session, in which 12 impaired sequences were randomly presented plus the 4 reference sequences. The actual test session was preceded by 2 training sequences, not included in the test set, which reproduced similar impairments for different sequences of the available content, to adapt the subject to the viewing conditions and context of the test.
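The stated viewing balance follows from simple counting: 32 subjects each rating half of the 24 impaired sequences yields 384 ratings in total, i.e. 16 per impaired sequence. A trivial sanity check of this design:

```python
# Sanity check of the session balance described above.
subjects = 32
impaired_sequences = 4 * 6              # 4 source sequences x 6 impairment cases
per_session = impaired_sequences // 2   # each session covers half of the test set

total_ratings = subjects * per_session          # 32 x 12 = 384
views_per_sequence = total_ratings // impaired_sequences
print(views_per_sequence)  # 16, matching the stated design
```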
With the objective of studying the impact of audio and video distortions on global quality perception, subjects were explicitly prompted to score each test sequence according to their perceived global quality, i.e. taking both audio and video quality into consideration. In this study, the resulting subjective quality for each tested sequence is represented by MOS, which is commonly deployed in HAS studies [5,11,27,31,40,41]. Analysing the differential MOS (DMOS) might help reduce the content dependency of the results and improve the discrimination power of the test. However, it is worth noting here that it was not possible to use DMOS, commonly used in video quality studies, as it was not feasible to obtain subjective quality scores for the original quality sequences. The MOS of each test sequence was compared individually with the MOS of the respective reference, using one-way ANOVA tests. Moreover, individual modality comparisons (i.e. audio-only or video-only impairment sequences) were also studied with ANOVA, to obtain insights on the actual impact of each modality distortion on the global quality perceived by the end user.
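One-way ANOVA comparisons of this kind can be reproduced with standard statistical tools. A minimal sketch using scipy on synthetic nine-level ACR scores (the data below are illustrative, not the actual test results):

```python
import numpy as np
from scipy.stats import f_oneway

# Illustrative nine-level ACR scores for two conditions (synthetic data):
# a hidden reference vs. an impaired case, 16 ratings each, as in the design.
rng = np.random.default_rng(1)
ref_scores = np.clip(rng.normal(7.5, 1.0, 16), 1, 9)
impaired_scores = np.clip(rng.normal(5.0, 1.0, 16), 1, 9)

f_ratio, p_value = f_oneway(ref_scores, impaired_scores)
# At a 95% confidence level, p < 0.05 indicates a statistically
# significant MOS difference between the two conditions.
print(f_ratio, p_value)
```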

Audio and video quality metrics
Objective quality of the test sequences was measured using a set of 6 video quality metrics and 2 audio quality metrics. The metrics chosen for video quality assessment were PSNR, SSIM [47] and MS-SSIM [48], using the averaged frame-by-frame output [5], ST-MAD [46], ST-RRED [38], and VQM [32]. MOVIE [36] and FLOSIM [28] were also considered; however, both metrics ran excessively slowly and did not provide reasonable predictions with the used dataset. For audio quality metrics, POLQA Music [19,34] and ViSQOL Audio [10] were used. At an initial stage, the PEAQ model, standardized as ITU-R Recommendation BS.1387, was also involved. However, PEAQ failed to provide reasonable predictions, perhaps due to a varying delay/clock drift present in the test signals, caused by the different encoding rates and/or simply by the corresponding implementation of the HE-AAC v2 codec. It is worth noting here that PEAQ was not designed for these degradations.
All the used metrics are full-reference methods, i.e. they all include the original signal/reference in the quality assessment process. It should be noted that the metric references are different from those included in the subjective test. Due to limitations of the mobile handsets used for the tests, a choice was made to use sequences continuously encoded at high quality as non-explicit test references. However, these would probably contain coding artifacts, albeit at an imperceptible level, that would introduce a systematic bias into the objective quality assessment. Hence, the resized maximum quality videos (480p) were used as the reference for metric computation. In the case of video measurements, the Y component from the raw uncompressed YUV-format video sources was used, whereas in the case of audio measurements, uncompressed audio (wav format, stereo, 44100 Hz) was used.
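As an illustration of the frame-wise averaging applied to metrics such as PSNR, the sketch below computes per-frame PSNR on luma (Y) planes and averages over a sequence. The frames are synthetic stand-ins; a real pipeline would read the decoded YUV data:

```python
import numpy as np

def frame_psnr(ref, dist, peak=255.0):
    """PSNR (dB) of one 8-bit luma frame; inf for identical frames."""
    mse = np.mean((ref.astype(np.float64) - dist.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def sequence_psnr(ref_frames, dist_frames):
    """Frame-wise PSNR averaged over the whole sequence."""
    return float(np.mean([frame_psnr(r, d)
                          for r, d in zip(ref_frames, dist_frames)]))

# Synthetic 854x480 luma frames with mild noise standing in for coding error.
rng = np.random.default_rng(2)
refs = [rng.integers(0, 256, (480, 854), dtype=np.uint8) for _ in range(3)]
dists = [np.clip(f.astype(int) + rng.integers(-2, 3, f.shape), 0, 255)
         for f in refs]
avg = sequence_psnr(refs, dists)
print(avg)  # roughly 45 dB for this noise level
```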

Audiovisual quality model
Another objective of this work was to derive a joint audiovisual model which can effectively characterize/estimate the global quality perception of live concert streaming, as obtained from the subjective tests. As mentioned in Section 2, the model shown in (1) is a common approach when deriving audiovisual quality (AV_Q) [52]. In this study, a parametric regression was used to fit the normalized MOS data (MOS_n), considering both audio (A_Q) and video (V_Q) objective quality outputs, which were also normalized prior to data fitting. The data range was normalized into [0, 1], using x_n = (x − min(x))/(max(x) − min(x)).
MOS predictions (MOS_p) were then obtained as the outputs of the resulting fitted models. Data fitting was done using the Curve Fitting tool of MATLAB. Moreover, an extension of the above model was also investigated, with the inclusion of quadratic terms of both audio and video quality metrics. These terms increase the degrees of freedom of the audiovisual model, which is expected to improve the fit to MOS_n and, ultimately, the model accuracy. Hence, this extended model is defined as follows:

AV_Q = a_0 + a_1 A_Q + a_2 V_Q + a_3 A_Q V_Q + a_4 A_Q^2 + a_5 V_Q^2    (2)

with the addition of quadratic terms for audio (A_Q^2) and video (V_Q^2) and the respective weight coefficients a_4 and a_5. This extends the model in (1), where the only second-order term was a_3 A_Q V_Q. Adding third-order terms would lead to extra complexity and would also run the risk of over-fitting.
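The fitting step can also be reproduced outside MATLAB: the sketch below builds the design matrix of the extended model, including the quadratic terms, and solves it by least squares. All data here are synthetic and illustrative, not the paper's measurements:

```python
import numpy as np

def normalize(x):
    """Min-max normalization into [0, 1], as applied before fitting."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Synthetic, illustrative metric outputs and MOS (not the paper's data).
rng = np.random.default_rng(3)
a_q = normalize(rng.uniform(1, 5, 40))   # normalized audio metric output
v_q = normalize(rng.uniform(1, 5, 40))   # normalized video metric output
mos_n = normalize(0.3 * a_q + 0.6 * v_q + 0.1 * a_q * v_q
                  + rng.normal(0, 0.02, 40))

# Design matrix of the extended model with quadratic terms.
X = np.column_stack([np.ones_like(a_q), a_q, v_q, a_q * v_q,
                     a_q ** 2, v_q ** 2])
coeffs, *_ = np.linalg.lstsq(X, mos_n, rcond=None)
mos_p = X @ coeffs                       # model predictions (MOS_p)

r_squared = 1 - (np.sum((mos_n - mos_p) ** 2)
                 / np.sum((mos_n - mos_n.mean()) ** 2))
print(r_squared)
```

The coefficient of determination computed at the end corresponds to the goodness-of-fit analysis carried out for each candidate model.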
A goodness-of-fit analysis of each joint audiovisual model was carried out. The coefficients of the fitted models were also analysed for a better understanding of the relative influence of the separate audio and video quality in each joint metric. Finally, MOS_p for each joint audiovisual metric was computed, to assess which combination of metrics provides the best characterization of the MOS from the subjective tests. The performance of the proposed models is evaluated using a series of statistical evaluation measures, which include the Pearson Linear Correlation Coefficient (PLCC), the Spearman Rank Order Correlation Coefficient (SROCC), the Root Mean Squared Error (RMSE) and the epsilon-insensitive RMSE (RMSE*), as defined in ITU-T Rec. P.1401 [18], using the subjective evaluation results as baseline.
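These performance measures can be computed directly; the sketch below uses scipy for PLCC/SROCC and numpy for RMSE. The epsilon-insensitive variant discounts residuals falling within the 95% confidence interval of each subjective score, in the spirit of ITU-T P.1401; the MOS values and `ci95` below are assumed, illustrative numbers:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos = np.array([4.2, 3.1, 2.5, 4.8, 3.9, 2.0])    # subjective MOS (illustrative)
mos_p = np.array([4.0, 3.3, 2.7, 4.6, 3.7, 2.2])  # model predictions
ci95 = np.full_like(mos, 0.25)                    # per-condition 95% CI (assumed)

plcc, _ = pearsonr(mos, mos_p)
srocc, _ = spearmanr(mos, mos_p)
rmse = np.sqrt(np.mean((mos - mos_p) ** 2))
# Epsilon-insensitive RMSE: residuals inside the CI are treated as zero.
perror = np.maximum(0.0, np.abs(mos - mos_p) - ci95)
rmse_star = np.sqrt(np.mean(perror ** 2))
print(plcc, srocc, rmse, rmse_star)
```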

MOS data analysis
Following ITU-R Recommendation BT.500 [17], subjective test results were screened to discard subjects whose ratings present a strong shift compared to the average behaviour. According to this analysis, no subject had to be discarded. Fig. 5 presents box plots of the subjective scores obtained from the experiment, considering all stimuli (Fig. 5a) and stimuli separated into U2 (Fig. 5b) and Pink Floyd (Fig. 5c). The MOS of each impairment is indicated by a circle. A two-way ANOVA test was initially conducted over the entire data (Table 1) to analyse the statistical significance of both factors: the test conditions and the content type. The ANOVA outcome shows that subjects revealed higher sensitivity to the test conditions (F-ratio = 16.77, p < 0.0001) than to the investigated signals (F-ratio = 10.16, p < 0.0001). The interaction between the involved factors is not statistically significant (F-ratio = 1.51, p = 0.0830).
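The two-way ANOVA follows the standard balanced-design decomposition of the total sum of squares. A minimal NumPy sketch (our own helper, for illustration, assuming an equal number of subjects per condition/content cell) producing F-ratios of the kind reported in Table 1:

```python
import numpy as np

def two_way_anova(data):
    """Two-way ANOVA with replication for a balanced design.

    data has shape (a, b, n): a levels of factor A (test condition),
    b levels of factor B (content/signal), and n replicates (subjects).
    Returns the F-ratios for factor A, factor B and the A x B interaction.
    """
    data = np.asarray(data, float)
    a, b, n = data.shape
    grand = data.mean()
    mean_a = data.mean(axis=(1, 2))   # marginal means of factor A
    mean_b = data.mean(axis=(0, 2))   # marginal means of factor B
    mean_ab = data.mean(axis=2)       # per-cell means

    # Sums of squares for main effects, interaction and residual error
    ss_a = b * n * np.sum((mean_a - grand) ** 2)
    ss_b = a * n * np.sum((mean_b - grand) ** 2)
    ss_ab = n * np.sum((mean_ab - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
    ss_err = np.sum((data - mean_ab[:, :, None]) ** 2)

    # Mean squares and F-ratios
    ms_a = ss_a / (a - 1)
    ms_b = ss_b / (b - 1)
    ms_ab = ss_ab / ((a - 1) * (b - 1))
    ms_err = ss_err / (a * b * (n - 1))
    return ms_a / ms_err, ms_b / ms_err, ms_ab / ms_err
```

The corresponding p-values are obtained from the F distribution with the matching degrees of freedom (e.g. via scipy.stats.f.sf).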
An initial analysis of the data shown in Fig. 5a reveals that case 2 (audio-only impairments) achieved the best MOS after the reference. The MOS of case 2 is slightly lower than the reference MOS, which shows that listeners were able to detect the lower audio quality when no video distortion is present. Nonetheless, this difference is not statistically significant. Moreover, cases 3, 4 and 6, where the video quality drops to the lowest available level for 1-2 chunks of 10 seconds, yield the worst MOS and are quite similar to each other. One-way ANOVA tests (CI = 95%) were carried out over this data (Table 2). Normal distribution of the data was confirmed by a Kolmogorov-Smirnov test [30]. Some important conclusions arise from these outcomes. The MOS similarity between the audio-only impairment scores and the reference scores was previously discussed. The similarity between the MOS of cases 1 and 5 (p = 0.8590) further shows that audio distortions do not affect the global quality perception, even when the distortions span a longer period of time. The MOS values of cases 3 and 4 are statistically similar to the MOS of case 6 (p = 0.9130 and p = 0.7111, respectively). Looking at directly comparable cases in terms of bandwidth requirements, ANOVA shows a statistical difference between the MOS of cases 1 and 4 (p = 0.0120) and also between cases 3 and 5 (p = 0.0022), with higher MOS for the cases with video encoded at 256 kbps and audio encoded at 128 kbps. Based on these results, it is possible to conclude that the perceived quality is not significantly affected by lower audio quality, particularly when video distortions are present. Figure 6 shows box plots of the subjective test scores with all stimuli for audio-only impairments (case 2) and video-only impairments (cases 3 and 4), in comparison with the corresponding references. It should be noted that, for the statistical analysis, the reference data consists of the paired reference scores in each test.
For example, if a given subject scored an impaired sequence, that score is paired with the same subject's score for the corresponding reference sequence. It is clear from Fig. 6 that video impairments have a greater impact than audio impairments on the quality perceived by the end user. Statistical analysis of these results was also performed, after confirming the normal distribution of the data using the Kolmogorov-Smirnov test [30]. A one-way ANOVA test (CI = 95%) was run to evaluate the statistical significance of the differences between the mean values of the impairment scores and the respective references. The ANOVA tests yielded a p-value of 0.4431 for audio-only impairments and a p-value of 1.71×10^-17 for video-only impairments. Therefore, it may be concluded that differences in quality perception in cases with only audio distortion are not statistically significant (p > 0.05), whilst for cases with only video distortion these differences are statistically significant (p < 0.05).
As mentioned in the description of the source content, this experiment included representations of two different contexts. Hence, one-way ANOVA (CI = 95%) was also performed differentiating the content of each band, to analyse the possible influence of the type of content (Tables 3 and 4). The statistical similarities found with undifferentiated content were also found within both the U2 and Pink Floyd video groups. Regarding the U2-related data, there is statistical similarity between case 5 and the reference (p = 0.1098). A gradual reduction of the total bandwidth requirements to 280 kbps (V + A: 256 kbps + 24 kbps) did not cause a significant loss in the quality experienced by the end user. In this profile, the audio quality variation allows the reduction of the total bandwidth to a level close to those of cases 3 and 4 (256 kbps), where global quality perception is affected by low video quality. The results obtained from the Pink Floyd sequences suggest that audio distortions are more noticeable, which may arise from a higher focus on the specific musical content, as a clearer difference may be seen between the MOS of case 2 and the reference in Fig. 5c. However, this difference is not statistically significant. Nevertheless, case 5 is also statistically similar to its reference (p = 0.1071), as is case 1 (p = 0.0507), where an abrupt bandwidth reduction to 280 kbps is simulated. Analysing the directly comparable bandwidth cases (1-4 and 3-5), although the MOS values for cases 1 and 5 (video quality loss compensated with audio at 128 kbps) are still higher than those of their related cases, a statistical separation of the MOS is more apparent for U2 than for Pink Floyd content. In fact, a statistical similarity is registered for both of these comparisons in Pink Floyd, with particular relevance for cases 3 and 5. Hence, the possibility of compensating video quality losses with audio quality seems to be content-dependent to a certain extent.
Some marginal conclusions may also be derived from the reported results. The scores of cases 3 and 4 are statistically similar to each other (p = 0.7995). Hence, highly noticeable video distortions cause a great impact on the perceived quality of a given audiovisual stream, with either gradual or abrupt bit rate variations. Furthermore, impairment case 1 of the U2 videos (where the video bit rate is dropped to 256 kbps) shows statistical similarity with cases 3 (p = 0.0587), 4 (p = 0.0673) and 6 (p = 0.1461), where the video bit rate drops to 128 kbps. These results show that an identical loss in the global quality experienced by the end user may be caused by either smaller or larger variations in video quality, when the visual content involves rapid movement and/or camera changes.

Audiovisual quality model
The coefficients obtained from the parametric model regression are shown in Tables 5 (model 1) and 6 (proposed model). Regarding the model 1 coefficients, a_2 (the weight of the video quality metric, V_Q) assumes a higher absolute value than a_1 (the weight of the audio quality metric, A_Q) for the majority of metric combinations. This is well in line with the subjective results discussed above, which suggested that video may be the dominant factor in the global perceived quality. In the proposed model, a similar tendency is observable, as coefficients a_2 and a_5 (the weight of the quadratic term of the video quality metric, V_Q^2) have, in general, larger absolute values than both a_1 and a_4 (the coefficients of the audio metric and of its quadratic term, respectively). Furthermore, both models also include a term A_Q·V_Q, which captures the interaction between the two individual quality measures (a_3, in the fourth column of Tables 5 and 6). It is interesting to find that this interaction term plays a considerable role in almost every case: the absolute values of a_3 are not negligible and are larger than the audio coefficients in most cases. However, it is not possible from this study to clearly establish the relation of this interaction factor with the individual metrics or with the individual audio and video quality outcomes. Fig. 7 shows the surface fitting of MOS_n using parametric regressions of model 1. Table 7 presents the goodness-of-fit parameters provided by the curve fitting tool (R², adjusted R² and Sum of Squared Errors, SSE). Table 8 reports the statistical evaluation metrics for the MOS predictions (MOS_p) obtained with both fitted audiovisual models (PLCC, SROCC, RMSE and RMSE*). Given the small number of samples, the Student's t 95% confidence interval was considered when computing RMSE* [18].
All audiovisual metric combinations yield relatively good Pearson coefficients between MOS_n and MOS_p (PLCC > 0.8). Figure 8 presents the surface fittings of MOS_n using the parametric regression of the proposed model, i.e. (2). The obtained surfaces fit the data in a less rigid manner than the surfaces obtained with (1), suggesting a better approximation of MOS_n. As shown in Table 7, R² values are higher and SSE values are lower for every fitted curve, when compared with the analogous values from model 1, suggesting an improvement with the addition of the quadratic terms.
The adjusted R² offers a measure of the explanatory power of adding a term to a given model. Comparing the adjusted R² values of both models in Table 7, the inclusion of the quadratic variables effectively improves the fitting of the MOS_n data for seven of the metric combinations, whereas the adjusted R² slightly decreases for the remaining ones. This again suggests that video quality plays a more important role in this context than audio quality, as shown by the subjective tests presented in this research. It should also be noted that the best metric combination according to SROCC (POLQA Music with MS-SSIM) did not decrease its monotonic correlation. As the differences between the reported correlation measures, RMSE and RMSE* values for all the investigated audio and video metric combinations are small, the corresponding statistical significance tests [18] were performed to assess the significance of those differences. This is the common approach of the Video Quality Experts Group and the International Telecommunication Union when benchmarking quality prediction/estimation models. It should be noted that SROCC captures a (possibly non-linear) monotonic relationship, and therefore the statistical significance test for correlation coefficients cannot be computed in this case. Table 9 shows that, regarding PLCC, most of the metric combinations are statistically equivalent to the best performing metric combination (ViSQOL Audio with ST-RRED in both audiovisual models). As for RMSE, an efficient discrimination between metric combinations is also not possible, even though a smaller number of combinations are statistically equivalent to the best joint metric (ViSQOL Audio with ST-RRED in both audiovisual models).
Taking these statistical significance tests into consideration, RMSE* appears to be the most discriminative performance measure for both models. As described in [18], RMSE* measures differences taking into account the MOS uncertainty. In other words, it measures the scattering of MOS_p, as it ignores small differences with respect to an epsilon-wide band defined by t × σ, where t refers to the Student's t critical value at 95% confidence and σ to the standard deviation of the MOS. Interestingly, RMSE* isolates POLQA Music with ST-RRED as the best performing joint metric for model 1 (RMSE* = 0.0203). Regarding the proposed model, ViSQOL Audio with MS-SSIM yields the best result (RMSE* = 0.0407), with two other combinations presenting statistically equivalent results (ST-RRED combined with both POLQA Music and ViSQOL).

Table 9 Results of statistical significance tests for PLCC, RMSE and RMSE*. Note: "1" indicates that the metric combination is statistically equivalent to the top performing metric combination (denoted by shaded cells); "0" indicates a statistical difference.

The experimental setup in this work does not consider short-term temporal quality variations, as the chunk duration is constant and relatively long, i.e. 10 seconds, which is usually considered the best compromise, as in [40]. Based on the results of preliminary performance tests, the deployed audio and video quality prediction models were found to provide reasonable predictions of the long-term quality variations introduced by our impairment cases, despite the fact that they were not explicitly designed to take temporal variations of quality into account.
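The per-pair equivalence decision behind Table 9 reduces, for RMSE and RMSE*, to an F-test on the ratio of squared errors. A minimal sketch (our own helper; the critical value F(0.05, N−d, N−d) must be supplied from F tables or e.g. scipy.stats.f.ppf, and is an assumption in the usage example):

```python
def rmse_equivalence(rmse_a, rmse_b, f_crit):
    """ITU-T P.1401-style comparison of two models via their (epsilon-insensitive)
    RMSE values. The statistic is the ratio of the larger to the smaller squared
    RMSE; the models are declared statistically equivalent ("1" in Table 9) when
    the statistic stays below the supplied F critical value."""
    lo, hi = sorted((rmse_a, rmse_b))
    f_stat = (hi / lo) ** 2
    return f_stat, f_stat < f_crit

# e.g. comparing two RMSE* values with a hypothetical critical value of 2.0:
stat, equivalent = rmse_equivalence(0.0203, 0.0407, f_crit=2.0)
```

The same helper applies unchanged to plain RMSE values; only the degrees of freedom behind f_crit differ.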
Furthermore, it is worth reiterating that the main goal of the developed models was to effectively map the predictions provided by the audio and video quality metrics into a global quality estimate, which is confirmed by the good correlation coefficients obtained with both audiovisual quality estimation models. However, the accuracy of these models may be further increased by considering audiovisual quality prediction models designed to fully take into account the temporal quality variations introduced by HAS. The outcome of the ITU-T SG12 P.NATS standardization effort may contribute greatly to this objective, especially when it comes to video quality prediction models.

Conclusion
The objectives of this paper were two-fold. First, considering the purpose and design of the MPEG-DASH protocol, a main goal was to generate valuable insights into possible trade-offs in relative bandwidth allocation. Therefore, the joint effect of audio and video quality was subjectively assessed, using varying aggregate bandwidth in a recorded live music streaming scenario emulating a mobile network. It should be emphasized that the conclusions drawn are limited to this case study; in particular, we studied only live concert streaming in the context of mobile devices and networks. As is well known, mobile devices are typically not capable of reproducing high-quality audio, despite being commonly used by the general public for audio reproduction. For this reason, audio quality can be reduced without a strong impact on the perceived audiovisual quality. If systems other than mobile ones are used, these conclusions may not hold, particularly if higher bandwidth is available.
Typical MPEG-DASH encoding variations were used in the sequences, which were divided into 10-second chunks. Network congestion was emulated for 1-3 consecutive chunks, reducing the bit rate of video, audio or both media, respectively. The scores obtained from the subjective test have allowed us to draw the following important conclusions. The reduction of the audio bit rate for a small number of chunks does not significantly affect the global/audiovisual quality perceived by the end user. On the other hand, a video bit rate reduction has a greater impact on the global/audiovisual quality perceived by the end user. Hence, it is possible to conclude that reducing audio quality to the lowest tested level makes it possible to avoid a video quality reduction without a significant loss in the audiovisual quality perceived by the end user. That would not be the case if the video bit rate were reduced by a similar amount, as this causes a significant perceived quality loss. As a trade-off example, it is preferable to reduce the bit rate of the audio content from 128 to 24 kbps, even for two chunks, than the bit rate of the video content from 256 to 128 kbps for just one chunk. Also, the direct comparison of similar aggregate bandwidth cases indicates that reducing the audio bit rate to 24 kbps, with a simultaneous video bit rate reduction to an intermediate level (256 kbps), yields a better perceived audiovisual quality than the case involving only a video bit rate reduction to 128 kbps.
Based on the results provided by a number of audio and video objective quality metrics, the second main goal of the paper was to derive an effective joint model for estimating the audiovisual quality perceived by the end user, in the context of live music streaming over mobile networks. A parametric model was proposed, which incorporates quadratic terms of the separate audio and video quality metrics, to extend a commonly used model for joint audiovisual quality [52].
The obtained performance measures suggest that the proposed approach is valid as a joint audiovisual quality estimation model in the context of recorded live music streaming over mobile networks. All tested metric combinations achieved Pearson correlation coefficients above 0.8, and the proposed model globally improved on the results of the previous approach. Considering all the discussed performance measures, the top performing metric combinations include ST-RRED and MS-SSIM, combined with either POLQA Music or ViSQOL Audio.
The conclusions arising from the results presented in this paper may be valuable for the development of new bandwidth adaptation strategies for adaptive streaming over HTTP, which in turn is important for the rapidly growing commercial case of network service providers. However, a similar study with other types of content and devices should be considered in future work, in order to fully assess the applicability of these results to a general case. Unfortunately, to the best of our knowledge, there is no such database publicly available.
Based on the reported results, it also seems to be worth investigating the performance of higher order polynomials, when it comes to the proposed audiovisual quality assessment approach. Furthermore, the reported quality evaluation methodologies might be applied as a basis in the context of omnidirectional audio and video, by varying the quality of the different audio signals and also the quality of different tiles of the omnidirectional video considering the focus of the viewer, in order to keep a reasonable bandwidth consumption.