IEC 62503:2008 – Multimedia quality – Method of assessment of synchronization of audio and video
5.2 Preparation of test video clips and test video sequence
5.2.1 Selection of content of a test video clip
Since lip sync is a matter of human perception, it may depend on the content of the video and the accompanying audio. In particular, when it relates to the lip movement of a human speaker, whether the spoken language matches the subject's mother tongue may affect the result.
NOTE In this International Standard, in order to provide worked examples, speech in Japanese uttered by a well-trained professional news reader is watched and listened to by subjects with the same mother tongue.
A bust shot of a news reader shall be extracted; its duration should be around 10 s to 20 s. The audio channel data of the video clip shall be taken as the timing reference. The possible time offset caused by mis-synchronization in this original video clip, $\Delta t_1$ at section 1-1', is unknown. However, this International Standard provides a method to estimate the overall lip sync $\Delta t_3$, which includes $\Delta t_0$ and $\Delta t_1$; namely, $\Delta t_3 = \Delta t_0 + \Delta t_1 + \Delta t_2$.
5.2.2 Creation of a test video sequence
The test video sequence shall be a randomized series of copies of the video clip selected in 5.2.1, in each of which the audio channel shall be replaced by time-shifted audio data, with the necessary duration of padding added as a leader or a trailer depending on the direction of the time shift. Preparation of such video clips is shown in Figure 2, which depicts the image frames with delayed audio and with leading audio. The amounts of time shift, $T_l$ (lead) and $T_d$ (delay), are adjustable.
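The padding step described above can be sketched as follows. This is a minimal illustration, not part of the standard: the function name `shift_audio`, the sign convention (positive values delay the audio, negative values advance it), the use of silence as padding, and the choice to keep the clip length unchanged by truncating the opposite end are all assumptions made for the example.

```python
def shift_audio(samples, shift_ms, sample_rate=48000):
    """Return a copy of `samples` shifted in time, with the same length.

    Assumed convention (not from the standard): shift_ms > 0 delays the
    audio relative to the video, shift_ms < 0 makes the audio lead.
    """
    # Number of samples corresponding to the requested shift.
    n = round(abs(shift_ms) * sample_rate / 1000)
    n = min(n, len(samples))
    silence = [0] * n
    if shift_ms >= 0:
        # Delayed audio: silent leader, tail truncated to keep the length.
        return silence + samples[:len(samples) - n]
    # Leading audio: head dropped, silent trailer appended.
    return samples[n:] + silence
```

For real clips the samples would come from the decoded audio track, and the shifted track would be re-multiplexed with the unchanged video frames; that tooling is outside the scope of this sketch.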
To allow a subject to visually identify each of the video clips with time-shifted audio that compose the test video sequence, each video clip prepared in accordance with Figure 2 should be preceded by the necessary number of title frames, which include a sequence number. The amounts of time shift for the audio data, $T_l$ and $T_d$, shall be determined taking into account the sum of the lip sync in the reproduction system, $\Delta t_2$, and the possible lip sync in the original video clip, $\Delta t_1$. The increment or decrement of the time shift, $\Delta T$, for $T_l$ or $T_d$ shall be decided in accordance with the required precision of assessment. In this standard, $\Delta T = 10$ ms is recommended.
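One way to plan such a sequence is sketched below, assuming the shifts run from the maximum audio lead to the maximum audio delay in steps of the recommended 10 ms. The function name `build_test_sequence`, its parameters, the example limits, and the convention that lead is negative and delay is positive are hypothetical choices for illustration; the standard does not prescribe them.

```python
import random

def build_test_sequence(max_lead_ms, max_delay_ms, step_ms=10, seed=None):
    """Return a randomized playlist of audio shifts with sequence numbers.

    Assumed convention (not from the standard): negative values mean the
    audio leads the video, positive values mean it is delayed.
    """
    # All shifts from -max_lead_ms up to +max_delay_ms in step_ms increments.
    shifts = list(range(-max_lead_ms, max_delay_ms + 1, step_ms))
    # Randomize the presentation order; a seed makes the order reproducible.
    rng = random.Random(seed)
    rng.shuffle(shifts)
    # The sequence number is what the title frames before each clip would show.
    return [
        {"sequence_number": i + 1, "audio_shift_ms": s}
        for i, s in enumerate(shifts)
    ]

# Example with hypothetical limits of 200 ms lead and 300 ms delay.
playlist = build_test_sequence(200, 300, seed=1)
```

Keeping the mapping from sequence number to audio shift in a separate record lets the experimenter match each reported score to its shift afterwards without revealing the shift to the subjects.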
The test video sequence should be stored on a medium such as a CD-ROM, without loss of audio-video synchronization, for use in 5.3.
5.3 Procedures and conditions for assessment of lip sync at section 3-3'
The procedures described below shall be followed.
a) The test video sequence, composed of a randomized order of the same short video clip with different audio time shifts (positive and negative) relative to the video, shall be reproduced to the subjects.
b) The number of subjects shall be at least 15. Each of them shall be asked to report his or her subjective opinion score for each of the video clips in the test video sequence under fixed viewing and listening conditions.
c) Advance instruction shall be given to the subjects on the five-grade impairment scale recommended by ITU-R BT.500-11 for subjective judgement, as shown in Table 1.
d) The viewing distance $L$ of the reproduced test video sequence shall be four times the reproduction height $H$ of the video frames: $L = 4H$.
5.4 Reporting of the result of assessment
Taking into account outliers from the original five-grade opinion scores, the averaged subjective opinion scores shall be plotted against the predetermined audio shift, $T_l$ or $-T_d$, as exemplified in Figure 3, with error bars of $m \pm c$, where $m$ is the sample mean of the opinion scores of a set of subjects for the same video clip and $c$ is the 95 % confidence interval of the respective mean opinion score.
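The error-bar values for one video clip can be computed as in the sketch below. It assumes the normal approximation $c = 1.96\, s / \sqrt{N}$, with $s$ the sample standard deviation and $N$ the number of subjects; this clause does not spell out the formula, so treat that choice, the function name `mean_and_ci95`, and the example scores as assumptions. Outlier screening is not performed here.

```python
import math
import statistics

def mean_and_ci95(scores):
    """Return (m, c): the mean opinion score and its 95 % confidence half-width.

    Assumes the normal approximation c = 1.96 * s / sqrt(N); outliers are
    expected to have been handled before calling this function.
    """
    n = len(scores)
    m = statistics.mean(scores)
    s = statistics.stdev(scores)      # sample standard deviation (N - 1)
    c = 1.96 * s / math.sqrt(n)
    return m, c

# Hypothetical scores from 15 subjects for one audio shift (scale 1 to 5).
scores = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 3, 4, 4, 5, 4]
m, c = mean_and_ci95(scores)
# The error bar for this clip spans from m - c to m + c.
```

Repeating this for every audio shift in the test video sequence gives the points and error bars for a plot of the kind the clause describes.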