| In today’s multimodal, multimedia environment, understanding the roles of visuals in video-mediated L2 listening is of great pedagogical significance. The nature and process of audio-visual perception need to be investigated to avoid superficial video use and to find guiding principles for writing video-mediated listening materials, class instruction and test development. Motivated by the dearth of studies in this area, the present research aims to find which variables affect comprehension and how they interfere with the processing of auditory input by addressing three questions: 1) How does visual input affect comprehension processing? Should the ability to use visual information be considered a component of the definition of listening? 2) What are the different effects of various visual types on test-takers’ listening comprehension? 3) Do different visual types have a consistent influence on test-takers at different proficiency levels? A Cognitive Multimodal Processing Model is developed as the theoretical framework, which classifies video shots into context visuals and content visuals. The former display kinesic and paralinguistic features of speakers, while the latter depict the represented participants, scenes and events. Key elements of the context visuals, including lip shapes and gestures, interfere with auditory input at the acoustic, semantic and pragmatic levels of processing, as exemplified by the McGurk effect. In contrast, content visuals and auditory information are handled by independent processors in working memory and formulated into visually based and verbally based models, which are gathered in the central executive of working memory and built into an integrated comprehension. A multimodal discourse analysis further categorizes content visual-audio relationships as convergent, corresponding and divergent. 
It is hypothesized that, on a theoretical level, context visuals have a positive effect on listening comprehension; for content visuals, as the degree of audio-visual correspondence decreases, the visuals’ facilitative effect decreases and their debilitative effect increases. The definition of listening that includes the context visual element is termed the weak version, and the one that includes content visuals is termed the strong version. We adopt a triangulated research methodology, using both quantitative and qualitative data to probe learners’ mental processes. Three methods were used to gain breadth and depth from different samples: a survey, audio/video comparative tests and verbal reports. Their results supplement, clarify and reinforce each other. The survey is composed of one questionnaire for teachers and one for students representing high-proficiency (HG) and low-proficiency (LG) groups. The survey results illustrate multimodal teaching realities and problems, with three key findings: 1) Chinese EFL learners are not adept at utilizing kinesics to facilitate comprehension, so context visuals may not have a positive influence on their listening comprehension; 2) LG seem to be guided by visuals in global understanding, while HG also benefit from visuals in sentence-level understanding; 3) HG are more positive about video use and demonstrate better audio-visual strategies than LG, and teachers lack awareness of visual roles in task design and strategy training. The audio/video comparative experiments consist of two tests using content visuals and one test using context visuals. In the former, listening tasks for two clips representing high and low audio-visual correspondence are administered to four groups of test-takers: video-HG, audio-HG, video-LG and audio-LG. The effects of audio/video tasks and language levels were examined with ANOVA tests in SPSS. The results indicate that video of high audio-visual consistency improves test-takers’ listening comprehension significantly. 
Video of low audio-visual consistency also improves comprehension of the global idea but has no facilitative effect on specific information irrelevant to the visuals. The line graphs indicate that visuals of high audio-visual correspondence improve LG performance more than HG performance, implying that content visuals might reduce the differentiation of listening tests. In the latter, Comprehension and Word Recognition tasks were developed for a talk-show clip to study the effect of kinesic and paralinguistic features on acoustic and linguistic processing. An independent-samples t-test indicates no significant difference between the scores of the audio group and the video group, suggesting that context visuals do not have a notable impact on comprehension. The verbal reports involved two subjects from HG and two from LG, who each gave immediate retrospective reports and were interviewed individually by the researcher. They described their visual and auditory perception and their overall comprehension during pauses in video playback; thereafter, they were asked reflective questions concerning the effect of visual elements and their general attitudes to video-mediated listening in class and in testing. This method produced several unique discoveries: 1) When the linguistic input is difficult, the two LG students comprehend through the visual channel, and their listening perception, especially bottom-up processing, is hindered. 2) The two HG students are more skilled at using visuals to supplement comprehension, especially in top-down processing. They also have better audio-visual strategies. 3) The information load of the visuals affects the listening outcome. 
If the shots change rapidly with a high information load, the debilitating effect outweighs the supportive effect. 4) Individual differences such as spatial ability are relevant to overall comprehension, suggesting that abilities other than listening might be tested. These findings are generalized to answer the research questions. 2) For context visuals, no significant effect on listening performance is found, in spite of the positive role assumed in the theoretical model; for content visuals, high audio-visual correspondence significantly improves comprehension of the verbal message, while low audio-visual correspondence has a positive effect on understanding of the global idea but no significant influence on specific information irrelevant to the visuals. 3) Verbal reports show that the facilitative and detrimental effects of content visuals are greater for LG than for HG learners, indicating that the visual design effect is stronger among LG learners. Visual supplantation also occurs among LG learners. These two findings lead to the answer to question 1): the weak version of listening is justified by environmental, theoretical and empirical evidence, suggesting that context visuals, rather than content visuals, be incorporated into the definition of listening comprehension. Finally, the theoretical and practical implications of this research are discussed. Theoretically, foreign language proficiency level is found to be closely associated with the nature and degree of audio-visual interaction, implying that it should be integrated into the multimodal processing model. Pedagogically, this means that teachers need to match the linguistic difficulty to the students’ level to avoid superficial video use. In addition, the nature and quality of the visuals, such as visual type classification, audio-visual correspondence or disparity, and content visual load, should be attended to in video selection and exercise development. Audio-visual strategy training should be conducted. 
Context visual materials could be used in tests to achieve a good backwash effect. All in all, the interaction of visual types, linguistic difficulty and learners’ language levels in audio-visual processing should be investigated further in future research. |