Speaking tests are widely recognized as an essential component of language proficiency examinations, and they are the part of a test in which candidates can display their language competence most authentically. No language test that claims to be scientific and well designed can therefore leave out a speaking component. However, because the rating of oral performance is inherently subjective, it is of great importance to ensure sufficient objectivity and impartiality in the rating procedure of the speaking test.

In response to the rating inconsistency observed in large-scale oral speaking tests, this paper proposes a "triple scoring mode" and verifies through data analysis that the method plays a vital role in improving rating consistency and the reliability of the speaking test. The data were obtained from the computerized speaking test session of the 2011 ESL test at the College of International Studies, Hunan University. The candidates' performances on the four tasks of the speaking test were rated by 13 experienced raters in total: 5 raters were randomly assigned to the first rating, 5 to the second rating, and the remaining 3 were responsible for the third rating. In the scoring process, the first-rating and second-rating raters scored the audio files independently and in parallel, without one rater being influenced by another; the third-rating raters then rescored an audio file whenever the discrepancy between the two previous scores was greater than half a level (see the sketch below). The Multi-facet Rasch Model, under the IRT theoretical framework, was applied to investigate the dependability of the "triple scoring mode" in reducing rater subjectivity.
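For reference, a standard formulation of the Many-Facet Rasch Model with candidate, task, and rater facets is given below; the abstract does not spell out the exact facet structure used in the study, so this parameterization should be read as illustrative rather than as the study's precise model specification.

\[
\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
\]

where \(P_{nijk}\) is the probability that candidate \(n\), on task \(i\), receives score category \(k\) from rater \(j\); \(B_n\) is the candidate's ability, \(D_i\) the task difficulty, \(C_j\) the rater's severity, and \(F_k\) the difficulty of category \(k\) relative to category \(k-1\). Rater severity and the fit statistics discussed below are estimated within this framework.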
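To make the adjudication rule concrete, the following minimal Python sketch illustrates the triple scoring workflow described above. The function names, the representation of the half-level threshold, and the way the final score is resolved after a third rating are assumptions made for illustration only; the abstract does not specify how the three scores were combined.

```python
from typing import Optional

HALF_LEVEL = 0.5  # discrepancy threshold (in band levels) that triggers adjudication


def needs_third_rating(first_score: float, second_score: float) -> bool:
    """Return True when the two independent ratings differ by more than half a level."""
    return abs(first_score - second_score) > HALF_LEVEL


def resolve_final_score(first_score: float, second_score: float,
                        third_score: Optional[float] = None) -> float:
    """Resolve a candidate's score under the triple scoring mode (illustrative rule):
    average the first two ratings if they agree closely, otherwise defer to the
    third (adjudicating) rating."""
    if needs_third_rating(first_score, second_score):
        if third_score is None:
            raise ValueError("Third rating required but not provided")
        return third_score
    return (first_score + second_score) / 2


# Example: 3.0 and 4.0 differ by more than half a level, so the third rating decides.
print(resolve_final_score(3.0, 4.0, third_score=3.5))  # -> 3.5
```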
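The conclusions below also cite Infit Mean Square fit statistics and the 0.5-1.5 acceptability range; for reference, the standard information-weighted definition is

\[
\text{Infit MSQ} = \frac{\sum_{o} W_o \, z_o^2}{\sum_{o} W_o}, \qquad z_o = \frac{x_o - E_o}{\sqrt{W_o}},
\]

where, for each observation \(o\) relevant to a rater, \(x_o\) is the observed score, \(E_o\) its model-expected value, and \(W_o\) its model variance. Values near 1.0 indicate ratings consistent with model expectations, values below about 0.5 suggest "overfitting" (ratings more predictable than the model expects), and values above about 1.5 suggest "misfitting" (excess unmodelled variation). The study's analysis is assumed to use this standard definition.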
Through the experimental analysis, this study reaches the following conclusions.

Firstly, the raters in charge of the first rating and the second rating differed significantly in severity: rater L rated far too severely, while rater K and rater C rated far too leniently.

Secondly, across the four tasks of the speaking test, the majority of the raters exhibited acceptable self-consistency, with the exception of rater A and rater C, whose ratings fell outside the acceptable Infit Mean Square range of 0.5-1.5.

Thirdly, with respect to inter-rater reliability, the rater measurement report showed that the raters agreed on only 21.9% of the ratings they awarded, which falls short of the expected level of agreement. In addition, the bias analysis revealed discrepancies between observed scores and adjusted scores, and the bias patterns of the individual raters converged on one general finding: under the double rating mode, raters scored candidates of relatively low ability severely and more capable candidates leniently.

Finally, the results of the third ratings showed neither undesirable "overfitting" nor "misfitting", satisfied the expected separation index and Infit Mean Square values, and the observed scores generally matched the adjusted scores.

This study not only offers a dependability investigation of the "triple rating mode" for the ESL speaking test and the CEPT speaking test at Hunan University, but also provides empirical evidence for the further development and improvement of the triple rating mode in oral speaking tests. The analysis shows that the final results of the third rating complied with the values required by the Rasch model, which also supports this choice of scoring method for subjective language tests in the future.