
Automatic and Human Rating Differences in CELST in Guangdong NMET: A Many-Facet Rasch Analysis

Posted on: 2016-05-04
Degree: Master
Type: Thesis
Country: China
Candidate: L. L. Cao
Full Text: PDF
GTID: 2297330479982473
Subject: Foreign language teaching techniques and evaluation
Abstract/Summary:
Since its introduction in 2011, the CELST (Computerized English Listening and Speaking Test) in the Guangdong NMET has received increasing attention. As a relatively new spoken-test format, it focuses mainly on measuring communicative ability and language-use behavior. Because it is a subjective test, however, ratings depend largely on raters' subjective impressions, so rating reliability is easily influenced by various factors; rater effects are therefore the first problem that must be addressed to guarantee rating quality, and effective control of rater quality is an important means of ensuring the quality of subjective tests. Yet it remains unclear how effective rater training is, where raters' rating biases lie, and which aspects of rater performance should be targeted for improvement. To improve rating reliability for this large-scale, high-stakes test, the decision-making body has in recent years planned an automatic-rating reform of the computerized English listening and speaking test in the Guangdong NMET. The reliability of automatic rating, and the differences between automatic and human rating, have thus become pressing questions.

In recent years, a number of researchers abroad have employed mathematical models to analyze rating results. One widely used model is the Many-Facet Rasch Model (MFRM), which derives from Item Response Theory in psychological measurement. MFRM extends the original Rasch model by incorporating additional facets that can influence rating results. It can evaluate each measurement facet individually, check for bias interactions between facets, and offer a systematic, detailed analysis of subjective rating quality.

This study used materials from the listening and speaking test in the 2013 Guangdong NMET, administered the test to 119 senior-three students, and analyzed the rating results. Each student was rated by three types of raters, so that the rating differences between rater types could be analyzed in detail. By background, the raters fell into three groups: college teachers, high school teachers, and an automatic rater. Using the Rasch model, this paper analyzed and explored differences between the automatic and human raters in rater reliability, severity, central tendency, random effects, and unexpected ratings. It also concretely evaluated and compared each rater type's severity and reliability, analyzed possible causes of each rater type's bias toward specific test takers, and extracted abnormal scores.

The results showed that all three rater types possessed good inter-rater reliability, although under stringent infit limits the automatic rater showed lower intra-rater reliability than the other two types. No central tendency or random effects were found among the three rater types, and the automatic rater and the college-teacher raters produced a few unexpected ratings for some students on certain items.

It is hoped that this research offers a concrete statistical basis for the automatic-rating reform of the English listening and speaking test in the Guangdong NMET and encourages the application of MFRM in operational score monitoring.
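For reference, the abstract does not state the model explicitly; a standard three-facet MFRM formulation (the facet names here follow common MFRM usage, not necessarily the thesis's own notation) is:

    \log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k

where P_{nijk} is the probability that examinee n receives category k from rater j on item i, B_n is the examinee's ability, D_i the item's difficulty, C_j the rater's severity, and F_k the difficulty of the step from category k-1 to category k.

The intra-rater reliability finding above rests on infit/outfit mean-square fit statistics. As a minimal sketch of how such statistics are computed (the function name and inputs are hypothetical; in practice the expected scores and variances come from an MFRM estimation program such as FACETS):

import numpy as np

def rasch_fit_statistics(observed, expected, variance):
    # observed: ratings x actually awarded by the rater
    # expected: model-expected ratings E under the estimated MFRM
    # variance: model variance W of each rating
    z2 = (observed - expected) ** 2 / variance  # squared standardized residuals
    outfit = z2.mean()  # unweighted mean-square
    infit = ((observed - expected) ** 2).sum() / variance.sum()  # information-weighted mean-square
    return infit, outfit

Values near 1.0 indicate good fit; "stringent infit limits" means a narrow acceptance band around 1.0 (ranges such as 0.8-1.2 are often cited, though the thesis's exact limits are not given in the abstract).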
Keywords/Search Tags: English oral proficiency test, rater effects, automatic rating, MFRM