This study set out to design a computerized adaptive language test (CALT) that assesses listening and reading proficiency in English using a mixed-format design with dichotomous and polytomous item response theory (IRT) models, and to investigate validity issues of the CALT under the assessment use argument (AUA) framework.

To construct an item pool for the CALT, Study 1 was carried out, in which 8,203 test-takers’ item-level responses to 15 different forms of a computer-based language test (CBLT) were used for item calibration and differential item functioning (DIF) detection. The results indicated that 1) the item pool showed sufficient unidimensionality when passage-based items were grouped together as polytomous items; 2) the constructs tapped by the listening and reading sections were distinct from each other, suggesting the need for separate IRT calibrations; 3) the generalized partial credit model (GPCM) fit the data of passage-based polytomous items better than the graded response model (GRM); 4) approximately 12.5% of the items were identified as showing statistically and practically significant gender DIF. The item pool constructed in this way had a relatively flat scale information function, implying that the item pool provided equal precision of measurement for test takers along the ability continuum.

The item pool was then combined with the other three components (an item selection procedure, an ability estimation method, and a stopping rule) to develop a CALT system. A total of 416 test takers drawn from the Study 1 sample took the CALT in Study 2. Drawing on test-takers’ scores on the CALT, CBLT, and CET-4, as well as their self-ratings of computer familiarity, Study 2 investigated the validity of the CALT by examining its factor structure using confirmatory factor analysis (CFA), structural equation modeling (SEM), and multi-group SEM.
The results provided strong support for the validity of the CALT with evidence regarding the equivalence of the CALT and CBLT, English ability as a major factor measured in the CALT, as well as the factorial invariance of the CALT across male and female subgroups. These findings suggested the meaningfulness, impartiality, and generalizability of score-based interpretations of the CALT desired by the test developers.