
Text Difficulty Measurement for English Learning

Posted on: 2008-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: J X Wu
Full Text: PDF
GTID: 2178360245998048
Subject: Computer Science and Technology
Abstract/Summary:
English text difficulty measurement is an important concept in applied linguistics and information processing, and it is widely used in teaching, publishing, search engines and other fields. Because the Internet offers an abundance of reading materials, efficiently finding materials at different difficulty levels poses a challenge for text difficulty measurement.

This paper first examines the internationally popular approach based on readability formulas. The widely used readability formulas typically contain only two variables: word length or word frequency, and average sentence length. We chose three formulas, Flesch Reading Ease, Gunning Fog Index and Automated Readability Index, and tested them on data at different difficulty levels; the results were very poor, so these formulas alone cannot reliably measure text difficulty.

Therefore, we focus on building a broadly applicable text model for measuring text difficulty. The vector space model is a typical text representation that ignores term order and expresses a text as a vector; a text is assigned a value by computing its cosine similarity to sample texts, which makes the approach easy to implement. This paper measures text difficulty on the basis of the vector space model, treating difficulty measurement as a classification problem. This method has several advantages: the result is not a binary value but a probability estimated over the entire training set, and it provides additional information such as the distribution of terms.

For feature selection, this paper analyzes several commonly used methods: document frequency, information gain, mutual information, chi-square statistics (CHI), expected cross entropy, weight of evidence for text, and odds ratio. The results show that odds ratio performs best and mutual information performs worst. This paper also discusses the traditional term weighting algorithm TF-IDF and introduces inter-class and intra-class factors into term weighting; experimental results show that the improved algorithm outperforms the traditional method in F1.

Finally, this paper examines three classification algorithms: Rocchio's algorithm, K-Nearest-Neighbor and Naive Bayes. Experimental results indicate that the multinomial Naive Bayes classifier achieves the highest F1, reaching more than 80%.
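The three readability formulas named above follow standard published definitions and are easy to reproduce. The sketch below computes them from raw text; the whitespace/punctuation tokenization and the vowel-group syllable heuristic are simplifying assumptions for illustration, not the preprocessing used in the thesis.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels
    # (assumption; not the syllable counter used in the thesis).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    chars = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    w, s = len(words), max(1, len(sentences))
    # Flesch Reading Ease: higher scores mean easier text.
    flesch = 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w)
    # Gunning Fog Index: estimates years of schooling required.
    fog = 0.4 * ((w / s) + 100.0 * (complex_words / w))
    # Automated Readability Index: based on characters per word.
    ari = 4.71 * (chars / w) + 0.5 * (w / s) - 21.43
    return {"flesch_reading_ease": flesch,
            "gunning_fog": fog,
            "automated_readability_index": ari}

if __name__ == "__main__":
    sample = "The cat sat on the mat. It was a sunny day and the cat was happy."
    print(readability_scores(sample))
```

As the abstract notes, such surface-level counts ignore vocabulary and discourse factors, which is why the thesis moves on to a corpus-based model.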
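To illustrate the vector-space formulation of difficulty measurement as classification, the following minimal sketch uses scikit-learn's TfidfVectorizer, cosine similarity and multinomial Naive Bayes. The toy corpus and labels are hypothetical, and the thesis's feature selection (e.g. odds ratio) and inter-/intra-class weighting refinements are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: texts labelled with difficulty levels.
train_texts = ["the cat sat on the mat",
               "quantum entanglement defies classical intuition",
               "she went to the shop to buy milk",
               "the epistemological ramifications remain contested"]
train_levels = ["easy", "hard", "easy", "hard"]

# Bag-of-words TF-IDF vectors: term order is ignored,
# as in the vector space model described above.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Cosine similarity of a new text to the training samples.
new_text = ["the dog ran in the park"]
X_new = vectorizer.transform(new_text)
print(cosine_similarity(X_new, X_train))

# Difficulty measurement cast as classification: the classifier
# returns a probability per difficulty level, not a binary value.
clf = MultinomialNB()
clf.fit(X_train, train_levels)
print(clf.predict(X_new), clf.predict_proba(X_new))
```

In practice a feature selection step would filter the vocabulary before weighting, and the per-class probabilities returned by the classifier correspond to the graded difficulty scores discussed above.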
Keywords/Search Tags: text difficulty, readability, vector space model, feature selection, term weight