
Text Difficulty Measurement for English Learning

Posted on: 2008-06-19
Degree: Master
Type: Thesis
Country: China
Candidate: J X Wu
Full Text: PDF
GTID: 2178360245998048
Subject: Computer Science and Technology
Abstract/Summary:
English text difficulty measurement is an important concept in applied linguistics and information processing, and it is widely used in teaching, publishing, search engines and other fields. Because the Internet offers an abundance of reading materials, efficiently finding materials at different difficulty levels poses a challenge for text difficulty measurement.

This paper first examines the internationally popular approach based on readability formulas. The widely used readability formulas typically contain only two variables: word length or word frequency, and average sentence length. We chose three formulas, Flesch Reading Ease, Gunning Fog Index and Automated Readability Index, and tested them on data at different difficulty levels; the results were very poor, so these formulas alone cannot reliably measure text difficulty.

Therefore, we focus on building a broadly applicable text model for measuring text difficulty. The vector space model is a typical text representation that ignores term order and expresses a text as a vector; a text is assigned a value by computing its cosine similarity to sample texts, which makes the approach easy to implement. This paper measures text difficulty on the basis of the vector space model, treating difficulty measurement as a classification problem. This method has several advantages: the result is not a binary value but a probability estimated over the entire training set, and it provides additional information such as the distribution of terms.

For feature selection, this paper analyzes several commonly used methods: document frequency, information gain, mutual information, chi-square statistics (CHI), expected cross entropy, weight of evidence for text, and odds ratio. The results show that odds ratio performs best and mutual information performs worst. This paper also discusses the traditional term weighting algorithm TF-IDF and introduces inter-class and intra-class factors into term weighting; experimental results show that the improved algorithm outperforms the traditional method in F1.

Finally, this paper examines three classification algorithms: Rocchio's algorithm, K-Nearest-Neighbor and Naive Bayes. Experimental results indicate that the multinomial Naive Bayes classifier achieves the highest F1, reaching more than 80%.
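The three readability formulas named above follow standard published definitions and are easy to reproduce. The sketch below computes them from raw text; the whitespace/punctuation tokenization and the vowel-group syllable heuristic are simplifying assumptions for illustration, not the preprocessing used in the thesis.

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels
    # (assumption; not the syllable counter used in the thesis).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    chars = sum(len(w) for w in words)
    syllables = sum(count_syllables(w) for w in words)
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)

    w, s = len(words), max(1, len(sentences))
    # Flesch Reading Ease: higher scores mean easier text.
    flesch = 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w)
    # Gunning Fog Index: estimates years of schooling required.
    fog = 0.4 * ((w / s) + 100.0 * (complex_words / w))
    # Automated Readability Index: based on characters per word.
    ari = 4.71 * (chars / w) + 0.5 * (w / s) - 21.43
    return {"flesch_reading_ease": flesch,
            "gunning_fog": fog,
            "automated_readability_index": ari}

if __name__ == "__main__":
    sample = "The cat sat on the mat. It was a sunny day and the cat was happy."
    print(readability_scores(sample))
```

As the abstract notes, such surface-level counts ignore vocabulary and discourse factors, which is why the thesis moves on to a corpus-based model.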
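To illustrate the vector-space formulation of difficulty measurement as classification, the following minimal sketch uses scikit-learn's TfidfVectorizer, cosine similarity and multinomial Naive Bayes. The toy corpus and labels are hypothetical, and the thesis's feature selection (e.g. odds ratio) and inter-/intra-class weighting refinements are not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus: texts labelled with difficulty levels.
train_texts = ["the cat sat on the mat",
               "quantum entanglement defies classical intuition",
               "she went to the shop to buy milk",
               "the epistemological ramifications remain contested"]
train_levels = ["easy", "hard", "easy", "hard"]

# Bag-of-words TF-IDF vectors: term order is ignored,
# as in the vector space model described above.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# Cosine similarity of a new text to the training samples.
new_text = ["the dog ran in the park"]
X_new = vectorizer.transform(new_text)
print(cosine_similarity(X_new, X_train))

# Difficulty measurement cast as classification: the classifier
# returns a probability per difficulty level, not a binary value.
clf = MultinomialNB()
clf.fit(X_train, train_levels)
print(clf.predict(X_new), clf.predict_proba(X_new))
```

In practice a feature selection step would filter the vocabulary before weighting, and the per-class probabilities returned by the classifier correspond to the graded difficulty scores discussed above.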
Keywords/Search Tags: text difficulty, readability, vector space model, feature selection, term weight