Improved Term-weighting Approach In Chinese Text Classification Over Skewed Data Sets

Posted on: 2011-10-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Zhang
GTID: 2178360305489544
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet, information of every kind is growing exponentially every day, and most of it still exists in the form of text. How to manage and organize such a large and growing body of text, and how to mine from it the information people need, has become a problem of high research value that is drawing increasing attention worldwide. Against this background, text classification based on machine learning has emerged. It has become an important basis and prerequisite for information retrieval, information filtering, search engines, text databases, data mining, and related fields, and it has broad application prospects.

Text classification involves several key technologies: Chinese word segmentation, feature selection, the vector space model, classification models, evaluation metrics, and so on. Most machine-learning-based automatic text categorization is built on the vector space model (VSM), in which text is represented in a form that computers can process. Using a feature weighting algorithm, we select the features that play an important role and represent the text well, and we ignore the features that contribute nothing to categorization. This serves two purposes: it reduces the dimensionality of the VSM, which improves the efficiency of categorization, and it selects features that express the text better, which improves precision. The feature weighting algorithm is therefore the basis and premise of text categorization. Following this analysis, the dissertation focuses on improving the term-weighting approach.
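As a rough illustration of term weighting in the VSM, the following minimal Python sketch computes the textbook TFIDF weight (raw term frequency times log(N / document frequency)) for a few documents. It assumes the documents have already been segmented into tokens; the function name `tfidf_vectors` and the toy tokens are hypothetical, and this is the standard formulation rather than the exact variant used in the dissertation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) for a list of already-segmented documents.

    Each vector holds one weight per vocabulary term:
    weight = tf(term, doc) * log(N / df(term)).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Toy, pre-segmented documents (placeholders for segmented Chinese text).
docs = [
    ["stock", "market", "rise"],
    ["football", "match", "market"],
    ["football", "player", "goal"],
]
vocab, vectors = tfidf_vectors(docs)
weights_doc0 = dict(zip(vocab, vectors[0]))
print(weights_doc0)
```

In this toy corpus, "stock" occurs in one document and "market" in two, so "stock" receives the higher weight in the first document; terms absent from a document get weight zero.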
The contributions of this dissertation are as follows:
1. The basic concepts of text classification and its development in China and abroad are briefly introduced.
2. The key technologies of text classification are introduced, including text preprocessing, feature dimension reduction, text representation, classification algorithms, and evaluation metrics.
3. The TFIDF term-weighting approach used in classification is introduced and its weaknesses are analyzed. Several improved approaches based on TFIDF are laid out, of which TFIDF-DI performs better and is analyzed in detail.
4. The concept of the skewed data set is introduced, and control experiments are carried out using TFIDF and TFIDF-DI. The results are analyzed, and the shortcomings of both approaches on skewed data sets are pointed out.
5. An improved method, TFIDF-λDI, is proposed based on TFIDF-DI. The new approach is compared with TFIDF and TFIDF-DI using the KNN algorithm, and the results show that it improves classification performance to a certain extent.
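For readers unfamiliar with the KNN classifier used in the comparison, here is a minimal Python sketch of KNN with cosine similarity and plain majority voting over weighted document vectors. The function names, the two-dimensional toy vectors, and the labels "major" and "minor" are hypothetical; the sketch does not reproduce the dissertation's experiments or its TFIDF-DI and TFIDF-λDI weights, which are not defined in this abstract. The toy data is deliberately skewed (four training documents in one class, one in the other) to show one well-known way in which class imbalance can affect KNN.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train_vecs, train_labels, query, k=3):
    """Label the query by majority vote among its k most similar neighbors."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Skewed toy training set: four "major" documents, one "minor" document.
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.0, 1.0]]
train_labels = ["major", "major", "major", "major", "minor"]
query = [0.1, 0.9]  # closest to the single "minor" document

print(knn_predict(train_vecs, train_labels, query, k=1))
print(knn_predict(train_vecs, train_labels, query, k=3))
```

With k=1 the query follows its nearest neighbor, the "minor" document; with k=3 the two next-nearest neighbors both come from the larger class and outvote it. Term weights that separate the classes more sharply change the similarity ranking that this vote depends on.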
Keywords/Search Tags: Text classification, Term-weighting, Skewed data, TFIDF, TFIDF-λDI