Improved Term-weighting Approach In Chinese Text Classification Over Skewed Data Sets

Posted on:2011-10-24

Degree:Master

Type:Thesis

Country:China

Candidate:Y J Zhang

GTID:2178360305489544

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of the Internet, the volume of information grows exponentially every day, and most of these information resources still exist in the form of text. How to manage and organize such a huge and growing body of text, and how to mine from it the information people need, has become a problem of high research value that is drawing more and more attention worldwide. Against this background, machine-learning-based text classification has emerged. It has become an important basis and prerequisite for information retrieval, information filtering, search engines, text databases, data mining and other fields, and it has broad application prospects. Text classification involves many key technologies: Chinese word segmentation, feature selection, the vector space model, classification models, classification evaluation metrics and so on. Most machine-learning-based automatic text categorization is built on the vector space model (VSM), in which text is expressed in a form that computers can process. Using a feature weighting algorithm, we select the features that play an important role and represent the text well, and ignore the features that contribute nothing to categorization. This serves two purposes: it reduces the dimension of the VSM and so improves the efficiency of text categorization, and it selects the features that better express the text and so improves precision. The feature weighting algorithm is therefore the basis and premise of text categorization. Following this analysis, this dissertation focuses on improving the term-weighting approach.
The contributions of this dissertation are as follows. The basic concepts of text classification and its development at home and abroad are briefly introduced. The key technologies of text classification are introduced, including text preprocessing, feature dimension reduction, text representation, classification algorithms and evaluation metrics. The TFIDF term-weighting approach is introduced and its weaknesses are analyzed; several improved approaches based on TFIDF are laid out, of which TFIDF-DI is the better one and is analyzed further. The concept of the skewed data set is introduced, control experiments are carried out with TFIDF and TFIDF-DI, and the results are analyzed to point out the shortcomings of these two approaches on skewed data sets. An improved method, TFIDF-λDI, is proposed on the basis of TFIDF-DI, and the KNN algorithm is used to compare the new approach with TFIDF and TFIDF-DI; the results show that the improved method brings a certain enhancement in classification performance.
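As an illustration of the baseline discussed in the abstract, the following is a minimal sketch of textbook TF-IDF weighting (tf × log(N/df)) in a vector space model, combined with a cosine-similarity KNN vote. It is not the thesis's implementation: the abstract does not give the formulas for TFIDF-DI or TFIDF-λDI, so they are not reproduced here. The function names (`fit_idf`, `weight`, `cosine`, `knn_predict`) and the toy, pre-segmented documents are assumptions made for the example.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Compute idf = log(N / df) for every term in the training documents.

    docs is a list of token lists (word segmentation is assumed already done).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n / count) for term, count in df.items()}

def weight(tokens, idf):
    """Map a token list to a sparse TF-IDF vector {term: weight}.

    Terms unseen during training are dropped (an assumption of this sketch).
    """
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_predict(query, train_vectors, labels, k):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(zip(train_vectors, labels),
                    key=lambda pair: cosine(query, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy skewed training set (hypothetical): three "sports" documents, one "finance".
train_docs = [
    ["match", "team", "goal"],
    ["team", "coach", "goal"],
    ["match", "coach", "score"],
    ["stock", "market", "bank"],
]
train_labels = ["sports", "sports", "sports", "finance"]

idf = fit_idf(train_docs)
train_vectors = [weight(doc, idf) for doc in train_docs]
query = weight(["stock", "bank", "team"], idf)

print(knn_predict(query, train_vectors, train_labels, k=1))
print(knn_predict(query, train_vectors, train_labels, k=3))
```

In this toy setting the single nearest neighbour is the finance document, but with k=3 the majority class supplies two of the three neighbours and wins the vote. This is only a small illustration of why skewed data sets are difficult for a plain TFIDF and KNN pipeline, not a reproduction of the thesis's experiments.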

Keywords/Search Tags:

Text classification, Term-weighting, Skewed data, TFIDF, TFIDF-λDI

Related items

1	Tfidf-based Text Classification Algorithm Research
2	Research On KNN Text Classification And Term Weighting Algorithm
3	Research On Term Weighting Approach Based On Information Gain And Entropy
4	Application Of Improved TFIDF Algorithm In Text Analysis
5	Improvement And Application To Weighting Terms Based On Text Classification
6	The Research On A Term Weight Calculation Method Based On The Term Mathematical Expectation
7	Research On Text Classification Of Web Text Mining
8	Research And Implementation Of KNN Text Classification Based On CURE Clustering
9	Research And Application Of Mobile Phone Users Classification Method Based On Characteristics Of Text
10	Correlation Algorithm Research And Realization Chinese Text SVM-based Classification