The Research Of Text Classification Based On Improved Term Weighting Method

Posted on: 2011-02-21
Degree: Master
Type: Thesis
Country: China
Candidate: P Li
Full Text: PDF
GTID: 2178360305989235
Subject: Computer application technology
Abstract/Summary:
With the rapid development of information technology and the Internet, the number of electronic documents is growing exponentially. This vast amount of information is convenient for users, but it also makes useful information harder to find. How to search, organize, and manage information effectively, and how to retrieve useful information quickly and accurately, have become important problems. Against this background, text classification based on machine learning is becoming an increasingly important research field.

Text categorization has been studied extensively and has progressed rapidly. It is one of the hotspots and key techniques in data mining and information retrieval, and interest in it has grown strongly in recent decades. A text categorization system assigns texts to categories according to their content under a given classification model, which helps users organize and mine text information. It has therefore become one of the most important research areas in information processing, with great potential for further development. Text classification also has great practical value: it is widely applied in information retrieval and information filtering, and it greatly improves the efficiency with which information is handled.

This thesis focuses on improving the accuracy of text categorization by improving traditional term weighting methods. After a thorough study of existing text classification approaches based on term weighting, we improve a traditional term weighting method, tf-idf, and obtain a new term weighting method. Traditional term weighting algorithms consider only factors such as tf (term frequency) and idf (inverse document frequency). This approach simply assumes that low-frequency terms are important and high-frequency terms are unimportant, so it often assigns higher weights to rare terms.

To compensate for this deficiency, we present a new term weighting approach to improve the efficiency and accuracy of classification. Experimental results indicate that the new approach is feasible and improves efficiency. With a KNN classifier on the widely used benchmark data set Reuters-21578, the improved term weighting approach outperforms the traditional term weighting method in terms of precision, recall, and F1.
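
For orientation, the traditional baseline described above can be sketched in a few lines of Python. This is a minimal illustration only: it shows classic tf-idf weighting, a cosine-similarity KNN vote, and the precision/recall/F1 measures, and it does not reproduce the improved weighting method of the thesis, which the abstract does not specify. The idf form `log(N / df)`, the function names, and the toy tokens are all assumptions made for the example.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Compute idf(t) = log(N / df(t)) over tokenized training documents.

    This is one common idf variant, assumed here for illustration.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def vectorize(tokens, idf):
    """Map a tokenized document to a sparse {term: tf * idf} vector.

    Terms never seen in training have no idf and are skipped.
    """
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_predict(query_vec, train_vecs, train_labels, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def precision_recall_f1(true_labels, pred_labels, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

In this baseline, a term that appears in a single training document always receives a larger idf than a term that appears in several, regardless of how the term is distributed across categories. That is the behavior the abstract identifies as a deficiency.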
Keywords/Search Tags:text classification, tf-idf, term weighting, KNN