
The Research And Implementation Of Text Categorization Technology In Integrated Risk Meta-Search Engine

Posted on: 2009-11-05  Degree: Master  Type: Thesis
Country: China  Candidate: F Hou  Full Text: PDF
GTID: 2178360242488526  Subject: Computer applications and technology
Abstract/Summary:
With the widespread use of computers and the rapid development of the Internet, a large amount of information is now communicated in the form of electronic documents. As a result, text categorization (TC) has become one of the most active research topics in information management and text mining. This dissertation presents exploratory research on several key techniques of TC, including text preprocessing, term selection, and categorization algorithms. The work is organized into the following four parts.
(1) A new term weighting method is presented. The shortcomings of Term Frequency and Inverse Document Frequency in describing a term's importance are analyzed, and a new term weighting method is proposed to address them. Experimental results show that this method performs more effectively.
(2) Term dimensionality reduction is an indispensable step in TC. Traditional term dimensionality reduction methods do not calculate the statistic value per category; they simply select the terms that discriminate strongly across all categories taken as a whole. This dissertation proposes a new term dimensionality reduction method. Experimental results show that this method sharply reduces the number of terms and selects terms more effectively.
(3) A combined algorithm that integrates the BAGGING algorithm with the KNN algorithm is proposed. Traditional text categorization methods cannot perform well in both classification speed and generalization. The combined method proposed in this dissertation balances the two and achieves better classification results.
(4) An integrated risk meta-search engine system is implemented. The system is based on TC; it provides richer information, scales well, and achieves good performance.
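The abstract does not give the details of the proposed weighting or of how BAGGING and KNN are combined, so the following is only a minimal sketch of the general idea under stated assumptions. It uses the standard TF-IDF baseline (not the new weighting method of the dissertation), cosine similarity, and a plain bootstrap-plus-majority-vote scheme. All function names, parameters, and the toy documents are hypothetical and chosen for illustration.

```python
import math
import random
from collections import Counter


def fit_idf(token_lists):
    """Baseline IDF, log(N / df), computed from the training documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector; terms unseen during training are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(train, query, k):
    """Majority label among the k training vectors most similar to query."""
    nearest = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def bagged_knn_predict(train, query, k=3, n_estimators=15, seed=0):
    """Assumed combination: each estimator is a KNN classifier over one
    bootstrap resample of the training set; the final label is the majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_estimators):
        sample = [rng.choice(train) for _ in train]  # draw with replacement
        votes[knn_predict(sample, query, k)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: (tokens, category)
DOCS = [
    (["credit", "risk", "loan", "default"], "finance"),
    (["market", "risk", "stock", "loss"], "finance"),
    (["loan", "bank", "credit", "rate"], "finance"),
    (["flood", "risk", "storm", "damage"], "disaster"),
    (["earthquake", "damage", "rescue"], "disaster"),
    (["storm", "flood", "rescue", "warning"], "disaster"),
]
IDF = fit_idf([tokens for tokens, _ in DOCS])
TRAIN = [(vectorize(tokens, IDF), label) for tokens, label in DOCS]
QUERY = vectorize(["loan", "credit", "market"], IDF)

print(knn_predict(TRAIN, QUERY, k=3))
print(bagged_knn_predict(TRAIN, QUERY))
```

In this sketch every bootstrap sample has the same size as the training set. Drawing smaller samples would reduce the per-query neighbour search at some cost in accuracy, which is one possible way to trade speed against generalization; whether the dissertation does this is not stated in the abstract.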
Keywords/Search Tags: Text Categorization, Term Weighting Computation, Term Selection, KNN, BAGGING Classification Algorithm