Research And Application Of Internet Chinese Text Classification

Posted on: 2012-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: C Chen
Full Text: PDF
GTID: 2178330335460429
Subject: Computer Science and Technology
Abstract/Summary:
The thesis studies the technologies behind Internet Chinese text classification, improves and optimizes the CHI (chi-square) algorithm to achieve better results, and implements an Internet Chinese text categorization system. The thesis completed the following tasks:

(1) An Internet text information acquisition subsystem was developed on the Heritrix platform. The subsystem customizes the Heritrix crawler to collect Internet text data automatically, continuously, and in real time.

(2) A text analysis module was built on a text-block algorithm. It effectively removes the large amount of useless information on a page. The extracted text is reorganized into the fields title, source, author, publishing time, and body.

(3) A text classification subsystem was implemented. Text classification comprises four main steps: text segmentation, stop-word processing, feature extraction, and classifier learning. Term Frequency, Document Frequency, and Entropy Calculation are used for stop-word processing; Document Frequency, Chi-square Statistic (CHI), and Information Gain are used for feature word selection; Bayesian, Decision Tree, and k-Nearest Neighbor are used as classifiers.

(4) The classification subsystem was improved. Comparative experiments were used to weigh the advantages and disadvantages of the various algorithms, and the thesis identified the best combination of algorithms for text classification. The CHI algorithm was also improved, which raised the accuracy of text classification.

Combining text classification with web crawler technology achieves automatic classification of Internet information. The system has both theoretical significance and practical application value.
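To make the Chi-square Statistic (CHI) feature selection step concrete, the following is a minimal sketch of the standard, textbook CHI score for a term and a category. It is not the improved CHI variant the thesis proposes, which the abstract does not describe. The function names `chi_square` and `chi_square_scores` and the toy tokens are hypothetical, and the documents are assumed to be already segmented into words with stop words removed.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square statistic for one term and one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. the term occurs in every document).
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, labels, category):
    """Score every term against `category`.

    docs:   list of token lists (hypothetical pre-segmented input)
    labels: list of category labels, one per document
    """
    n_total = len(docs)
    n_cat = sum(1 for y in labels if y == category)

    df_all = Counter()  # document frequency over all documents
    df_cat = Counter()  # document frequency inside the category
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_all[term] += 1
            if y == category:
                df_cat[term] += 1

    scores = {}
    for term, df in df_all.items():
        a = df_cat[term]
        b = df - a
        c = n_cat - a
        d = n_total - n_cat - b
        scores[term] = chi_square(a, b, c, d)
    return scores


if __name__ == "__main__":
    # Toy corpus; ASCII tokens stand in for segmented Chinese words.
    docs = [
        ["football", "match"],
        ["football", "goal"],
        ["stock", "rise"],
        ["stock", "match"],
    ]
    labels = ["sports", "sports", "finance", "finance"]
    scores = chi_square_scores(docs, labels, "sports")
    # Rank terms by how strongly they are associated with the category.
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(term, round(score, 3))
```

A feature selector would keep the top-scoring terms per category as the feature vocabulary. Note that the plain statistic is symmetric: a term that is strongly absent from a category scores as high as one that is strongly present, which is one commonly cited weakness of unmodified CHI.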
Keywords/Search Tags: Web crawler, text classification, Chinese word segmentation, stop-word processing, feature word extraction