
Research Of Chinese Text Classification Based On Mixed Feature

Posted on: 2013-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Lin
Full Text: PDF
GTID: 2298330467478732
Subject: Control engineering
Abstract/Summary:
With the rapid development of information technology and the arrival of the we-media era, more and more information exists on the Internet as electronic text. Extracting accurate and valuable knowledge from massive Web text has become a major goal of information processing. As a research hotspot in this field, automatic text classification can process massive amounts of text and helps solve the problem of disordered information. As the technical basis for information retrieval, information filtering, and search engines, it also has broad application prospects.

The application background of this thesis is topic information retrieval in vertical search, and efficient topic classification is the main task of the system. Because vertical search places higher demands on how directly Web content matches a topic, we developed a Chinese text classification method based on mixed features, which addresses the weak directness of traditional Web text classification results. The research focuses on Web text extraction, mixed feature modeling, and classification strategies.

The Web text is extracted by an extractor. Ads, images, and hyperlinks in Web pages cause great trouble for Web text classification; the extractor cleans each page so that only its text content remains.

The vector space model is built from mixed features, which consist of term features and Web features. Term features are selected through natural language processing and feature dimension reduction, and weighted with an improved term weight algorithm, whose classification performance is verified by corresponding experiments. The Web feature set consists of the pages' linguistic characteristics and network characteristics, and is modeled through statistics and normalization.

Machine learning is used to train the classifiers. We study the support vector machine and optimize its parameters to improve the recognition performance of the topic classifier and the Web filter.

A Chinese text classification system is proposed and implemented in this thesis. The system cascades a topic classifier and a Web filter. It first fetches Web resources from the Internet and extracts the text, then establishes the mixed feature set and builds the classifiers on those features. Finally, experiments verify that the system has higher classification accuracy and strong Web page filtering capability.
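The mixed feature idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: standard TF-IDF stands in for the improved term weight algorithm, which the abstract does not specify, and the function names and the two example Web features (a link count and a text-length score) are hypothetical. Documents are assumed to be already segmented into Chinese word tokens.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return (vocab, vectors) with plain TF-IDF weights.

    Assumption: plain TF-IDF is a placeholder for the thesis's
    improved term weight algorithm, which is not given in the abstract.
    docs is a list of token lists (already word-segmented).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        vectors.append([tf[t] / length * idf[t] for t in vocab])
    return vocab, vectors


def min_max_normalize(rows):
    """Scale each column of a numeric table to [0, 1].

    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def mixed_features(term_vectors, web_vectors):
    """Concatenate term features and normalized Web features per page."""
    return [terms + web for terms, web in zip(term_vectors, web_vectors)]


# Hypothetical example: two segmented pages and two raw Web features
# per page (link count, text-length score); both features are assumptions.
docs = [["足球", "比赛", "足球"], ["股票", "市场"]]
raw_web = [[10, 2], [30, 8]]

vocab, term_vectors = tfidf_weights(docs)
web_vectors = min_max_normalize(raw_web)
mixed = mixed_features(term_vectors, web_vectors)
```

Each row of `mixed` could then be passed to a support vector machine; a cascade in the spirit of the abstract would train one classifier for the topic decision and a second one as the Web filter on the pages the first accepts.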
Keywords/Search Tags:text classification, term weight algorithm, mixed feature, support vector machine