Design And Realization Of Automated Text Categorization System For Chinese Documents Based On Relevancy

Posted on: 2007-12-23 | Degree: Master | Type: Thesis
Country: China | Candidate: X X Shang | Full Text: PDF
GTID: 2178360212475749 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, the volume of information on the network keeps growing. To provide effective and accurate information services, many researchers have begun to study data mining and knowledge discovery, a field that includes text indexing, information retrieval, information filtering, and information management. Text categorization is one of the important research topics in this field, and it plays a very important role in information automation.

However, automated text categorization for Chinese documents is still immature in both theory and practice, and the categorization process has many problems. For example, character matching is applied directly during categorization, while methods based on code matching are rare. In addition, frequency statistics are applied directly during text feature selection, and the association relationships between documents are seldom considered. Moreover, categorization results are rarely tested, and there is no systematic testing method. As a result, the performance of automated text categorization systems for Chinese documents is not ideal, and they remain far from practical and commercial use.

Building on existing results and our own study, this thesis presents an automated text categorization system for Chinese documents. The main results are as follows:

1. Based on hashing, the thesis presents a high-efficiency reverse maximum matching word segmentation algorithm. We build a Chinese word dictionary on top of a hash function and use the hashed length of the dictionary entries to match entries. This speeds up entry lookup and improves the efficiency of word segmentation.

2. The thesis presents a text feature selection method based on an optimization tree of association terms. We perform "term-level association mining" on the documents, find the association relationships, and build association sets. For the terms at each association level, we then build an optimization tree of association terms using the structure of an association set tree, and select the maximum frequency probability value under the maximum document probability as the text feature value. This effectively reduces the dimensionality of the documents, and the selected feature values are more representative.

3. The thesis introduces the Bayes discriminant theory for categorization and carries out F-verification on the results. We go through the Bayes categorization algorithm, derive the discriminant for text categories based on Bayes theory, and then carry out...
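The reverse maximum matching idea in item 1 can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `build_dictionary` and `reverse_max_match`, the tiny word list, and the choice of Python hash sets grouped by word length are all assumptions made for illustration. The sketch scans the text from right to left and, at each position, tries the longest dictionary word ending there, falling back to a single character.

```python
from collections import defaultdict


def build_dictionary(words):
    """Group words into hash sets keyed by word length (hypothetical layout)."""
    by_length = defaultdict(set)
    for word in words:
        by_length[len(word)].add(word)
    return by_length


def reverse_max_match(text, by_length):
    """Segment text right to left, preferring the longest dictionary word."""
    max_len = max(by_length) if by_length else 1
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shorter ones.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in by_length.get(size, ()):
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


# Illustrative dictionary, not taken from the thesis.
dictionary = build_dictionary(["研究", "研究生", "生命", "的", "起源"])
print(reverse_max_match("研究生命的起源", dictionary))
# ['研究', '生命', '的', '起源']
```

Grouping entries by length means each lookup touches only the hash set for the candidate's length. In this example, scanning from the right avoids the split 研究生 / 命 that a left-to-right longest match would produce with the same dictionary.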
Keywords/Search Tags: automatic text categorization, Chinese word segmentation, text feature selection, association analysis, Bayes discriminant criterion