
The Research Of Chinese Document Classification Algorithm

Posted on: 2005-08-11 | Degree: Master | Type: Thesis
Country: China | Candidate: B Zhang
GTID: 2168360125956309 | Subject: Communication and Information System
Abstract/Summary:
In recent years, information processing has become increasingly important for obtaining useful information. Text categorization, the automated assignment of natural language texts to predefined categories based on their content, is a task of increasing importance. This paper proposes a new concept-based module for the automatic categorization of natural language documents. The module builds its vector space feature model by calculating the mutual information between words and categories. An intelligent Chinese word segmentation system based on syntactic understanding is then used to obtain the TF-IDF representation of the test document in the vector space model (VSM). How-Net serves as the main knowledge source for computing word similarity, and the word similarity is used to weight the features of the document vector. After conversion to vectors, the training documents are learned by support vector machines (SVMs), and the resulting support vectors are used for classification. A test document can then be classified once it has been converted to vector features. Based on this module, our objective is to design a document classification system that achieves high recall and precision with low CPU cost and high operating speed. Experiments on the Fudan University corpus and the People's Daily corpus show that the algorithm can be applied successfully to document classification. This paper puts forward two new ideas: 1. Word similarity based on How-Net is used to calculate the vector features of a document, so that the features reflect more of the document's content; 2. The Difference-Similitude Matrix Algorithm (DSMA) is used to reduce attributes and retrieve formula information in the document information system. With DSM, more category features can be obtained as the number of test documents grows.
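The feature pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract gives no formulas, so the document-count form of mutual information, the smoothed IDF variant, and the rule for spreading weight across similar words are all assumptions. The function names and the `sim` dictionary (standing in for How-Net similarity scores) are hypothetical, and documents are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term, category):
    """Pointwise mutual information between a term and a category,
    estimated from document counts (an assumed, commonly used form)."""
    n = len(docs)
    pairs = list(zip(docs, labels))
    a = sum(1 for d, c in pairs if c == category and term in d)      # in category, has term
    b = sum(1 for d, c in pairs if c != category and term in d)      # other category, has term
    c_only = sum(1 for d, c in pairs if c == category and term not in d)
    if a == 0:
        return float("-inf")  # term never occurs in this category
    return math.log(a * n / ((a + c_only) * (a + b)))


def tf_idf_vectors(docs):
    """Return (vocab, vectors) with one TF-IDF vector per document.
    The '+ 1.0' in the IDF term is a smoothing choice, not from the source."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] / len(d) * idf[t] for t in vocab])
    return vocab, vectors


def similarity_weighted(vocab, vector, sim):
    """Re-weight features so each word also receives weight from similar words:
    v'_i = sum_j sim(i, j) * v_j, with sim(i, i) = 1 (assumed weighting rule).
    `sim` is a hypothetical lookup of symmetric word-pair similarities."""
    out = []
    for wi in vocab:
        total = 0.0
        for wj, vj in zip(vocab, vector):
            if wi == wj:
                s = 1.0
            else:
                s = sim.get((wi, wj), sim.get((wj, wi), 0.0))
            total += s * vj
        out.append(total)
    return out


# Toy usage with made-up, pre-segmented documents.
docs = [["stock", "market", "rise"], ["stock", "price", "fall"], ["team", "win", "match"]]
labels = ["finance", "finance", "sports"]
vocab, vectors = tf_idf_vectors(docs)
weighted = similarity_weighted(vocab, vectors[0], {("market", "price"): 0.8})
print(dict(zip(vocab, weighted)))
```

In this sketch, a word that is absent from a document but similar to a word that is present (here "price" and "market") receives a nonzero weight, which is one way the features could "reflect more content" as the abstract puts it. The weighted vectors would then be passed to an SVM trainer, which is omitted here.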
Keywords/Search Tags:Document classification, How-net, Word similarity, Support vector machine