
Improved Vector Space Model And Its Application To Document Classification System

Posted on: 2010-01-01, Degree: Master, Type: Thesis
Country: China, Candidate: L Ning, Full Text: PDF
GTID: 2178360278958929, Subject: Computer applications
Abstract/Summary:
Automatic document classification is one of the key technologies in information processing. Many applications now depend on it, including search engines, email categorization, electronic meetings, and information filtering.
In current automatic document classification, the feature vector of a document does not reflect its semantic information, owing to irregularities in the documents, limitations of the algorithms, and other issues. Many researchers have worked on these problems and achieved interesting results.
This thesis describes the principle of the algorithm and briefly reviews the development, applications, and current status of automatic document categorization. It focuses on an improved vector space model (VSM) and its application in an automatic document categorization system. The method combines the VSM, a paragraph vector, and a word distance vector during feature vector extraction. A new Chinese word segmentation algorithm based on probability and a search tree is also proposed.
The experimental results show that the improved VSM improves both the recall and the precision of automatic document categorization. It not only reflects the semantic information of the document but also preserves the document's characteristics in the vector, which is useful for subsequent processing. The Chinese word segmentation algorithm performs well and improves the quality of the improved VSM.
Keywords/Search Tags: Vector Space Model, Paragraph Vector, Word Distance Vector, Chinese Word Segmentation
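For orientation, the baseline that the thesis builds on can be sketched as follows. This is a minimal illustration of a plain vector space model with TF-IDF weights and cosine similarity, not the improved model from the thesis: the paragraph vector and word distance vector are not reproduced, because the abstract does not specify them. The function names, the smoothed IDF formula, the nearest-neighbour decision rule, and the toy training data are all assumptions made for this example, and the documents are assumed to be already segmented into words.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector {term: weight}; terms unseen in training are ignored."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(tokens, training, idf):
    """Return the label of the most similar training document (1-nearest neighbour)."""
    query = vectorize(tokens, idf)
    best = max(training, key=lambda item: cosine(query, vectorize(item[0], idf)))
    return best[1]


# Hypothetical, pre-segmented toy corpus: (tokens, label) pairs.
training = [
    (["stock", "market", "price", "rise"], "finance"),
    (["bank", "loan", "interest", "rate"], "finance"),
    (["team", "win", "match", "goal"], "sports"),
    (["player", "score", "goal", "season"], "sports"),
]
idf = idf_table([tokens for tokens, _ in training])

print(classify(["goal", "match", "player"], training, idf))
print(classify(["loan", "interest"], training, idf))
```

In this baseline a document is reduced to a bag of words, so word order and paragraph structure are lost; that is the gap the paragraph vector and word distance vector described in the abstract are meant to address.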