
Automatic Document Classification Based On Probabilistic Topic Model

Posted on: 2016-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: D Xu
Full Text: PDF
GTID: 2308330476952454
Subject: Computer technology
Abstract/Summary:
In recent years the Internet has grown at an astonishing speed, and it now holds large amounts of raw information of many kinds, including text, voice, and images. Extracting the most useful information from this mass has long been a central goal of information processing, and classifying documents accurately and efficiently is an important part of the solution.

After studying the current state of document classification and of Chinese word segmentation methods, I designed and implemented two document classification methods based on probabilistic topic models and developed them into software. The two methods classify Chinese documents in the supervised and the unsupervised setting, respectively.

For the supervised setting, I designed and implemented a document classification method based on a probabilistic topic model. From a set of training documents that have already been classified, the topic probability distribution of each document class is computed. A new document is then assigned a class by comparing how close its topic distribution is to those of the training document collection.

For the unsupervised setting, in which the document library contains no classified documents, I designed and implemented a topic-based document classification method built on fuzzy K-Means. Keywords are first extracted from the document library and a certain number of documents are sampled. Topics are extracted from these keywords and documents, and the topic distribution of each sampled document is computed. The remaining documents are then clustered around the sampled documents according to the proximity of their topic distributions. After this first clustering, the topics and topic distributions are determined again and the documents are re-clustered by topic distribution a second time, a third time, and so on, until the result no longer changes. That final clustering is the classification.

The software is developed in C#. Its interface is user-friendly and it runs fast. It also supports Chinese word segmentation, word ordering, document classification, batch processing, and import and export.
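The supervised method in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's C# implementation. It assumes that topic distributions have already been inferred by some topic model, that each class is summarized by the average topic distribution of its training documents, and that "proximity" is measured with the Jensen-Shannon divergence. The abstract does not name the proximity measure, so that choice, like the function names `class_profiles` and `classify`, is hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and finite (assumed proximity measure)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def class_profiles(train_topics, labels):
    """Average the topic distributions of the training documents in each class."""
    sums, counts = {}, {}
    for dist, label in zip(train_topics, labels):
        acc = sums.setdefault(label, [0.0] * len(dist))
        for k, value in enumerate(dist):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(doc_topics, profiles):
    """Return the class whose average topic distribution is closest to the document's."""
    return min(profiles, key=lambda label: js_divergence(doc_topics, profiles[label]))


# Toy data (made up for illustration): three topics, two classes.
train_topics = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
labels = ["sports", "sports", "finance", "finance"]

profiles = class_profiles(train_topics, labels)
print(classify([0.6, 0.3, 0.1], profiles))
```

Averaging per class keeps the comparison cheap: a new document is compared with one profile per class rather than with every training document. A nearest-neighbour comparison against individual training documents would be an equally plausible reading of the abstract.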
Keywords/Search Tags: Probabilistic topic model, Document classification, K-Means, Chinese word segmentation
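The unsupervised procedure in the abstract clusters documents by their topic distributions with fuzzy K-Means. The sketch below shows the standard fuzzy K-Means (fuzzy c-means) iteration in Python under several assumptions that are not in the abstract: topic distributions are given as plain lists, distance is squared Euclidean, the fuzzifier is `m = 2`, and the sampled documents serve as the initial cluster centres. It leaves out the keyword extraction and the re-estimation of topics between rounds that the thesis describes, and all function names are hypothetical.

```python
def memberships(points, centers, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    rows = []
    for x in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        zeros = [v == 0.0 for v in d2]
        if any(zeros):
            # The point coincides with a centre: share membership among those centres.
            rows.append([1.0 / sum(zeros) if z else 0.0 for z in zeros])
        else:
            inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
            total = sum(inv)
            rows.append([v / total for v in inv])
    return rows


def update_centers(points, u, m):
    """Each centre is the mean of all points weighted by membership ** m."""
    n, k, dim = len(points), len(u[0]), len(points[0])
    centers = []
    for j in range(k):
        w = [u[i][j] ** m for i in range(n)]
        wsum = sum(w)
        centers.append(
            [sum(w[i] * points[i][d] for i in range(n)) / wsum for d in range(dim)]
        )
    return centers


def fuzzy_kmeans(points, init_centers, m=2.0, max_iter=100, tol=1e-6):
    """Alternate membership and centre updates until the centres stop moving."""
    centers = [list(c) for c in init_centers]
    u = memberships(points, centers, m)
    for _ in range(max_iter):
        new_centers = update_centers(points, u, m)
        shift = max(
            abs(a - b) for c, nc in zip(centers, new_centers) for a, b in zip(c, nc)
        )
        centers = new_centers
        u = memberships(points, centers, m)
        if shift < tol:
            break
    return centers, u


# Toy topic distributions (made up for illustration): two visible groups.
docs = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
    [0.10, 0.20, 0.70],
]

# Two "sampled" documents act as the initial centres.
centers, u = fuzzy_kmeans(docs, init_centers=[docs[0], docs[3]])
hard_labels = [row.index(max(row)) for row in u]
print(hard_labels)
```

Each document ends up with a degree of membership in every cluster. Taking the largest membership, as `hard_labels` does, turns the fuzzy result into a single class per document.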