Automatic Document Classification Based On Probabilistic Topic Model

Posted on:2016-08-11

Degree:Master

Type:Thesis

Country:China

Candidate:D Xu

Full Text:PDF

GTID:2308330476952454

Subject:Computer technology

Abstract/Summary:

In recent years the Internet has developed at an amazing speed. It contains many kinds of raw information, including text, voice, and images. How to extract the most useful information has always been a major goal of information processing, and classifying documents accurately and efficiently is an important part of solving that problem.

After studying the current state of document classification and of Chinese word segmentation methods, this thesis designs two document classification methods based on probabilistic topic models and implements them as software. Together they classify Chinese documents in both the supervised and the unsupervised setting.

For the supervised case, a document classification method based on a probabilistic topic model is designed and implemented. From a set of training documents that have already been classified, the topic probability distribution of each document category is computed. A new document is then assigned to the category whose training-set topic distribution is closest to its own.

For the unsupervised case, where the document library contains no pre-classified documents, a document classification method based on fuzzy K-Means and probabilistic topics is designed and implemented. Keywords are first extracted from the document library and a certain number of documents are sampled. Topics are extracted from these keywords and documents, and the topic distributions of the sampled documents are computed. The remaining documents are then clustered around the sampled documents according to the proximity of their topic distributions. After the first clustering pass, the topics and topic distributions are re-determined and the documents are clustered again by topic distribution, in a second pass, a third pass, and so on, until the assignments no longer change. The result is the final classification.

The software is developed in C#. Its interface is friendly and it runs fast. It also supports Chinese word segmentation, word ordering, document classification, batch processing, and import and export.
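
The two methods described in the abstract can be illustrated with a minimal sketch. It is not the thesis software (which is written in C#) and makes several assumptions: topic distributions are taken as given rather than extracted from text, proximity is measured with Euclidean distance (the abstract does not name a measure), and the fuzzy K-Means step uses the standard fuzzy c-means update with fuzzifier m = 2. All function names and sample values are hypothetical.

```python
import math

def distance(p, q):
    """Euclidean distance between two topic distributions (assumed measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def classify(doc_topics, class_topics):
    """Supervised case: pick the category whose topic distribution is closest."""
    return min(class_topics, key=lambda label: distance(doc_topics, class_topics[label]))

def memberships_for(docs, centers, m):
    """Fuzzy membership of each document in each cluster; each row sums to 1."""
    rows = []
    for doc in docs:
        # Clamp distances so a document sitting exactly on a center is safe.
        dists = [max(distance(doc, c), 1e-12) for c in centers]
        inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows

def fuzzy_kmeans(docs, centers, m=2.0, tol=1e-6, max_iter=100):
    """Unsupervised case: repeat the clustering pass until centers stop moving.

    docs    -- list of topic distributions, one per document
    centers -- initial cluster centers, e.g. distributions of sampled documents
    """
    dims = len(docs[0])
    for _ in range(max_iter):
        u = memberships_for(docs, centers, m)
        new_centers = []
        for j in range(len(centers)):
            # Each center is the mean of all documents weighted by u_ij ** m.
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wi * doc[k] for wi, doc in zip(w, docs)) / total for k in range(dims)]
            )
        shift = max(distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships_for(docs, centers, m)

# Hypothetical topic distributions over three topics.
class_topics = {"sports": [0.8, 0.1, 0.1], "finance": [0.1, 0.8, 0.1]}
label = classify([0.7, 0.2, 0.1], class_topics)

docs = [[0.9, 0.05, 0.05], [0.85, 0.1, 0.05], [0.05, 0.9, 0.05], [0.1, 0.85, 0.05]]
centers, memberships = fuzzy_kmeans(docs, [docs[0], docs[2]])
# Hard assignment: each document goes to its highest-membership cluster.
clusters = [row.index(max(row)) for row in memberships]
```

The sketch stops when the cluster centers stop moving, which stands in for the abstract's "until no longer changes" criterion. Re-extracting the topics between passes, as the abstract describes, is omitted here.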

Keywords/Search Tags:

Probabilistic topic model, Document classification, K-Means, Chinese word segmentation

Related items

1	Based On The Theme By The Chinese Single-document Summarization System
2	Based On K-means The Chinese Text Clustering Algorithm
3	The Research Of Chinese Document Classification Algorithm
4	Construction Of Chinese Theme-Rheme Annotation Corpus And Study Of Automatic Analysis Of Chinese Theme-Rheme Structure
5	Research On Chinese Information Classification Based On Improved Bayesian Algorithms
6	Automatic Classification Research On Chinese Web Document Orientation
7	Research On Scene Design In Chinese Xianxia Theme Games
8	Research On "Going Global" Of Chinese Theme Books
9	A Research On Large Scale Automatic Chinese Webpages Classification
10	Web Document Automatic Classification Based On Keywords