Research On Several Models In Text Classification And Clustering

Posted on:2012-07-07

Degree:Master

Type:Thesis

Country:China

Candidate:S Z He

GTID:2218330338468510

Subject:Computer Science and Technology

Abstract/Summary:

With the rapid and continuous growth of text data on the Internet, text mining, an effective tool for organizing and managing large amounts of text data, has been studied intensively and applied widely. This thesis proposes improved methods for two problems in text mining: text classification and text clustering.

For the supervised learning problem of text classification, traditional methods are good at assigning documents to a small number of categories. Classification over a large-scale hierarchy, however, is a challenging task because there are many categories with cross-link relationships. The "deep classification" method is an effective framework that makes this problem tractable. It consists of two stages: a search stage, which selects a number of candidate categories for a given test document, and a classification stage, which determines the final category by applying a more accurate classifier to those candidates. We propose an improved deep classification model. First, we propose a new method for evaluating the effectiveness of the search stage. Second, we select candidate categories based on both category and document information. Finally, we train a centroid-based Rocchio classifier that uses information from related categories, such as the top-level category, parent categories, sibling categories, and subcategories.

For the unsupervised learning problem of text clustering, it is important to compute the correlation among documents accurately and efficiently. A common approach is to compute the statistical correlation between document vectors directly, but this does not take the adjacency of documents into account. We propose a new method based on a Markov network model that takes into account not only the direct statistical information but also neighborhood information when computing the correlation. We build a Markov network and form a weighted combination of the transition matrices of each step, which strengthens the described correlation among within-class documents and widens the gap between documents of different classes. Finally, we cluster the documents using this correlation description, in which the gap between classes is pronounced.

Our primary contributions are as follows.

1) An improved classification model is proposed after a systematic study of the methods and applications of large-scale text classification. A series of experiments shows that related categories, especially the top-level and sibling categories, play a useful role in determining the target class.

2) We represent the text data set with a Markov network model and describe the correlation among documents by a weighted combination of the transition matrices of each step, then cluster the documents based on this description. A series of experiments shows that the weighted combination of per-step transition matrices improves clustering quality in text clustering.
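
The weighted combination of per-step transition matrices described above can be illustrated with a minimal sketch. The abstract does not specify how the network is built or how the steps are weighted, so everything concrete here is an assumption made for illustration: cosine similarity between term-weight vectors as edge strength, row normalization to obtain the one-step transition matrix, and geometrically decaying weights over three steps. The function names and parameters (`transition_matrix`, `combined_correlation`, `steps`, `decay`) are hypothetical and not taken from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def transition_matrix(docs):
    """Row-normalize pairwise similarities into a one-step transition matrix.

    Assumption: edge strength is cosine similarity, with no self-loops.
    """
    n = len(docs)
    sim = [[cosine_similarity(docs[i], docs[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    rows = []
    for row in sim:
        total = sum(row)
        # A document similar to nothing gets a uniform row,
        # so every row still sums to 1.
        rows.append([x / total for x in row] if total else [1.0 / n] * n)
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def combined_correlation(p, steps=3, decay=0.5):
    """Weighted sum of P^1 .. P^steps.

    Assumption: step t (counting from 0) has weight decay**t,
    and the weights are normalized to sum to 1.
    """
    weights = [decay ** t for t in range(steps)]
    norm = sum(weights)
    n = len(p)
    combined = [[0.0] * n for _ in range(n)]
    power = p
    for w in weights:
        for i in range(n):
            for j in range(n):
                combined[i][j] += (w / norm) * power[i][j]
        power = matmul(power, p)
    return combined


# Toy term-weight vectors: documents 0-2 share one topic, 3-5 another.
docs = [
    [3, 1, 0, 0],
    [1, 3, 0, 0],
    [2, 2, 1, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [0, 1, 2, 2],
]
correlation = combined_correlation(transition_matrix(docs))
```

Each row of `correlation` is a convex combination of rows of stochastic matrices, so it can be read as a distribution over documents that reflects multi-step neighborhood information as well as direct similarity. Any clustering algorithm that accepts a pairwise affinity matrix could then consume it; which one the thesis uses is not stated in the abstract.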

Keywords/Search Tags:

Text mining, Large-scale text categorization, Deep Classification, Text clustering, Markov Network

Related items

1	Research On Key Problems In Text Mining
2	Research On Large Scale Hierarchical Classification For Internet Text
3	Research On Key Problems About Large-Scale Text Clustering
4	Research On News Classification And Clustering Based On Text Mining
5	Chinese Text Mining
6	Research On The Application Of Text Classification And Clustering In Network Security Operation System
7	Design And Implementation Of Large-scale Short Text Classification System
8	Research On Key Problems In Text Mining Based On Cloud Method
9	Research And Realization Of Clustering Guided Web Chinese Text Classification Based On SVM
10	Design And Implementation Of Commodity Classification Management And Retrieval System Based On Large-scale Short Text Classification