Research On Several Models In Text Classification And Clustering

Posted on: 2012-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: S Z He
Full Text: PDF
GTID: 2218330338468510
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid and continuous growth of text data on the Internet, text mining has been studied intensively and applied widely as an effective tool for organizing and managing large amounts of text. This thesis proposes improved methods for two text mining problems: text classification and text clustering.

For the supervised learning problem of text classification, traditional methods are good at assigning documents to a small number of categories. Classification over a large-scale hierarchy, however, is a challenging task because there are many categories with cross-link relationships among them. The "deep classification" method is an effective framework that makes this problem tractable. It consists of two stages: the search stage selects a number of candidate categories for a given test document, and the classification stage determines the final category by applying a more accurate classifier to those candidates. We propose an improved deep classification model. First, we propose a new method for evaluating the effectiveness of the search stage. Second, we select candidate categories based on both category and document information. Finally, we train a centroid-based Rocchio classifier that uses information from related categories, such as the top-level category, parent categories, sibling categories and subcategories.

For the unsupervised learning problem of text clustering, it is important to compute the correlation among documents accurately and efficiently. A common approach is to compute the statistical correlation between document vectors directly, but this does not take the adjacency of documents into account. In this thesis, we propose a new method based on a Markov network model that takes into account not only direct statistical information but also neighborhood information when computing the correlation.
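The two-stage deep classification scheme described above (a search stage followed by a classification stage) can be illustrated with a minimal sketch. It is not the thesis's implementation: the term-frequency vectors, the cosine ranking used for the search stage, the `related` mapping, the weight `beta`, and the way related categories are added to each prototype are all assumptions chosen for illustration, and the toy category names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w / len(vectors)
    return dict(total)


def search_stage(doc, centroids, k):
    """Search stage: keep the k categories whose centroids are closest to doc."""
    ranked = sorted(centroids, key=lambda c: cosine(doc, centroids[c]), reverse=True)
    return ranked[:k]


def prototype(cat, centroids, related, beta=0.3):
    """Centroid-based prototype: own centroid plus beta times the mean
    centroid of related categories (hypothetical weighting scheme)."""
    proto = defaultdict(float, centroids[cat])
    rel = [centroids[r] for r in related.get(cat, []) if r in centroids]
    if rel:
        for t, w in centroid(rel).items():
            proto[t] += beta * w
    return dict(proto)


def deep_classify(text, training, related, k=2):
    """Run both stages and return (predicted category, candidate list)."""
    doc = tf_vector(text)
    centroids = {c: centroid([tf_vector(d) for d in docs])
                 for c, docs in training.items()}
    candidates = search_stage(doc, centroids, k)
    protos = {c: prototype(c, centroids, related) for c in candidates}
    best = max(candidates, key=lambda c: cosine(doc, protos[c]))
    return best, candidates


# Toy hierarchy: two sibling categories under "sports" and one unrelated category.
training = {
    "sports/football": ["football goal league match", "football striker goal"],
    "sports/tennis": ["tennis serve match court", "tennis racket serve"],
    "science/physics": ["quantum particle energy", "particle physics energy field"],
}
related = {
    "sports/football": ["sports/tennis"],
    "sports/tennis": ["sports/football"],
}

label, candidates = deep_classify("goal scored in the football match", training, related)
print(label, candidates)
```

The search stage only has to keep the correct category among the candidates, so a cheap ranking suffices there; the more careful comparison is confined to the short candidate list.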
We build a Markov network and form a weighted combination of the transition matrices of each step, which strengthens the described correlation among documents within a class and widens the gap between documents of different classes. Finally, we cluster the documents using this correlation description, in which the gap between classes is pronounced.

Our primary contributions are as follows.

1) After a systematic study of the methods and applications of large-scale text classification, we propose an improved classification model. A series of experiments shows that related categories, especially the top-level and sibling categories, play a useful role in determining the target class.

2) We represent the text data set with a Markov network model, describe the correlation of documents by a weighted combination of the transition matrices of each step, and then cluster the documents based on this description. A series of experiments shows that the weighted combination of the per-step transition matrices improves the results of text clustering.
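The weighted combination of per-step transition matrices can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's method: the similarity matrix `S`, the step weights, the threshold, and the use of thresholded connected components as the final clustering step are all hypothetical choices, since the abstract does not specify them.

```python
def row_normalize(S):
    """Turn a non-negative similarity matrix into a row-stochastic
    transition matrix (assumes every row has a positive sum)."""
    return [[x / sum(row) for x in row] for row in S]


def matmul(A, B):
    """Product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def multi_step_correlation(S, weights):
    """Return M = sum_t weights[t-1] * P^t, where P is the one-step
    transition matrix built from S and t runs from 1 to len(weights)."""
    P = row_normalize(S)
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    Pt = P  # current power P^t, starting at t = 1
    for w in weights:
        for i in range(n):
            for j in range(n):
                M[i][j] += w * Pt[i][j]
        Pt = matmul(Pt, P)
    return M


def threshold_clusters(M, threshold):
    """Placeholder clustering: connected components of the graph that links
    i and j when the symmetrized correlation reaches the threshold."""
    n = len(M)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and (M[i][j] + M[j][i]) / 2 >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy document similarities: documents 0-1 and 2-3 form two groups
# with weak links between the groups.
S = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]

M = multi_step_correlation(S, weights=[0.5, 0.3, 0.2])
labels = threshold_clusters(M, threshold=0.2)
print(labels)
```

Because the weights here sum to 1, each row of `M` is a convex combination of probability distributions and still sums to 1. Higher powers of the transition matrix carry neighborhood information: two documents that share many neighbors accumulate correlation even when their direct similarity is low.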
Keywords/Search Tags: Text mining, Large-scale text categorization, Deep Classification, Text clustering, Markov Network