
Research On Terms Co-occurrence Based Models And Algorithms For Text Mining

Posted on: 2011-09-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Chang
GTID: 1118360308454608
Subject: Management Science and Engineering
Abstract/Summary:
The volume of information has grown phenomenally over the past decades, and making sense of it by hand has become hopeless. Extracting information automatically from text has therefore become a key problem in information research. This thesis applies statistical machine learning methods based on term co-occurrence to Text Mining models and algorithms. The main contents are as follows:

First, a novel document model built from co-occurrence terms is presented, named the co-occurrence term vector space model (CTVSM). An association rule mining algorithm is employed to extract the co-occurrence terms in the document space. The document model is then defined in terms of these co-occurrence terms, and a measure of the similarity between two documents is defined on top of it. Experimental results show that, compared with Euclidean distance in the standard VSM, less similar documents lie farther apart under CTVSM and more similar documents lie closer together.

Second, on the basis of CTVSM, a novel document clustering algorithm is proposed. In this algorithm both documents and clusters are represented by CTVSM, and the measure between clusters is derived from the measure between documents. To decide the optimal number of clusters, clustering gain is introduced as a measure of clustering optimality. Experimental results show that the algorithm performs well and produces intuitively reasonable clustering configurations.

Third, CTVSM is used to cluster large-scale terms in the document space. A map of co-occurrence terms is defined, in which words are mapped to dots and the co-occurrence relationships between words are mapped to edges. A word clustering algorithm based on this map is proposed: it joins a word to a cluster on the basis of the change in the cluster's density. Results show that this algorithm is better than the conventional word clustering method in both performance and efficiency.

Finally, an application of the topic map extracted from the document space is proposed. A subject word extraction algorithm is improved by using the topic map. The topics of a document are identified by estimating a statistical topic model, which in turn identifies the document's topic term fields. The weights of terms are then adjusted according to these topic term fields. Experimental results indicate that the proposed method significantly outperforms methods that combine existing techniques.
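
The first contribution (extracting co-occurrence terms and comparing documents by them) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual definitions, so everything here is an assumption made for illustration: co-occurrence terms are taken to be term pairs that appear together in at least `min_support` documents (the support step of association rule mining), each document becomes a binary vector over those pairs, and similarity is plain cosine similarity. The function names are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations
import math


def frequent_cooccurrence_terms(docs, min_support=2):
    """Return term pairs that co-occur in at least `min_support` documents.

    Assumption: a pair-level support threshold stands in for the
    association rule mining step mentioned in the abstract.
    """
    counts = Counter()
    for tokens in docs:
        # Count each unordered pair at most once per document.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return sorted(pair for pair, n in counts.items() if n >= min_support)


def to_vector(tokens, features):
    """Binary vector: 1.0 where both terms of a frequent pair occur in the document."""
    present = set(tokens)
    return [1.0 if a in present and b in present else 0.0 for a, b in features]


def cosine(u, v):
    """Cosine similarity (an assumed stand-in for the thesis's similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only.
docs = [
    ["text", "mining", "cluster"],
    ["text", "mining", "topic"],
    ["cluster", "density", "graph"],
    ["cluster", "density", "topic"],
]
features = frequent_cooccurrence_terms(docs, min_support=2)
vectors = [to_vector(d, features) for d in docs]
```

On this toy corpus only two pairs reach the support threshold, so documents sharing a frequent pair end up with identical vectors and documents sharing none end up orthogonal, even where they share a single word such as "cluster" or "topic". This mirrors, in a very simplified way, the separation effect the abstract reports for CTVSM.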
Keywords/Search Tags:Text Topic Mining, Terms Co-occurrence, Document Clustering, Terms Clustering, Keyword Extraction