Research On Terms Co-occurrence Based Models And Algorithms For Text Mining

Posted on:2011-09-15

Degree:Doctor

Type:Dissertation

Country:China

Candidate:P Chang

GTID:1118360308454608

Subject:Management Science and Engineering

Abstract/Summary:

The volume of available information has grown enormously over the past decades, and making sense of it manually is no longer feasible. Extracting information automatically from text has therefore become a key problem in information research. This thesis applies statistical machine learning methods built on term co-occurrence to text mining models and algorithms. The main contributions are as follows.

First, a novel document model built on co-occurring terms is presented, named the co-occurrence term vector space model (CTVSM). An association rule mining algorithm is used to extract the co-occurrence terms from the document space. The document model is then defined over these co-occurrence terms, and a similarity measure between two documents is defined on top of it. Experimental results show that, compared with Euclidean distance in the standard vector space model (VSM), less similar documents lie farther apart under CTVSM and more similar documents lie closer together.

Second, a novel document clustering algorithm is proposed on the basis of CTVSM. Documents and clusters are both represented in CTVSM, and the measure between clusters is derived from the measure between documents. Clustering gain is adopted as a measure of clustering optimality to determine the optimal number of clusters. Experimental results indicate that the algorithm performs well and produces intuitively reasonable clustering configurations.

Third, CTVSM is used to cluster large-scale sets of terms in the document space. A map of co-occurrence terms is defined in which words are mapped to nodes and co-occurrence relationships between words are mapped to edges. A word clustering algorithm based on this map is proposed; it joins a word to a cluster according to the resulting change in the cluster's density. Experiments show that this algorithm outperforms the conventional word clustering method in both performance and efficiency.

Finally, an application of the topic map extracted from the document space is proposed: a subject word extraction algorithm is improved by using the topic map. The topics of a document are identified by estimating a statistical topic model, which in turn identifies the document's topic term fields. Term weights are then adjusted according to these topic term fields. Experimental results indicate that the proposed method significantly outperforms methods that combine existing techniques.
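
The abstract does not give the exact CTVSM definitions, so the following Python sketch only illustrates the general idea behind the first contribution: represent each document by the co-occurring term pairs it contains, then compare documents in that space. Everything specific here is an assumption made for illustration: the function names, the use of a simple document-frequency threshold (`min_support`) in place of a full association rule mining algorithm, the binary weights, and cosine similarity as the document measure.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def frequent_pairs(docs, min_support=2):
    """Return term pairs that co-occur in at least `min_support` documents.

    Assumption: a plain support threshold stands in for the association
    rule mining step mentioned in the abstract.
    """
    counts = Counter()
    for tokens in docs:
        # Count each unordered pair at most once per document.
        counts.update(combinations(sorted(set(tokens)), 2))
    return {pair for pair, n in counts.items() if n >= min_support}


def pair_vector(tokens, pairs):
    """Sparse binary vector: 1 for every frequent pair present in the document."""
    present = set(tokens)
    return {p: 1 for p in pairs if p[0] in present and p[1] in present}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus (hypothetical data, not from the thesis).
docs = [
    ["text", "mining", "cluster"],
    ["text", "mining", "topic"],
    ["graph", "cluster", "density"],
    ["graph", "density", "topic"],
]

pairs = frequent_pairs(docs, min_support=2)
vectors = [pair_vector(d, pairs) for d in docs]

print(sorted(pairs))
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

In this toy corpus, documents 0 and 2 share the single term "cluster" and so have nonzero similarity in a plain term-based VSM, but they share no frequent co-occurring pair and end up with similarity 0 in the pair space. This is consistent in spirit with the reported effect that dissimilar documents move farther apart, though the thesis's own measure may differ.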

Keywords/Search Tags:

Text Topic Mining, Terms Co-occurrence, Document Clustering, Terms Clustering, Keyword Extraction

Related items

1	Research And Realization Of Text-based Keyword Extraction Method
2	Internet Public Opinion Monitoring And Analysis System To Achieve
3	Information Extraction of cyber security related terms and concepts from unstructured text
4	The Research On Keywords Extraction From Chinese News Web Pages Based On Clustering
5	Extended SBN Retrieval Model Based On Ontology Terms Relationship
6	Research On Efficient Document Clustering Using Improvised Sub-Document Based Framework
7	The Research And Implementation Of Business Text Hot Topic Discovery System Based On Topic Word
8	The Method Of Fine-Grained Topic Information Extraction And Text Clustering Based On Chinese Phrase
9	Research On Keyword Extraction Technology Oriented To Conversational Text
10	Research Of Text Mining And Application In Topic Search