The Research Of Text Clustering And Keywords Extraction Based On Complex Network Theory

Posted on: 2012-05-20
Degree: Master
Type: Thesis
Country: China
Candidate: F H Xie
GTID: 2218330335475790
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, the volume of text data is growing at a remarkable rate. How to quickly find useful information in large text collections, and how to manage and use that information properly, has become an urgent problem. Applying data mining technology appropriately can help solve it efficiently.

Text clustering and text keyword extraction are important areas of text mining research. Text clustering divides a collection of documents into several clusters such that the texts assigned to the same cluster are more similar to each other than to texts assigned to different clusters. As an unsupervised machine learning method, text clustering does not require a training set, nor does the number of clusters need to be known in advance, which makes it highly flexible and practical. Text keyword extraction is one of the important text information processing technologies. It is the premise and foundation of information processing tasks including automatic categorization, automatic clustering, and automatic summary generation.

This thesis introduces the background of text mining and text keyword extraction, along with their research significance, the current state of research, and the relevant theoretical knowledge. It surveys the classic theory from the domestic and international literature, and proposes a new text clustering method and a new text keyword extraction method. The main work includes the following two aspects:

1. A text clustering method based on community partitioning in complex networks is proposed. Firstly, a new algorithm for detecting community structures in a weighted complex network is proposed. To partition the weighted complex network into groups, the algorithm repeatedly searches for density sets and performs appropriate operations on them. Secondly, the algorithm is applied to cluster text documents represented by the vector space model. A weighted complex network is constructed from the similarity between each pair of documents, calculated by the cosine function, and the community structure in this network is then detected by the proposed algorithm. Finally, experiments clustering samples from the Reuters-21578 data set show that the proposed algorithm achieves good clustering results.

2. After analyzing the characteristics and disadvantages of existing keyword extraction algorithms based on complex networks, a new keyword extraction algorithm based on a weighted complex network is proposed. Firstly, a weighted complex network model is constructed according to the relationships between the feature words of the text. Secondly, the weighted clustering coefficient and betweenness are introduced to calculate each node's multi-feature value. Finally, the keywords are extracted according to the multi-feature value. The experimental results show that the keywords extracted by this algorithm contribute strongly to the subject of the text, and that the accuracy of keyword extraction is better than that of the existing algorithms.
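The network construction step in the first contribution (documents represented in the vector space model, edges weighted by cosine similarity) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the raw term-frequency weighting, the whitespace tokenizer, the `threshold` parameter, and the function names `tf_vector`, `cosine`, and `build_similarity_network` are all assumptions made for the example, and the community detection step based on density sets is not shown.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector for one document (assumed weighting scheme)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_similarity_network(docs, threshold=0.1):
    """Return {(i, j): weight} for document pairs with similarity above threshold.

    The threshold is a hypothetical parameter used here to keep the
    network sparse; the abstract does not say how edges are filtered.
    """
    vectors = [tf_vector(d) for d in docs]
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            w = cosine(vectors[i], vectors[j])
            if w > threshold:
                edges[(i, j)] = w
    return edges


docs = [
    "oil prices rise as crude supply falls",
    "crude oil supply falls and prices rise",
    "wheat harvest grows after good rain",
]
network = build_similarity_network(docs)
```

Here the first two documents share most of their terms and end up joined by a heavily weighted edge, while the third shares none and stays isolated. A community detection algorithm would then operate on this weighted edge set.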
Keywords/Search Tags: text clustering, keywords extraction, weighted complex network, density set, multi-feature value