Research And Realization Of Network Consensus Monitor System Based On The Incremental Text Mining

Posted on: 2011-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: L F Wang
Full Text: PDF
GTID: 2178330332485828
Subject: Computer software and theory
Abstract/Summary:
The rapid development of the Internet makes it easy for people to share their thoughts, emotions, and attitudes through blogs, forums, and links. Information spreads rapidly, and a topic can become popular within a short time and be reposted widely. Such content includes suggestions and admonitions addressed to the government as well as malicious libel from some reactionary organizations, so government departments need to identify the hot topics that Internet users discuss over a given period. Valuable content can then be extracted so that the government can take corresponding measures and guide the network consensus reasonably. Discovering and monitoring the network consensus effectively has therefore become very important.

Generally, a network consensus monitor consists of four parts: data collection, text preprocessing, text mining, and results presentation. Text mining is mainly responsible for the automatic discovery of new hot topics and is the core component of the system. Text clustering is widely used for discovering new hot topics. The volume of Internet information is large, and re-clustering the entire data set each time a new batch arrives takes quite a long time. Incremental text clustering processes only the newly collected data, which effectively saves time, so it is widely used in network consensus monitor systems.

After studying many text clustering and incremental clustering algorithms, this paper proposes the multi-representation index tree clustering algorithm (MRITC) and the multi-representation index tree incremental clustering algorithm (MRITIC). The experimental results show that the algorithms achieve higher precision and a higher detection rate for new events. The work in this paper is as follows:

1. Combining the dynamic index tree clustering algorithm with multi-representation theory, the Multi-Representation Indexing Tree Clustering (MRITC) algorithm is proposed. The clustering result is presented as a multi-way tree in which leaf nodes represent documents and non-leaf nodes represent clusters. For each new document, the algorithm finds the most similar leaf node in the tree and tries to insert a node for the new document along the path from the root down to that leaf until the best-suited position is found. Each node selects k nodes that can represent its shape, which prevents clusters from becoming scattered and reduces sensitivity to the order of the input data. The experimental results show that the algorithm achieves higher accuracy and better clustering results than the original algorithm.

2. Based on the MRITC algorithm of Chapter 3, the Multi-Representation Index Tree Incremental Clustering (MRITIC) algorithm is proposed. First, the new set of documents is clustered into a new tree using MRITC. Then the nodes of the original clustering tree, except the root node, are merged into the new tree. Finally, the outliers are re-inserted into the index tree using MRITC. During the merge, clusters and documents that have already been classified are re-classified according to the similarity between clusters and the similarity between documents and clusters. The experimental results show that the algorithm achieves high accuracy in new event detection.

3. A generic clustering framework is designed and implemented. It does not depend on the type of data set to be clustered, and it generates a unified model of the clustering results, which gives the framework good scalability and practicality. To build the vector space model of text features, the framework first generates index files for the documents with Lucene, then reads the terms, documents, and term frequency information from the index files and builds the feature vector space model for each text. Because Lucene is not very effective at Chinese word segmentation, the framework uses the Tianjin Hailiang library for that step.

4. A prototype system for monitoring network consensus is designed and implemented, running on .NET 4.0. The system integrates the data acquisition module, the text preprocessing module, the Web data mining modules, results presentation, and other functions.
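
The abstract describes two ideas: clustering only newly arrived documents, and letting a few representatives stand for each cluster. Both can be illustrated with a much simpler flat scheme. The sketch below is not MRITC or MRITIC: it builds no index tree and performs no tree merge. It assumes whitespace tokenization, raw term-frequency vectors, cosine similarity, the first k documents of a cluster as its representatives, and a similarity threshold of 0.3. All names (`tf_vector`, `Cluster`, `cluster_incrementally`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Term-frequency vector of a lower-cased, whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Cluster:
    """A cluster summarized by at most k representative documents."""

    def __init__(self, k):
        self.k = k
        self.members = []          # document ids assigned to this cluster
        self.representatives = []  # term-frequency vectors (assumption: first k)

    def similarity(self, vec):
        # Compare against every representative and keep the best match.
        return max(cosine(vec, rep) for rep in self.representatives)

    def add(self, doc_id, vec):
        self.members.append(doc_id)
        if len(self.representatives) < self.k:
            self.representatives.append(vec)


def cluster_incrementally(clusters, docs, k=3, threshold=0.3):
    """Assign each new (doc_id, text) pair without re-clustering old data."""
    for doc_id, text in docs:
        vec = tf_vector(text)
        best = max(clusters, key=lambda c: c.similarity(vec), default=None)
        if best is None or best.similarity(vec) < threshold:
            best = Cluster(k)  # nothing is close enough: start a new topic
            clusters.append(best)
        best.add(doc_id, vec)
    return clusters
```

Calling `cluster_incrementally([], first_batch)` and later `cluster_incrementally(clusters, next_batch)` processes only the new batch each time, which is the time saving the abstract attributes to incremental clustering. Keeping several representatives instead of a single centroid is the part loosely inspired by the multi-representation idea; how the thesis actually selects its k nodes is not specified in the abstract.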
Keywords/Search Tags: Network consensus, Incremental Clustering, Dynamic Indexing, Multi-Representation Indexing Tree (MRIT), Multi-Representation Indexing Tree Incremental Clustering (MRITIC)