BBS Network Hot Topic Discovery

Posted on: 2015-03-09
Degree: Master
Type: Thesis
Country: China
Candidate: G D Ma
Full Text: PDF
GTID: 2268330428481042
Subject: Education Technology
Abstract/Summary:
With the rapid development of Internet information technology, Internet resources are becoming increasingly complex, and large amounts of data are not being fully utilized. Because Internet resources are built with HTML, text detection techniques can be used to access network information resources efficiently. Text clustering algorithms, an important part of text detection technology, are an active research topic.

In this thesis, the data sets come from the BBS of "China Web". I studied two classic text clustering algorithms, Single-Pass and K-means, analyzed their existing deficiencies, and improved them. Finally, experimental results show that the two improved algorithms are reliable. The main work is as follows:

1. Describing the acquisition process of the BBS text data (tree structure and table structure) and introducing the process of selecting feature items, including data cleaning and text representation.
2. Introducing the characteristic of the classic Single-Pass clustering algorithm that it "sets a unique cluster centroid". I improved the algorithm so that "the cluster centroid is not unique", which reduces its time complexity. I also addressed the algorithm's defect that its cluster centers are strongly random. Finally, experimental results show that the improved algorithm is reliable.
3. Introducing the characteristic of the classic K-means clustering algorithm that "cluster centers are difficult to determine". I improved the algorithm by "selecting the optimal clustering centroid", so that the improved algorithm obtains reliable cluster centroids. Finally, experimental results show that the improved algorithm is reliable.
4. Setting a "recovery class" in Single-Pass and K-means to store BBS texts that are judged useless. After analyzing the "recovery class", I showed that the possibility of "other topic posts" evolving into "hot posts" in the future is important.
Keywords/Search Tags: web data detection, topic discovery, Single-Pass, K-means
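
For orientation, the classic Single-Pass algorithm that the abstract takes as its starting point can be sketched roughly as follows. This is a minimal illustration of the textbook version with one centroid per cluster, not the improved algorithm developed in the thesis. The whitespace tokenization, the cosine similarity measure, the `threshold` value, and the function names `cosine` and `single_pass` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.3):
    """Classic Single-Pass clustering: each post is read exactly once.

    A post joins the most similar existing cluster if the similarity
    reaches `threshold` (an assumed value); otherwise it starts a new
    cluster. Each cluster keeps a single centroid.
    """
    clusters = []  # each cluster: {"centroid": Counter, "members": [post indices]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())  # assumed: naive whitespace tokens
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            # Summed term counts act as the centroid; cosine ignores scale.
            best["centroid"].update(vec)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


posts = [
    "campus network is down again",
    "network down on campus today",
    "selling used bicycle cheap",
    "cheap used bicycle for sale",
]
for cluster in single_pass(posts):
    print(cluster["members"])
```

With these four made-up posts, the first two fall into one cluster and the last two into another. Because each post is compared only against the clusters that exist when it arrives, the result depends on input order, which is one form of the randomness in cluster centers that the abstract mentions.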