Research On Hot Topic Discovery And Evaluation Method Based On MapReduce

Posted on: 2015-09-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z Tan
Full Text: PDF
GTID: 2348330509960627
Subject: Management Science and Engineering
Abstract/Summary:
With the arrival of the information age, the volume of information on the Internet is growing explosively, and people demand a higher standard of information processing. Processing more information, faster and more accurately, makes it possible to seize opportunities and to extract more valuable information, which in turn yields greater economic benefit. Topic discovery and evolution, one of the main applications of incremental text clustering, is expected to be widely used. In the context of big data, however, existing topic discovery and evolution models suffer from problems such as low processing efficiency and poor results. These problems severely limit real-time topic discovery and leave the quality of information processing far below what big data requires.

To begin with, we introduce the key technical steps of hot topic discovery and analyze the advantages and disadvantages of existing topic discovery and evolution algorithms. The incremental text clustering algorithm Single-Pass is chosen to discover hot topics. To address the exponential growth in the running time of Single-Pass as the number of texts increases, we propose a MapReduce-based Single-Pass algorithm and demonstrate the feasibility of a distributed Single-Pass algorithm. The experimental results show a considerable improvement in efficiency. Meanwhile, because the simple distributed Single-Pass cannot describe the inner structure and the evaluation process of topics in detail, we propose a hierarchical distributed Single-Pass that improves the ability to describe topics. The experimental results demonstrate an improvement in the accuracy of the algorithm.

Then, we analyze the characteristics of existing web page text and, on that basis, propose a method to calculate the heat of topics. By introducing the concepts of the attenuation index and the time slice, we propose a dynamic evaluation method that combines incremental text clustering with the attenuation index. The total heat of a topic is the sum of its heat in the different time slices, each calculated by the topic heat calculation equation. Introducing a topic-heat threshold and a keyword-heat threshold improves the feasibility and efficiency of the algorithm, which is validated by experiments.

Finally, the efficiency of the MapReduce-based topic discovery and evaluation method is tested on real cases. The experiments show that the proposed algorithm improves the efficiency and accuracy of clustering, and that the topic evolution algorithm combining incremental text clustering and the attenuation index is able to describe the evolutionary process in detail. As a result, the MapReduce-based topic discovery and evaluation analysis method has considerable research value.
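
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal single-machine sketch in Python of a textbook Single-Pass clustering loop and of a total heat computed as a sum over time slices. It is not the thesis's implementation. The abstract does not give the similarity measure, the similarity threshold, the topic heat calculation equation, or the form of the attenuation index, so the cosine similarity, the `threshold` parameter, the exponential weight `exp(-decay * age)`, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.5):
    """Assign each document, in arrival order, to its most similar topic,
    or open a new topic when no similarity reaches the threshold.
    Whitespace tokenization and the threshold value are assumptions."""
    topics = []  # each topic: {"centroid": Counter, "members": [doc indices]}
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # centroid kept as summed term counts
        else:
            topics.append({"centroid": vec, "members": [i]})
    return topics


def total_heat(slice_heats, decay=0.5):
    """Sum per-slice heat values, discounting older time slices.
    slice_heats is ordered oldest to newest; the exponential weight is an
    assumed stand-in for the attenuation index, not the thesis's formula."""
    newest = len(slice_heats) - 1
    return sum(h * math.exp(-decay * (newest - k))
               for k, h in enumerate(slice_heats))


if __name__ == "__main__":
    docs = [
        "mapreduce cluster job",
        "mapreduce cluster node",
        "football match result",
    ]
    for topic in single_pass(docs):
        print(topic["members"])
    print(total_heat([4.0, 2.0], decay=math.log(2)))
```

Each incoming document is compared against every existing topic, which is the per-document cost that motivates distributing the work. How the thesis partitions this loop across map and reduce tasks is not stated in the abstract, so the sketch deliberately stays on one machine.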
Keywords/Search Tags: Big Data, MapReduce, Topic Discovery, Topic Evaluation, Distributed Single-Pass Algorithm