Construct High Performance Text Clustering Systems Based On Map-Reduce

Posted on: 2012-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: J J Ceng
Full Text: PDF
GTID: 2178330338984190
Subject: Content security
Abstract/Summary:
According to the "2009 China Internet public opinion analysis report", 23 of the 77 influential social hot incidents of 2009 broke via the Internet and aroused widespread concern. Some of these events, such as the case of Deng Yujiao, the "fishing" law enforcement by the Shanghai traffic management department, and the Hangzhou drag racing event, had a very bad social influence. A lack of Internet supervision will inevitably lead to a flood of unhealthy and reactionary information and to adverse public opinion that misleads the public, so that the government loses credibility. This is a threat to social harmony and stability. Monitoring hot issues on the Internet can help national departments respond to them effectively, ease the pressure of public opinion, and enhance the credibility of the government. It therefore has important social value and practical significance.

Discovering and monitoring Internet hot spots is essential for enhancing the credibility of the government, and text clustering is widely used for this purpose. However, huge datasets are becoming more and more prevalent in the current Internet environment. Reports show that Google processes over 20 petabytes of data per day, and this figure is increasing rapidly. As a result, a high-performance, scalable distributed system is needed to deal with such datasets.

This thesis proposes applying Map-Reduce, a powerful distributed computing model, to text clustering, and constructs a distributed text clustering system based on Hadoop, an open-source implementation of Map-Reduce. Finally, a system test is used to tune the system's performance and clustering accuracy, and to verify that the system is much more scalable than a general text clustering system. The demonstration of its validity provides a new way to construct text clustering systems with high performance and scalability.
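
The abstract names Map-Reduce and (in the keywords) the K-means algorithm but gives no implementation details. As a rough illustration only, one K-means iteration can be phrased as a map step that assigns each document vector to its nearest centroid and a reduce step that averages the vectors in each cluster. The sketch below simulates this in plain Python rather than on Hadoop; the function names (`nearest`, `map_phase`, `reduce_phase`, `kmeans_iteration`), the squared Euclidean distance, and the dense list-of-floats document representation are all assumptions made for the example, not the thesis's actual design.

```python
import math
from collections import defaultdict


def nearest(vec, centroids):
    """Return the index of the centroid closest to vec (squared Euclidean distance)."""
    best, best_d = 0, math.inf
    for i, c in enumerate(centroids):
        d = sum((a - b) ** 2 for a, b in zip(vec, c))
        if d < best_d:
            best, best_d = i, d
    return best


def map_phase(docs, centroids):
    """Map: emit (cluster_id, (vector, 1)) for every document vector."""
    for vec in docs:
        yield nearest(vec, centroids), (vec, 1)


def reduce_phase(pairs, centroids):
    """Reduce: sum vectors and counts per cluster id, then average them."""
    sums = {}
    counts = defaultdict(int)
    for cid, (vec, n) in pairs:
        if cid in sums:
            sums[cid] = [a + b for a, b in zip(sums[cid], vec)]
        else:
            sums[cid] = list(vec)
        counts[cid] += n
    new_centroids = []
    for i, old in enumerate(centroids):
        if i in sums:
            new_centroids.append([s / counts[i] for s in sums[i]])
        else:
            # Assumption: a cluster that received no documents keeps its old centroid.
            new_centroids.append(list(old))
    return new_centroids


def kmeans_iteration(docs, centroids):
    """Run one map + reduce round and return the updated centroids."""
    return reduce_phase(map_phase(docs, centroids), centroids)
```

In a real Hadoop job the map and reduce functions would run on different machines and the loop over iterations would be driven by resubmitting the job with the updated centroids; here `kmeans_iteration` would simply be called repeatedly until the centroids stop changing.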
Keywords/Search Tags: Text clustering, Map-Reduce, Distributed computing, Chinese word segmentation, K-means algorithm