
Research And Application Of Internet Hot Topic Detection Based On Big Data Background

Posted on: 2017-07-22
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Zhao
GTID: 2348330503968504
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet media technology, it has become ever easier to get information online from news sites, blogs, and microblogs, which makes it far more convenient to follow what is happening around us. News sources have shifted from newspapers, magazines, television, and radio to online news portals, blog sites, and microblogging platforms, and the way people get news has changed from listening to broadcasts and watching TV news at fixed times each day to instant subscription, anytime and anywhere. However, with hundreds of news and blog sites producing tens of thousands of items every day, readers spend more and more time working through this mass of information, and for site editors, collecting and editing these reports to screen them in real time and dig out hot topics has become an impossible task. At the same time, since the start of the Web 2.0 era, as the scale of Internet information has grown explosively, traditional technologies for processing and storing the massive unstructured text data that the Internet produces every day have found it increasingly difficult to meet the performance requirements of practical applications. Therefore, in this “Big Data” era, it is of great significance to design and implement a complete solution that can handle real-time detection and automatic discovery and tracking of the many kinds of Internet hot topics that arise each day.

Based on the above analysis, and combining Internet hot topics with “Big Data”, this thesis completes the following work:

Firstly, we design and implement a web crawler that incrementally fetches newly published information from news portals, blog sites, and microblogging platforms in real time. In this work, we propose a main text extraction algorithm that detects the main text of an Internet news article or blog post on a web page. In addition, to cope with the growing volume of data, we propose a design based on Java NIO that extends our crawler to distributed nodes.

Secondly, we propose a text representation model for reports and hot topics based on key terms and named entities. We also propose a clustering algorithm based on the naive Bayes classifier and a 3 Layer Single-Pass clustering algorithm to perform real-time topic detection and tracking in different fields. Meanwhile, because the population and content of a topic change over time, we develop a method to track these changes.

Thirdly, using the Hadoop cloud platform, we develop a program for batch processing and storing vast amounts of unstructured data, based on the parallel computing framework MapReduce and the distributed NoSQL database HBase, to dig out the daily hot topics in the “Big Data” era.

Last of all, we complete a web site platform that monitors Internet topics and displays information on the newest breaking topics, using visualization tools to present a topic's statistics from different dimensions.
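
For readers unfamiliar with Single-Pass clustering, which the abstract names as the basis of its topic detection step, the following is a minimal sketch of the generic, single-layer form of that technique. It is not the thesis's 3 Layer variant and does not include the naive Bayes stage, neither of which is specified in the abstract. The bag-of-words representation, the cosine similarity measure, the 0.3 threshold, and the names `cosine` and `single_pass` are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.3):
    """Generic single-pass clustering (illustrative, not the thesis's method).

    docs: iterable of token lists, processed once in arrival order.
    Each report joins the most similar existing topic if the similarity
    reaches the threshold; otherwise it opens a new topic.
    """
    topics = []
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # fold the report's terms into the topic
            best["members"].append(i)
        else:
            topics.append({"centroid": Counter(vec), "members": [i]})
    return topics


# Hypothetical toy reports, already tokenized.
reports = [
    ["earthquake", "rescue", "city"],
    ["earthquake", "rescue", "team"],
    ["football", "final", "goal"],
]
topics = single_pass(reports)
print([t["members"] for t in topics])  # [[0, 1], [2]]
```

Because each report is compared only against the topics seen so far, the method suits a stream of incoming news, at the cost of results that depend on arrival order and on the chosen threshold.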
Keywords/Search Tags: Web crawler, Naïve Bayes Classifier, Hot Topic Detection and Tracking, MapReduce, Hadoop Cloud Platform