News Topic Mining Based On Web Crawler And LDA

Posted on: 2019-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: M Y Cao
GTID: 2428330566965492
Subject: Master of Engineering - Software Engineering
Abstract/Summary:
With the rapid development and popularization of the Internet, online media has overtaken traditional media and become the main channel through which the public obtains information. In particular, many news websites publish reports on current events in real time, so news spreads quickly and widely, and people can get the latest news online. However, news webpages are scattered across different sites on the World Wide Web, cannot be organized and managed uniformly, and are presented to users as individual webpages about news events. As a result, Internet news is large in volume, comes from many sources, and has no uniform format, and people cannot effectively find valuable and interesting information in this mass of resources. It is therefore necessary to collect, clean, and analyse Internet news and to present it to users in a friendly way. The two most important steps of Internet data mining are data acquisition and data analysis. We studied topic mining for Internet news. The main work includes the following four parts:

1. We designed and implemented a distributed news crawler based on the Hadoop platform, making full use of the MapReduce framework's capacity for concurrent computation. The crawler crawls the webpages of several designated websites concurrently, extracts the news title, publication date, and content from each webpage, and stores them in HDFS in a uniform format, which simplifies the later topic mining.

2. Building on a study of the LDA model and Gibbs sampling, we implemented the basic framework of LDA. We then applied the LDA model to the documents and used Gibbs sampling to estimate the parameters of the multinomial distributions. We extracted the keywords of the news topics and their relevant news according to the probabilities of documents generating topics and of topics generating words.

3. We used the trained LDA model to infer the topic probabilities of unseen news and classified each item into its associated topic.

4. To verify the practicability of the above methods, we designed a news topic mining system that presents news topic information to users through web pages, so that users can easily obtain topic information and query topic-related news.

The experiments show that the method can detect hot topics and their related news.
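
Parts 2 and 3 can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA, followed by topic inference for an unseen document. This is not the thesis implementation: the function names `train_lda` and `infer_topics`, the hyperparameter values, and the representation of documents as lists of integer word IDs are all assumptions made for illustration, and the inference step is one common "fold-in" variant that resamples topics while holding the trained topic-word distributions fixed.

```python
import random


def train_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word IDs in [0, vocab_size).
    Returns (phi, theta): topic-word and document-topic probability estimates.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # n(k, w)
    topic_total = [0] * num_topics                               # n(k)
    assign = []                                                  # topic of each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional p(z = t | all other assignments), unnormalized.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates of the multinomial parameters from the final counts.
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return phi, theta


def infer_topics(doc, phi, alpha=0.1, iters=50, seed=0):
    """Estimate the topic distribution of an unseen document, keeping phi fixed."""
    rng = random.Random(seed)
    num_topics = len(phi)
    counts = [0] * num_topics
    zs = []
    for _ in doc:
        k = rng.randrange(num_topics)
        zs.append(k)
        counts[k] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[zs[i]] -= 1
            weights = [(counts[t] + alpha) * phi[t][w] for t in range(num_topics)]
            zs[i] = rng.choices(range(num_topics), weights=weights)[0]
            counts[zs[i]] += 1
    return [(counts[k] + alpha) / (len(doc) + num_topics * alpha)
            for k in range(num_topics)]


if __name__ == "__main__":
    # Toy corpus (hypothetical): word IDs 0-1 and 2-3 tend to co-occur.
    corpus = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1], [3, 2, 2]]
    phi, theta = train_lda(corpus, num_topics=2, vocab_size=4)
    unseen = infer_topics([2, 3, 3], phi)
    print("most probable topic for unseen document:", unseen.index(max(unseen)))
```

In this sketch, the keywords of a topic would be the word IDs with the largest values in the corresponding row of `phi`, and an unseen news item would be assigned to the topic with the largest inferred probability. A real pipeline would also need tokenization and a vocabulary mapping from words to IDs, which are omitted here.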
Keywords/Search Tags: Topic Mining, LDA Model, Gibbs Sampling, Web Crawler, Hadoop