The Design And Implementation Of Internet News Reading System Based On Hadoop

Posted on: 2018-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: L M Ma
Full Text: PDF
GTID: 2348330536987942
Subject: Software engineering
Abstract/Summary:
With the explosive growth of news information in recent years, reading habits have changed considerably. Instead of traditional media such as newspapers and TV, people prefer reading on the internet. However, the body content of a web page is often buried among presentation elements. Gibson et al. estimated that layout and presentation elements constitute 40% to 60% of all internet content. What's worse, there is a large number of duplicated web pages, most of which result from reprinting. For Chinese websites, experimental results from the "Tianwang" search engine of Peking University show that, among 430 million Chinese web pages, only 68 million are not duplicated. Extracting unduplicated main text free of presentation elements is therefore particularly important for effective reading.

Taking mainstream domestic news websites as the study subject, this thesis implements automatic extraction of unduplicated main text and designs a news page reading system based on Hadoop. The main contributions are as follows:

1) To address the interference of noise information in news web pages, a novel FW-DTSS based approach to main text extraction is proposed. Comparative experiments on dozens of news websites show that the F-score of the FW-DTSS based approach was higher than those of the VIP and WPMTE approaches in most cases. Its F-score stays above 96%, with an average of over 99%, and reaches 100% on some web pages.

2) A novel FW-BF based approach to eliminating duplicated web pages is proposed. In comparative experiments on a URL set, the average F-score of the FW-BF based approach across completely duplicated, partially duplicated and entirely different web pages was over 99%. The F-scores of the FW-BF, Bloom Filter and Feature code based approaches were similar to each other, but the FW-BF based approach had the shortest running time: the three algorithms took 44 s, 56 s and 212 s respectively.

3) The FW-DTSS and FW-BF approaches are combined, and a timely internet news reading system based on Hadoop is designed. Due to the limited amount of daily news, only ten typical mainstream news web pages were selected for the system. The reading system lets users subscribe to one or more web pages of their choice. It automatically extracts main text and eliminates duplicated web pages from mainstream websites, and finally presents users with clear headlines and body text.
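The abstract does not describe how FW-BF derives page features from function words, so the following is only a minimal sketch of the plain Bloom Filter baseline that it is compared against. The class `BloomFilter`, the helpers `fingerprint` and `deduplicate`, and the bit-array size and hash count are all hypothetical choices made for illustration, not the thesis's settings.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter; the sizes are illustrative, not taken from the thesis."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit slices of SHA-256.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def fingerprint(text):
    # Hypothetical stand-in for a page feature: whitespace-normalised body text.
    return " ".join(text.split())


def deduplicate(pages):
    """Return the pages whose fingerprint has not been seen before, in input order."""
    seen = BloomFilter()
    unique = []
    for page in pages:
        key = fingerprint(page)
        if key in seen:
            continue  # probably a duplicate; Bloom filters allow false positives
        seen.add(key)
        unique.append(page)
    return unique
```

With this stand-in fingerprint, only pages that are identical after whitespace normalisation are dropped. Detecting partial repetition, as the thesis evaluates, would need a feature that tolerates small differences between pages.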
Keywords/Search Tags:Hadoop, Web page extraction, Web page elimination, Function words, Bloom Filter