A Web News Extraction Method Based On Filtering Noise Wrapper

Posted on:2018-05-01

Degree:Master

Type:Thesis

Country:China

Candidate:M Sun

Full Text:PDF

GTID:2348330512980204

Subject:Computer Science and Technology

Abstract/Summary:

The explosive growth of online news has produced a large volume of news resources on the Internet. Because these resources are heterogeneous and lack a unified standard, they cannot be processed with traditional database technology, so they can only be used through search and browsing. Large collections of Web news pages are also the basis for research on public opinion monitoring, topic updates, and related problems. Extraction based on traditional regular expressions adapts poorly to changes in HTML page structure, which causes a sharp drop in accuracy. Research on Web news extraction technology therefore has significant practical value. Building on experimental results of Web news extraction with the ACME algorithm, this thesis carries out the following work:

(1) A Web news extraction method based on a noise-filtering wrapper is proposed. When the wrapper is induced by aligning labels first and two strings do not match, the string tag path ratio (STPR) of the strings is calculated and compared against the threshold α to distinguish clean news content from noise. As a result, a good denoising effect is achieved when the news content is extracted with the UFRE expression. On a data set consisting of a large number of real Web news pages and the CleanEval data sets, a comparison of STPR with the RoadRunner extraction technique and the NFaS system shows that the STPR method overcomes shortcomings in robustness and portability while also filtering noise. Its average accuracy for news text extraction is 95.9%, higher than that of the other extraction techniques.

(2) To keep the extracted news complete, an algorithm for extracting Web news headlines and times based on a naive Bayes classifier is proposed. A corpus-based data set is built, a variety of title and time features are extracted from it, and these features are combined with the naive Bayes calculation to extract headlines and times. On a large number of real Web news pages, in a comparison with the traditional regular expression extraction method, the average extraction accuracy is 93.06% and the lowest extraction accuracy is about 86.80%. This verifies that the method overcomes the weakness of regular expressions, whose accuracy drops easily when the page structure changes, and demonstrates the generality and effectiveness of the algorithm for extracting Web news headlines and times.

(3) A prototype system for Web news target extraction that integrates the two algorithms is designed. The system is divided into five functional modules. The thesis describes the operating principle of each module and the user interface, and completes the development of the prototype system.
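
As a rough illustration of the naive Bayes step in (2), the sketch below classifies candidate text nodes as "title" or "other" from a few categorical features. It is not the thesis's implementation: the feature names (`tag`, `length`, `in_page_title`), the labels, and the tiny training set are all hypothetical, chosen only to show how per-feature likelihoods and a class prior combine under the naive independence assumption with add-one smoothing.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()
        # feature_counts[label][feature_name][value] -> number of occurrences
        self.feature_counts = defaultdict(lambda: defaultdict(Counter))
        # feature_values[feature_name] -> set of values seen in training
        self.feature_values = defaultdict(set)

    def fit(self, samples):
        """samples: iterable of (feature_dict, label) pairs."""
        for features, label in samples:
            self.class_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[label][name][value] += 1
                self.feature_values[name].add(value)

    def log_scores(self, features):
        """Return log prior plus summed log likelihoods for each class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for name, value in features.items():
                seen = self.feature_counts[label][name][value]
                n_values = len(self.feature_values[name])
                score += math.log((seen + 1) / (count + n_values))
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


# Hypothetical hand-labeled candidates; a real corpus would be far larger.
TRAINING = [
    ({"tag": "h1", "length": "short", "in_page_title": True}, "title"),
    ({"tag": "h2", "length": "short", "in_page_title": True}, "title"),
    ({"tag": "h1", "length": "short", "in_page_title": False}, "title"),
    ({"tag": "p", "length": "long", "in_page_title": False}, "other"),
    ({"tag": "p", "length": "long", "in_page_title": False}, "other"),
    ({"tag": "span", "length": "short", "in_page_title": False}, "other"),
    ({"tag": "a", "length": "short", "in_page_title": False}, "other"),
]

if __name__ == "__main__":
    model = NaiveBayes()
    model.fit(TRAINING)
    candidate = {"tag": "h1", "length": "short", "in_page_title": True}
    print(model.predict(candidate))
```

Because the features describe the content and role of a node rather than its exact position in the markup, a classifier of this kind does not depend on one fixed HTML pattern, which is the weakness of regular expressions that the abstract points to.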

Keywords/Search Tags:

Information Extraction, Wrapper, Web news pages, ACME algorithm, STPR, Naive Bayes classifier

Related items

1	The Ensembling Chinese Web Pages Classifier Based On Bayes And Outlinks
2	Design And Implementation Of Multi-classifier Based On Information Classification System
3	Prediction Of Protein Contact Map Based On Weighted Naive Bayes Classifier And Extreme Random Tree
4	The Research Of Multi-layer Hidden Naive Bayes Algorithm Based On Mutual Information
5	Research On Optimization Of Routing Algorithm Based On Semi-Naive Bayes
6	Image Annotation Based On Ensemble Of Naive Bayes Classifier
7	Classification System Based On The Theme Of Information Acquisition In The Pages
8	Research On The Design Of Naive Bayes Classifier Based On Memristor
9	A Text Classifier About High Blood Pressure Based On Naive Bayes
10	Studies On Classifiers Based On Decision Boundaries From The Perspective Of Dividing Data Space