
A Web News Extraction Method Based On Filtering Noise Wrapper

Posted on: 2018-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: M Sun
Full Text: PDF
GTID: 2348330512980204
Subject: Computer Science and Technology
Abstract/Summary:
The explosive growth of news websites has led to a large accumulation of news resources on the Internet. Because these resources are heterogeneous and lack a unified standard, they cannot be processed with traditional database technology, so they can only be used through search and browsing. In addition, the massive volume of Web news pages is the basis for research on public opinion monitoring, topic updates, and related areas. Extracting Web news with traditional regular expressions adapts poorly to changes in HTML page structure, which causes a sharp decline in accuracy. Research on Web news extraction technology therefore has important practical value. Building on the experimental results of Web news extraction with the ACME algorithm, this paper carries out the following work:

(1) A Web news extraction method based on a noise-filtering wrapper is proposed. When the wrapper is induced by aligning labels and two strings do not match each other, the string tag path ratio of the strings is calculated and compared against the threshold a to distinguish pure news content from noise. As a result, a good denoising effect is achieved when the news content is extracted with the UFRE expression. On a data set consisting of a large number of real Web news pages and on the CleanEval data sets, a comparison of SLPR with the RoadRunner extraction technique and the NFaS system shows that the SLPR method overcomes weaknesses in robustness and portability. At the same time, the method filters noise, and its average accuracy for news text extraction is 95.9%, higher than that of the other extraction techniques.

(2) To keep the extracted news complete, an algorithm for extracting Web news headlines and times based on a naive Bayes classifier is proposed. A dedicated data set is built from a corpus, a variety of feature elements of titles and times are extracted from it, and these are combined with the naive Bayes classification principle to extract headlines and times. On a large number of real Web news pages, the results are compared with the traditional regular expression extraction method: the average extraction rate is 93.06%, and the lowest extraction accuracy is about 86.80%. This verifies that the method overcomes the weakness of regular expressions, whose accuracy drops easily when the Web page structure changes, and demonstrates the generality and effectiveness of the algorithm for extracting Web news headlines and times.

(3) A prototype system for Web news target extraction is designed that integrates the two algorithms. The system is divided into five functional modules. The paper describes the operating principle of each module and the user interface, and the development of the prototype system is completed.
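As an illustration of the naive Bayes idea in item (2), the following is a minimal sketch, not the thesis's implementation. The abstract does not list the actual features or training corpus, so the four boolean features, the label names, and the toy training set below are all hypothetical. Each candidate text node is scored by a class prior multiplied by per-feature likelihoods, with Laplace smoothing, and assigned the label with the highest score.

```python
import math
from collections import defaultdict

# Hypothetical boolean features of a candidate text node in a news page.
# The thesis's real feature set is not given in the abstract.
FEATURES = ("in_heading_tag", "matches_page_title", "short_text", "has_date_pattern")


def node(in_heading, matches_title, short, has_date):
    """Build a feature dict for one candidate node."""
    return dict(zip(FEATURES, (in_heading, matches_title, short, has_date)))


def train(samples):
    """Count labels and, per label, how often each feature is True."""
    label_counts = defaultdict(int)
    true_counts = defaultdict(int)  # (label, feature) -> number of True values
    for feats, label in samples:
        label_counts[label] += 1
        for name in FEATURES:
            if feats[name]:
                true_counts[(label, name)] += 1
    return dict(label_counts), dict(true_counts)


def log_posterior(model, feats, label):
    """Unnormalised log P(label | feats) under the naive independence assumption."""
    label_counts, true_counts = model
    n = label_counts[label]
    score = math.log(n / sum(label_counts.values()))  # class prior
    for name in FEATURES:
        # Laplace-smoothed estimate of P(feature is True | label).
        p_true = (true_counts.get((label, name), 0) + 1) / (n + 2)
        score += math.log(p_true if feats[name] else 1.0 - p_true)
    return score


def classify(model, feats):
    """Return the label with the highest posterior score."""
    return max(model[0], key=lambda label: log_posterior(model, feats, label))


# Toy, hand-made training data (illustrative only, not from the thesis).
TRAINING = [
    (node(True, True, True, False), "title"),
    (node(True, False, True, False), "title"),
    (node(False, True, True, False), "title"),
    (node(False, False, True, True), "time"),
    (node(False, False, True, True), "time"),
    (node(False, False, False, False), "other"),
    (node(False, False, False, False), "other"),
    (node(True, False, False, False), "other"),
]

model = train(TRAINING)
print(classify(model, node(True, True, True, False)))
print(classify(model, node(False, False, True, True)))
```

Because the classifier relies on learned feature statistics rather than a fixed pattern, a change in page markup affects only the individual feature values, which is consistent with the robustness claim made against regular expressions above.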
Keywords/Search Tags:Information Extraction, Wrapper, Web news pages, ACME algorithm, STPR, Naive Bayes classifier