Extracting Web News Using Tag Path Features

Posted on: 2013-03-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Q Wu
GTID: 1268330398975895
Subject: Computer application technology
Abstract/Summary:
Web news extraction plays an important role in intelligent Web information processing. It lays a foundation for research and development in information acquisition, information security, Internet sentiment monitoring, personalized recommendation for mobile users, integration of heterogeneous Web data sources, information retrieval, and search engines. The key issues of Web news extraction therefore have both research and practical value.
Many Web news sites have similar structures and layout styles. Our extensive case studies indicate a potential correlation between Web content layouts and tag path patterns on parse trees. Traditional path expressions are too rigid to adapt to slight changes in HTML structure, which reduces the accuracy of information extraction. In addition, massive and heterogeneous Web news data poses a challenge to wrappers that are built by hand or by rule-based learning. Motivated by these observations, this dissertation explores Web news extraction using tag path features. The research has two components. For specific websites, we focus on highly accurate Web news extraction based on tag path patterns. For an open environment, we put forward a generic Web news extraction model using tag path features.
The main contributions of this dissertation are as follows:
(1) Based on the potential correlation between Web content layouts and tag path patterns on parse trees, we propose a novel Web news extraction model, PP-WNE, which uses tag path patterns as its extraction knowledge. Within this model we define the distinguishing tag path pattern, a special tag path pattern suited to Web news extraction, and design a mining method for such patterns to construct the extraction knowledge base. Experimental results show that the method achieves an F-score of more than 98% on real-world datasets randomly selected from Chinese and English Web news sites. These results validate the feasibility and effectiveness of Web news extraction using tag path patterns.
(2) To reduce the scale of the knowledge base in PP-WNE, we formulate the distinguishing tag-path-pattern covering problem and prove that it is NP-complete. To obtain a near-optimal solution, we define the minimal distinguishing tag path pattern, a special distinguishing tag path pattern, and design MPM, a polynomial-time (ln|n|+1)-approximation algorithm, where n is the scale of the positive samples. Experimental results show that MPM reduces the number of distinguishing tag path patterns while keeping precision, recall, and F-score all above 98% on real-world datasets under both node-level and text-level evaluation criteria.
(3) To meet the requirements of Web news extraction in an open environment, we design the TTPR (Text to Tag Path Ratio) feature and describe how to compute it by traversing the parse tree of a Web page. We then design CEPR, a threshold method that distinguishes content from non-content using the histogram of TTPR and solves the on-line Web news extraction problem effectively. Combined with a Gaussian smoothing method weighted by tag path edit distances, CEPR becomes significantly better at extracting short text. CEPR is fast, general, and requires no training. It can extract news from Web pages across multiple sources, styles, and languages. Experimental results on the CleanEval datasets show that CEPR outperforms CETR and other state-of-the-art extraction methods in most cases.
(4) An HTML Web News Filtering and Summarization system (NFaS) is designed and implemented. In this system, a Web page identification method based on URL features, structural features, and content features is proposed, which effectively solves the automatic identification of Web news. Web news extraction is then used to accomplish Web news filtering. Finally, lexical chains are used to represent semantic relations and to summarize Web news by extracting high-quality keywords. The effectiveness of NFaS has also been evaluated on real-world datasets.
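The abstract does not give the exact definition of the text to tag path ratio, so the following is only a minimal sketch under a stated assumption: for each tag path (the chain of tag names from the root to a text node), the ratio is taken to be the total number of text characters reached through that path divided by the number of text nodes sharing it. The names `TagPathCollector` and `text_to_tag_path_ratio` are hypothetical and do not come from the dissertation; the thresholding and Gaussian smoothing steps of CEPR are not reproduced here.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Text inside these tags is code or styling, not page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Collect (tag_path, text) pairs while walking an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.nodes = []   # (tag_path, text) for every non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP_TAGS.intersection(self.stack):
            self.nodes.append((".".join(self.stack), text))


def text_to_tag_path_ratio(html):
    """Return {tag_path: characters of text / number of text nodes}.

    Assumed definition for illustration only, not the dissertation's formula.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    chars = defaultdict(int)
    count = defaultdict(int)
    for path, text in collector.nodes:
        chars[path] += len(text)
        count[path] += 1
    return {path: chars[path] / count[path] for path in chars}


page = ("<html><body><div><p>Hello world</p><p>Bye</p></div>"
        "<ul><li><a>Home</a></li></ul></body></html>")
for path, ratio in text_to_tag_path_ratio(page).items():
    print(path, ratio)
```

Under this assumed definition, paths that lead to long runs of article text tend to receive higher ratios than paths that lead to short navigation links, which is the kind of signal a histogram-based threshold could then separate.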
Keywords/Search Tags: Information Extraction, Web News, Distinguishing Path Pattern Mining, Tag Path Feature, NP-Complete Problem, On-Line Extraction