Web News Extraction Via Tag Path Features Series

Posted on: 2015-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
Full Text: PDF
GTID: 2308330473459343
Subject: Computer software and theory
Abstract/Summary:
Exposure to Web news has grown with the development of Internet technology, the popularity of mobile devices, and the rise of applications such as microblogging and WeChat. Reading Web news in fragmented time has become one of the main activities of Web users. However, in addition to the main content, Web news pages also contain a great deal of "noise", which increases the computation and storage costs of Web applications such as news aggregation and information retrieval, and degrades the reading experience on small-screen devices such as mobile phones and tablets. Therefore, Web news content extraction has both research and application value.

To address the problem of accurately extracting Web news in an open environment, extensive case studies were carried out; they indicate a potential relevance between Web content layouts and their tag paths. Motivated by this observation, this thesis explores Web news extraction based on tag path features. The main contributions of this thesis are as follows:

(1) Based on the potential relevance between Web content layouts and their tag paths, an extraction feature based on tag paths, the Text to tag Path Ratio (TPR), is designed. After a detailed analysis of the deficiencies of the TPR feature, it is extended to the Extended Text to tag Path Ratio (ETPR) feature. A Gaussian smoothing method weighted by tag path edit distances is designed so that short text is also extracted accurately. Extraction results on the CleanEval datasets show that the resulting method, CEPR, is an unsupervised, generic, and efficient Web news content extraction method.

(2) To improve the diversity of tag path features, a tag path feature series is proposed by analyzing the relevance between Web content and tag paths from different perspectives. After the advantages and disadvantages of each tag path feature are analyzed and verified, all tag path features are combined into a comprehensive feature by means of Dempster-Shafer (DS) theory, and a Web news extraction method based on this comprehensive feature, CEPC, is designed. Experimental results on real datasets demonstrate that CEPC outperforms every individual tag path feature and that its average performance is better than that of CEPR.

(3) To reduce the redundancy among the tag path features used by the CEPC method, a correlation measure between tag path features based on the Pearson correlation coefficient is studied, and on this basis a group feature selection strategy is designed. Experimental results show that CEPF, a Web news extraction method based on group feature selection, achieves a better average performance (92.75%) than other existing methods such as CEPC.

(4) A Web news extraction system based on tag path features is designed and implemented. All the extraction methods and tag path features studied in the thesis are integrated into the system, and its implementation and user interface are described.
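To make the idea behind contribution (1) more concrete, the following is a minimal Python sketch of a text-to-tag-path ratio. The abstract does not give the formula, so the definition used here is an assumption: for each tag path, the total number of text characters found under that path divided by the number of text nodes sharing it. The names `TagPathCollector`, `text_to_tag_path_ratio`, and `SAMPLE` are hypothetical, and the sketch omits the ETPR extension and the Gaussian smoothing step.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Text inside these tags is code or styling, not page content.
NON_CONTENT_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Collect a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.nodes = []   # list of (tag_path, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not NON_CONTENT_TAGS & set(self.stack):
            self.nodes.append((".".join(self.stack), text))


def text_to_tag_path_ratio(html):
    """Return {tag_path: characters of text / number of text nodes}.

    Assumed definition, for illustration only.
    """
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    chars = defaultdict(int)
    count = defaultdict(int)
    for path, text in collector.nodes:
        chars[path] += len(text)
        count[path] += 1
    return {path: chars[path] / count[path] for path in chars}


SAMPLE = (
    "<html><body>"
    "<div><p>First long paragraph of the news story.</p>"
    "<p>Second long paragraph of the story.</p></div>"
    "<ul><li><a>Home</a></li><li><a>Sports</a></li></ul>"
    "</body></html>"
)

if __name__ == "__main__":
    for path, ratio in sorted(text_to_tag_path_ratio(SAMPLE).items()):
        print(path, ratio)
```

Under this assumed definition, paths that carry article paragraphs tend to receive a much higher ratio than paths that carry short navigation links, which is the kind of layout signal the abstract attributes to tag paths.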
Keywords/Search Tags:Information Extraction, Web News, Tag Path Feature, DS Theory, Group Features Selection