Research On The Online Extraction Method Of Web News Publication Time

Posted on: 2019-04-22
Degree: Master
Type: Thesis
Country: China
Candidate: L L Wang
GTID: 2428330548991221
Subject: Computer application technology
Abstract/Summary:
In Web search, the publication time of a Web page plays an important role, because returned results are generally time-based. It is also used to locate when a news event occurred and to track how the event evolves. However, Web news pages are multi-sourced, massive, and heterogeneous, so the formats of publication time vary. Furthermore, a news page usually contains other temporal information, such as times mentioned in the body text and in related recommendations. The extraction of Web news publication time therefore has research significance and application value. Case analysis shows that the distribution of publication time is potentially associated with the text nodes of the corresponding DOM parse tree and with the URL of the news page. Based on this observation, this dissertation studies the online extraction of Web news publication time using the URL of the news page and the text nodes of its DOM tree. The contents of the study are as follows:

(1) An online, rule-based method for extracting Web news publication time is designed around two clues. First, the URL of a Web news page generally contains temporal information, and that information is the publication time. Second, the publication time is the content of one of the text nodes in the DOM parse tree of the page's HTML document. To distinguish the temporal node from non-temporal nodes, a large number of Web news page instances are statistically analyzed, and the characteristics and differences found are used as rules and restrictive conditions for selecting the temporal node from all text nodes. To extract the temporal information from the URL and from the temporal node, corresponding regular expressions for Web news publication time are constructed. The experimental results show that this method is efficient for online extraction of Web news publication time.

(2) Because the temporal information extracted from the URL is accurate only to the date, an online publication time extraction method based on text node feature fusion is designed and implemented to improve extraction accuracy. The method targets text nodes only. The features of temporal and non-temporal nodes are analyzed in depth, and a text node feature series is constructed. Feature selection and feature fusion are then carried out to build a comprehensive feature with greater discriminative power, so that the temporal node can be extracted accurately from the text nodes. Finally, the publication time is extracted from the temporal node and output in a standardized format. The experimental results show that this method is accurate for online extraction of Web news publication time.
Keywords/Search Tags: Web News Publication Time, Regular Expression, Text Node Feature Series, Feature Selection, Feature Fusion
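
The rule-based idea in part (1) of the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the two regular expressions, the 40-character length limit on candidate nodes, the rule that the node date must agree with the URL date, and all function names are assumptions made for illustration only. The sketch reads a date from the URL, collects the text nodes of the HTML document, and returns the first short text node whose time expression agrees with the URL date, falling back to the date-only URL value when no such node is found.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

# Assumed pattern: a date embedded in a news URL, e.g. /2018/03/15/ or /20180315/.
URL_DATE = re.compile(
    r"(?<!\d)(20\d{2})[/\-_]?(0[1-9]|1[0-2])[/\-_]?(0[1-9]|[12]\d|3[01])(?!\d)"
)

# Assumed pattern: a date with an optional clock time inside a text node,
# e.g. "2018-03-15 09:30" or "2018年3月15日 09:30:05".
NODE_TIME = re.compile(
    r"(20\d{2})[-/.年](\d{1,2})[-/.月](\d{1,2})日?"
    r"(?:\s*(\d{1,2}):(\d{2})(?::(\d{2}))?)?"
)


class TextNodeCollector(HTMLParser):
    """Collects non-empty text nodes, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.nodes.append(text)


def date_from_url(url):
    """Return the date found in the URL (date precision only), or None."""
    m = URL_DATE.search(url)
    if not m:
        return None
    try:
        return datetime(int(m[1]), int(m[2]), int(m[3]))
    except ValueError:  # e.g. February 30
        return None


def time_from_nodes(html, url_date=None, max_len=40):
    """Return the time in the first text node that passes the assumed rules."""
    collector = TextNodeCollector()
    collector.feed(html)
    for text in collector.nodes:
        if len(text) > max_len:  # assumed rule: temporal nodes are short
            continue
        m = NODE_TIME.search(text)
        if not m:
            continue
        y, mo, d, h, mi, s = (int(g) if g else 0 for g in m.groups())
        try:
            found = datetime(y, mo, d, h, mi, s)
        except ValueError:
            continue
        # Assumed rule: the node must agree with the URL date when one exists.
        if url_date is None or found.date() == url_date.date():
            return found
    return None


def publication_time(url, html):
    """Prefer the finer-grained node time; fall back to the URL date."""
    url_date = date_from_url(url)
    return time_from_nodes(html, url_date) or url_date
```

For a hypothetical page at https://news.example.com/2018/03/15/story.html whose markup contains a short node reading "2018-03-15 09:30", `publication_time` would return that timestamp in preference to the date-only value from the URL, which reflects the abstract's remark that URL information is accurate only to the date.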