Research And Implementation Of Web Topical Information Extraction Method With Semantic Consideration

Posted on: 2022-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: B C Fu
GTID: 2518306740491944
Subject: Computer technology
Abstract/Summary:
With the rapid development of its technology, the Internet has become the largest information carrier. However, most web pages contain not only valuable information but also information that has nothing to do with the topic, such as advertisements, navigation information, and copyright statements. This noise largely hinders the utilization of Internet information resources. Moreover, most traditional topical information extraction methods focus on distinguishing topical information from noise, but leave the extracted information without a structured organization, as a mixture of title, content, time, and author information. The granularity of the extracted data is therefore too coarse to be used efficiently, which motivates the task of fine-grained topical information extraction. In addition, traditional methods for extracting topical information depend heavily on the HTML style of web pages. As HTML standards are updated and HTML styles change, the extraction performance of traditional methods has declined to various degrees. To address this problem, this thesis introduces semantic features of natural language, which are more robust, to further improve the accuracy and robustness of topical information extraction. The main work is as follows:

(1) An HTML DOM node embedding model, BERT-HTML, is designed. It combines a BiGRU model with the multi-head self-attention mechanism to extract the HTML tag path feature, and uses a fine-tuned BERT model to extract the semantic feature of the DOM nodes' text. The model obtains macro-average F1 values of 0.9885 and 0.9818 on the English and Chinese test sets, respectively.

(2) A topical information extraction algorithm called WIESS is proposed. The core of the WIESS algorithm is the accurate extraction of text information. Semantic similarity features are additionally considered, and a machine learning model is then used to classify text paragraphs. A voting algorithm over the text nodes is used to obtain the text information. To extract the title, time, and author information, WIESS applies heuristic methods.

(3) Because WIESS and the traditional fine-grained topical information extraction algorithms perform poorly on time and author extraction, the extraction algorithm WIEBH is proposed. Unlike WIESS, which considers semantic similarity features only for the text, WIEBH considers more general semantic features with the help of the DOM node embedding model BERT-HTML. DOM nodes are then classified based on their vector representations, and the fine-grained topical information is extracted.

(4) The ablation experiments show that the semantic similarity and semantic features play an important role in topical information extraction. The comparison experiments show that the algorithms proposed in this thesis achieve better extraction performance and robustness than the traditional algorithms.
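The HTML tag path feature mentioned in item (1) can be illustrated with a small sketch. The abstract does not say how the thesis derives tag paths, so everything below is an assumption made for illustration: the class name `TagPathCollector`, the helper `text_nodes_with_paths`, and the choice to represent a path as tag names joined by `/` are all hypothetical. The sketch uses only Python's standard `html.parser` module and stops at producing (tag path, text) pairs; the BiGRU, self-attention, and BERT encoders are not shown.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node.

    Hypothetical helper: the thesis may define tag paths differently.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.nodes = []   # collected (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, discarding unclosed tags above it.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes_with_paths(html):
    """Return a list of (tag_path, text) pairs in document order."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


if __name__ == "__main__":
    page = ("<html><body><div><h1>Title</h1><p>Hello <b>world</b></p>"
            "<br><span>2022-07-05</span></div></body></html>")
    for path, text in text_nodes_with_paths(page):
        print(path, "->", text)
```

In a pipeline of the kind the abstract describes, each pair would presumably feed two encoders: the tag path into a sequence model and the text into a language model. That wiring is not specified here and is left out of the sketch.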
Keywords/Search Tags:information extraction, HTML, semantic feature, deep learning, BERT