Web News Search System Based On UCL Knowledge Space

Posted on: 2022-05-09
Degree: Master
Type: Thesis
Country: China
Candidate: X C Chang
GTID: 2518306740983029
Subject: Computer technology
Abstract:
With the rapid development of the Internet, online channels have become the main way users get news. The surging volume of online news gives users abundant resources but also poses a greater challenge to news search services. Accurate analysis of news web pages is the prerequisite for organizing news at scale, and effective connections between news items are the bridge to high-quality news search. Existing search services, however, have three problems. First, news web pages contain rich event information and news-related elements that existing search engines cannot extract accurately, so users must pick out core words themselves to search for further information. Second, there is no uniform standard for organizing chaotic news effectively, which leaves news resources insufficiently connected. Third, the homogeneity of news content lowers the quality of information. How to help users retrieve accurate, refined results from massive web news through a systematic scheme has therefore become the key to improving their reading quality.

This thesis proposes using the Uniform Content Label (UCL) to aggregate fragmented news content effectively. News entities, inter-entity relationships, and the UCLs that index news are stored in a knowledge graph to construct the UCL Knowledge Space (UCLKS). Connections between UCLs are built through the association of entities in UCLKS. Meanwhile, UCL semantic information enriches the single triple structure of the basic knowledge graph, so that high-quality news search can be carried out. The specific work of the thesis is as follows:

(1) To address the problem of accurately picking out the useful information contained in massive web news, a scheme for extracting news elements from web pages is proposed. The first step uses content extraction based on text block feature fusion (CETDF) together with an improved strategy for generating extraction templates for same-origin web pages. CETDF addresses the low accuracy of existing algorithms when extracting text from complex pages, while the improved strategy reduces the long extraction time in massive-data scenarios. In addition, news event triples are extracted with an algorithm based on enhanced dependency syntax analysis, in which the syntax level is modified to address the incomplete extraction of event modifiers. Finally, the news is indexed using UCL. A method for calculating the semantic weight of entities in a UCL is also proposed, which strengthens the semantic connection between entities and the UCL.

(2) To address the lack of effective organization of news, the thesis proposes a UCL knowledge space construction method that integrates news elements. First, offline data extraction from Wikipedia and Baidu Encyclopedia is completed, and the basic knowledge base is constructed by integrating entities across these heterogeneous knowledge bases. Then, an entity disambiguation algorithm based on UCLKS (UCLKS-ED) is proposed. The algorithm improves disambiguation performance when context is lacking by using the entity concept information and associated context information in the knowledge space as supplementary knowledge for both the entity to be disambiguated and the candidate entities. Moreover, the formal definition of the nodes in UCLKS and a persistence scheme are given to complete the construction of UCLKS.

(3) In response to the low quality of news search, the thesis proposes a news search scheme based on UCLKS. First, the scheme introduces the UCL entity emphasis coefficient to further enrich the semantic representation of core entities in news. Next, a news matching method via co-occurrence entity interaction graph (CEIG-NM) is proposed, which splits the long text matching task into short text matching tasks performed on nodes formed by the same entities. A graph convolutional neural network aggregates the node features, addressing the poor similarity calculation for long texts. A sorting strategy based on the entity emphasis coefficient is also given, and the sorting results are further revised using news-related element information.

(4) Based on the methods above, a prototype news search system is established, and its performance is verified experimentally. The results show that CETDF obtains better extraction accuracy than other text extraction algorithms by introducing more text block features. UCLKS-ED achieves the best performance in the short text entity disambiguation task, and the ablation experiment confirms the role of the UCL knowledge space and the entity conceptualization module. Additionally, CEIG-NM achieves good matching results on the latest Chinese news matching dataset.
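As an illustration of the splitting step that item (3) describes for CEIG-NM, the following is a minimal Python sketch under stated assumptions, not the thesis's implementation. It groups the sentences of two news texts under the entities both texts mention, so that each node holds a short text pair, and it links two nodes when they share a sentence. The function names, the naive sentence splitter, the substring-based entity matching, and the edge rule are all hypothetical choices made for the example. The per-node short text matching and the graph convolutional aggregation are omitted.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break on Latin and Chinese sentence-ending marks.
    return [s.strip() for s in re.split(r"[.!?。！？]+", text) if s.strip()]


def build_cooccurrence_nodes(doc_a, doc_b, entities):
    """Return {entity: (sentences_from_a, sentences_from_b)} for every entity
    mentioned in both documents. Entity matching is a plain substring test,
    which is a simplification for illustration only."""
    sents_a = split_sentences(doc_a)
    sents_b = split_sentences(doc_b)
    nodes = {}
    for entity in entities:
        hits_a = [s for s in sents_a if entity in s]
        hits_b = [s for s in sents_b if entity in s]
        if hits_a and hits_b:  # keep only entities that co-occur in both texts
            nodes[entity] = (hits_a, hits_b)
    return nodes


def build_edges(nodes):
    """Link two entity nodes when they share at least one sentence
    (a hypothetical edge rule chosen for this sketch)."""
    edges = set()
    names = sorted(nodes)
    for i, u in enumerate(names):
        sents_u = set(nodes[u][0]) | set(nodes[u][1])
        for v in names[i + 1:]:
            sents_v = set(nodes[v][0]) | set(nodes[v][1])
            if sents_u & sents_v:
                edges.add((u, v))
    return edges


# Toy inputs, invented for the example.
doc_a = "Alice met Bob in Paris. Bob later flew to Rome."
doc_b = "Bob visited Paris last week. Carol stayed home."
entities = ["Alice", "Bob", "Paris", "Rome", "Carol"]

nodes = build_cooccurrence_nodes(doc_a, doc_b, entities)
edges = build_edges(nodes)
```

In this toy input only "Bob" and "Paris" appear in both texts, so they become the two nodes. Each node's sentence pair is the kind of short text matching unit whose features a graph model could then aggregate.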
Keywords: news search, UCL, knowledge space, entity disambiguation, text matching