Research On Key Technologies Of News Aggregation Based On Automatic Summarization

Posted on: 2022-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: H J Zhou
Full Text: PDF
GTID: 2518306740982769
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, online media has become part of people's daily life, making news far easier to obtain and broadening readers' horizons. In the new media era, Internet news is fragmented and massive in volume, which leads to information redundancy and dispersed content. News aggregation can condense massive multi-source news and correlate it at the topic level, achieving orderly governance of news data through unified organization and efficient presentation of correlated news. However, this goal faces three main challenges. First, Internet news headlines carry too little information to cover the key points of a news text completely, and clickbait makes headlines less credible. Second, when multiple news items cover the same topic from different perspectives and with different content, it is hard to correlate their content and summarize the main information. Third, without a unified structure for organizing multi-source heterogeneous news in an orderly way, the structure of the aggregated news tends to be unfocused and fragmented. In response to these challenges, this thesis proposes a single-document summarization algorithm based on key information (KI-SSUM) and a multi-document summarization algorithm based on sub-topic representation (STHT-MSUM). Combining the two algorithms, this thesis designs a hierarchical news aggregation method based on the Uniform Content Label (UCL). The main work of this thesis is as follows:

(1) To abstract the main points of a news text accurately and comprehensively, this thesis proposes a single-document summarization algorithm based on key information, named KI-SSUM. First, a key information extraction network is introduced to extract the topic information and the element information of the text, and the two are concatenated to form the key information. Then, the key information is incorporated into the attention mechanism to guide the generation of abstracts. Finally, to ensure the relevance between the topic information and the document, a multi-task joint training method is designed that learns topic extraction and abstract generation simultaneously, using a topic consistency constraint between the source document and the true abstract.

(2) To describe the topic content of multiple news items completely and clearly, this thesis proposes a multi-document summarization algorithm based on sub-topic representation, named STHT-MSUM. First, Transformer and BiLSTM are used to extract the sub-topic representation of each source document, and a central topic representation of the input documents is constructed so that the attention mechanism can generate more topic-relevant document vectors. Then, an information gating mechanism is designed to obtain word vectors with more salient features by filtering the word information. Finally, a hierarchical attention mechanism is introduced to integrate word-level and document-level information, providing rich hierarchical features for the generation of abstracts.

(3) To organize and present multi-source heterogeneous news in a uniform and orderly way, this thesis proposes a hierarchical news aggregation method based on UCL. First, the collected news webpages are indexed by UCL to form a UCL news pool. Second, the KI-SSUM algorithm is used to generate the news abstract and topic representation for each news document in the UCL news pool, and topic clustering is performed on the topic representations. Then, for the news documents in the same topic cluster, the STHT-MSUM algorithm is used to integrate their main information into one topic abstract. Finally, combining the news abstracts generated by KI-SSUM with the topic abstracts generated by STHT-MSUM, the news and the topics are indexed and associated by UCL to form aggregated news UCL labels with concise content and a clear structure.

(4) Based on the above methods, this thesis designs a news aggregation prototype system based on automatic summarization, and verifies the KI-SSUM algorithm, the STHT-MSUM algorithm and the proposed news aggregation method through experiments. The experimental results show that the KI-SSUM algorithm improves on traditional single-document abstractive summarization algorithms on the evaluation metrics, and that the STHT-MSUM algorithm outperforms traditional algorithms on the multi-document summarization task. With the two algorithms, the method proposed in this thesis can effectively realize news-oriented information aggregation and content governance, and gives users more convenient access to Internet news.
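
The aggregation flow in item (3) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say which clustering algorithm is used, so a greedy cosine-similarity clustering with a hypothetical `threshold` stands in for it. `NewsItem`, `TopicCluster`, `cluster_by_topic` and `aggregate` are illustrative names, the `abstract` and `topic_vec` fields are assumed to hold KI-SSUM outputs, and `summarize_topic` is a placeholder callable where STHT-MSUM would be plugged in. UCL indexing is omitted.

```python
import math
from dataclasses import dataclass, field


@dataclass
class NewsItem:
    title: str
    abstract: str    # assumed to come from a single-document summarizer such as KI-SSUM
    topic_vec: list  # assumed topic representation produced for the same document


@dataclass
class TopicCluster:
    centroid: list
    items: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_topic(items, threshold=0.8):
    """Greedy single-pass clustering (an assumption, not the thesis method):
    each item joins the most similar existing cluster if the similarity
    reaches the threshold, otherwise it opens a new cluster."""
    clusters = []
    for item in items:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(item.topic_vec, cluster.centroid)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append(TopicCluster(centroid=list(item.topic_vec), items=[item]))
        else:
            best.items.append(item)
            n = len(best.items)
            # Update the centroid as a running mean of the member vectors.
            best.centroid = [
                (m * (n - 1) + v) / n for m, v in zip(best.centroid, item.topic_vec)
            ]
    return clusters


def aggregate(items, summarize_topic, threshold=0.8):
    """Build a two-level structure: one topic abstract per cluster on top,
    the individual news abstracts underneath."""
    return [
        {
            "topic_abstract": summarize_topic([i.abstract for i in cluster.items]),
            "news": [{"title": i.title, "abstract": i.abstract} for i in cluster.items],
        }
        for cluster in cluster_by_topic(items, threshold)
    ]
```

In this sketch `summarize_topic` takes the list of news abstracts in a cluster and returns one string, so a trivial stand-in such as `lambda abstracts: " ".join(abstracts)` is enough to exercise the structure before a real multi-document summarizer is available.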
Keywords/Search Tags: news aggregation, automatic summarization, topic clustering, deep learning, uniform content label