News Topic Mining And Evolution Analysis Based On UCL

Posted on: 2021-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: W H Li
GTID: 2518306476953109
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology and applications, rich information resources have brought great convenience to users. However, problems such as information redundancy and the difficulty of managing network content have become increasingly prominent. For news articles in particular, documents on important and hot topics may be buried under useless ones, and when news articles describing the same topic are published far apart in time, readers are likely to overlook their connection. The task of news topic mining and evolution analysis aims to detect and track hot topics in news documents and to describe the content of each topic with brief words or sentences, so as to lay out the development of each topic over time for readers. However, traditional topic models often use only limited statistical information from the dataset, so their extraction of semantics from words and text is insufficient. In addition, topics are usually described only with simple keywords or entities, which makes it hard for readers to understand clearly the main content of a topic and how it changes.

To address the insufficient extraction of semantic features by existing topic models, and considering the advantages of structured entities shown in existing research, this thesis first proposes a method for converting news into structured data based on the Chinese national standard UCL (Uniform Content Label). On this basis, a deep neural network model named WDTopic is proposed, which extracts information from both entities and the original news documents for topic detection and tracking across different time slices. To address the poor readability of existing topic representations, this thesis first designs an extractive summarization model named XLNet Sum, which extracts the key sentences of news documents; summaries are then generated at three levels (document, time slice, and topic) and, together with the structured entities, are indexed into UCL to describe topics and improve the readability of the topic evolution process.

The main contributions of this thesis are as follows.

(1) To address the insufficient semantic extraction of existing models, and considering the advantages of structured entities shown in existing research, this thesis first designs a structured indexing strategy for news documents based on the UCL national standard, which provides semantic information for the subsequent models. A WDTopic model based on the Wide & Deep structure is then proposed for topic clustering: the Wide part obtains semantic information from the entities, while the Deep part, a neural language model, obtains semantic information from the original documents. Finally, with a time slice division mechanism, the WDTopic model is applied to the task of topic detection and tracking with time information.

(2) To address the poor readability of existing topic representations, which makes the evolution of topics hard to understand, this thesis first designs a topic-oriented UCL indexing strategy that uses readable summaries changing over time to present a semantic-based topic evolution process. Then, to produce summaries automatically, an extractive summarization model named XLNet Sum is proposed, which uses XLNet to compute the importance of each sentence in an article and thereby generates summaries of the news. Based on the relevance between each news article and the topic, summaries of the time slices and of the entire topic are then obtained. Finally, once the topic UCL has been indexed automatically with these time-varying summaries, the evolution process of topics becomes reasonable and understandable.

(3) Comparative and ablation experiments are designed for WDTopic and XLNet Sum, respectively. The results on two different datasets show that WDTopic performs better than traditional models on the topic clustering task, and the experiments on the CNN/Daily Mail dataset show that XLNet Sum generates better extractive summaries than the baseline models. With the two models, the method proposed in this thesis is able to detect and track topics from news articles and generate readable summaries.

Based on the above experiments, a prototype system for topic mining and evolution analysis based on UCL is designed and implemented, which is able to collect news from the Internet in real time and complete the tasks of news-oriented topic detection, tracking, and presentation.
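
The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it mentions: dividing news into time slices and building an extractive summary from the highest-scoring sentences. All names here (`split_into_time_slices`, `extract_summary`, `length_score`) are hypothetical, and the word-count scorer is a stand-in assumption; in the thesis, sentence importance is computed with XLNet.

```python
from collections import defaultdict
from datetime import date


def split_into_time_slices(docs, start, days_per_slice):
    """Group (publish_date, text) pairs into consecutive fixed-width time slices.

    Assumption: slices are equal-width windows counted from `start`;
    the thesis does not specify its slicing rule in the abstract.
    """
    slices = defaultdict(list)
    for published, text in docs:
        index = (published - start).days // days_per_slice
        slices[index].append(text)
    return dict(slices)


def extract_summary(sentences, score, k):
    """Keep the k highest-scoring sentences, restored to document order."""
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(sentences[i]),
        reverse=True,
    )
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]


def length_score(sentence):
    # Placeholder scorer (word count), NOT the XLNet-based scorer of the thesis.
    return len(sentence.split())


if __name__ == "__main__":
    docs = [
        (date(2021, 1, 1), "first report"),
        (date(2021, 1, 3), "follow-up report"),
        (date(2021, 1, 9), "later development"),
    ]
    print(split_into_time_slices(docs, date(2021, 1, 1), 7))
    print(extract_summary(
        ["one two three", "one", "one two three four", "one two"],
        length_score,
        2,
    ))
```

Restoring the selected sentences to their original order keeps the summary readable; any scoring function with the same signature could replace `length_score`.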
Keywords/Search Tags: topic model, topic mining and evolution analysis, extractive summarization, uniform content label