
Research on an Internet News Event Discovery Algorithm Based on Heterogeneous Information Networks

Posted on: 2022-10-05
Degree: Master
Type: Thesis
Country: China
Candidate: H H Li
Full Text: PDF
GTID: 2518306536496574
Subject: Master of Engineering
Abstract/Summary:
With the advent of the 5G era, the Internet has become an important channel through which people obtain information. While enjoying fast information services, people also face a 'difficult choice' problem caused by the volume and complexity of online information. Event discovery and tracking technology helps users quickly and accurately identify the latest events, find topics of interest, and follow how events develop within massive amounts of news. It also helps enterprises and governments grasp trends in public opinion and plays an important role in the harmonious development of society. From the perspective of practical application, this thesis takes news reports from People's Daily Online and Xinhua Net as its main research objects and studies Scrapy-based data acquisition, a text vector representation algorithm, and a news event discovery method that combines a heterogeneous information network (HIN) with the Transformer mechanism. The resulting system can quickly predict the topic category of a new report and discover related events, making it easier for users to find the information they want.

First, crawling an entire site is time-consuming and produces a high rate of duplicates. To address this, the thesis designs an incremental web crawler that collects data from People's Daily Online and Xinhua Net on a regular schedule. Duplicate news items are removed automatically based on their URLs, which greatly reduces repeated work during collection.

Second, in the traditional term frequency-inverse document frequency (TF-IDF) algorithm, different keywords that appear the same number of times may receive the same TF-IDF value. To address this, the thesis proposes an improved algorithm named A-TFIDF, which adds a different, very small value to each of the identical TF-IDF values. This makes every value unique without distorting the importance of the keywords.

Third, long news texts are difficult to represent as vectors, and traditional event discovery based on word frequency has low accuracy. To address these problems, the thesis proposes an event discovery framework called TRHIN_Framework, which combines the Transformer mechanism with a heterogeneous information network. The Transformer, originally developed for language translation, is applied to topic prediction. The framework first determines the topic category of a new report, then uses topic words to retrieve related event groups and build a heterogeneous information network, then applies a Graph Attention Network (GAT) to extract high-dimensional features, and finally applies DBSCAN clustering to obtain the event clusters.

Finally, experiments are carried out on news datasets from People's Daily Online and Xinhua Net. Comparative experiments demonstrate the validity of the framework for topic prediction and for event discovery and tracking.
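The tie-breaking idea behind A-TFIDF can be illustrated with a minimal Python sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the function names `tfidf` and `break_ties`, the smoothed IDF variant, the offset size `epsilon`, and the alphabetical order in which offsets are assigned. It is not the thesis's implementation.

```python
import math
from collections import Counter


def tfidf(doc_tokens, corpus):
    """Plain TF-IDF for one tokenized document against a tokenized corpus."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        # Number of documents that contain the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF (an assumed variant; the thesis may use another).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = tf * idf
    return scores


def break_ties(scores, epsilon=1e-9):
    """Give every term in a tied group a distinct, very small offset.

    epsilon is a hypothetical parameter. It must be much smaller than the
    gap between genuinely different scores so that the ranking between
    groups is preserved.
    """
    groups = {}
    for term in sorted(scores):  # deterministic order within a tied group
        groups.setdefault(scores[term], []).append(term)
    adjusted = {}
    for value, terms in groups.items():
        for rank, term in enumerate(terms):
            adjusted[term] = value + rank * epsilon
    return adjusted


if __name__ == "__main__":
    corpus = [["a", "b", "c"], ["a", "d"], ["b", "e"]]
    plain = tfidf(corpus[0], corpus)
    print(plain["a"] == plain["b"])      # "a" and "b" tie under plain TF-IDF
    adjusted = break_ties(plain)
    print(adjusted["a"] != adjusted["b"])  # the tie is broken
```

In this toy corpus, "a" and "b" occur equally often in the document and in the same number of documents, so plain TF-IDF cannot distinguish them. After the offsets are added, every keyword has a unique score while "c" still ranks above both.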
Keywords/Search Tags: News Event Discovery, A-TFIDF, TRHIN_Framework, HIN, GAT, DBSCAN