Text Data Stream Concept Drift Detection And Dynamic Topic Detection

Posted on: 2022-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: S K Lin
GTID: 2518306563965099
Subject: Computer technology
Abstract/Summary:
With the continuous development of the mobile Internet, people increasingly express their views on events happening in their region, across the country, or anywhere in the world through online social applications such as Twitter and Weibo, for example in posts and comments. This produces large text data streams which, because of their volume, real-time nature, and wide participation, have become sensors of real-world events. Analyzing such a huge amount of text and extracting the hot topics being discussed not only gives people a way to follow real-time news, but can also help government agencies guide public opinion. However, compared with traditional news media, social-media text data streams generally have short content, irregular data formats, and more noise, which makes topic detection more difficult. In addition, the dynamic change in the distribution of the text data stream itself, that is, concept drift, also brings limitations and challenges for hot topic detection. To explore these issues, this thesis proposes a dynamic hot topic detection algorithm for text data streams. The main work is as follows:

(1) To address the frequent concept drift in text data streams, this thesis proposes KWTDD, a concept drift detection algorithm based on the Kruskal-Wallis statistical test. KWTDD detects changes in the data distribution of the text data stream in a timely and accurate manner and notifies the online learning model to update dynamically, so that it adapts quickly to the changed stream and its learning performance improves. In addition, this thesis designs a drift pre-judgment module that quickly skips the steady phases of the text data stream, which speeds up the execution of the original online learning model.

(2) To detect hot topics in text data streams, this thesis proposes CHClustream, a cluster-based topic detection algorithm. The algorithm consists of two main parts. The first is the CHECM topic clustering algorithm, which aims to address shortcomings of the DBIECM clustering algorithm such as low time efficiency and poor clustering quality. The second is the TF-IDF-AE topic extraction algorithm, which extends the original TF-IDF with factors such as text attention and user influence, and aims to improve topic extraction accuracy when the number of topics detected is small.

(3) For the two algorithms proposed above, this thesis designs multiple sets of comparative experiments. First, on the MOA (Massive Online Analysis) data stream experiment platform, KWTDD and 7 other common concept drift detection algorithms were tested on 27 data sets. The results show that the data stream learning model using KWTDD as the concept drift detector performs best, obtaining the highest F1 score on 83.33% of the artificial data sets. Next, CHClustream was evaluated on COVID-19 data sets containing 1 million tweets. Compared with 6 other common topic detection algorithms, CHClustream not only obtains a higher topic recall rate but also detects topics relatively earlier. Finally, an experiment combining KWTDD with CHClustream for topic detection verifies that KWTDD can improve the real-time performance of topic detection.
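
The abstract does not give the internals of KWTDD, so the following is only a minimal sketch of the general idea behind a Kruskal-Wallis based drift check, not the thesis's algorithm. It assumes a stream of numeric values (for example per-instance losses of an online learner), a fixed reference window, and a sliding recent window; the names `kruskal_wallis_p` and `detect_drift`, the window size, and the significance level are all hypothetical choices made for illustration.

```python
import math
from collections import deque


def kruskal_wallis_p(a, b):
    """Two-group Kruskal-Wallis test with tie correction.

    Returns (H, p). With two groups H is compared against a chi-squared
    distribution with 1 degree of freedom, whose upper tail is
    erfc(sqrt(H / 2)).
    """
    pooled = sorted((v, g) for g, xs in enumerate((a, b)) for v in xs)
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1..j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * (
        rank_sums[0] ** 2 / len(a) + rank_sums[1] ** 2 / len(b)
    ) - 3 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0:  # every pooled value is identical
        return 0.0, 1.0
    h = max(h / correction, 0.0)  # guard against tiny negative round-off
    return h, math.erfc(math.sqrt(h / 2.0))


def detect_drift(stream, window=50, alpha=0.01):
    """Return the index at which drift is first flagged, or None.

    Assumption for this sketch: the first `window` values form a fixed
    reference sample, and each later position is tested once the sliding
    recent window is full.
    """
    reference = []
    recent = deque(maxlen=window)
    for t, x in enumerate(stream):
        if len(reference) < window:
            reference.append(x)
            continue
        recent.append(x)
        if len(recent) == window:
            _, p = kruskal_wallis_p(reference, list(recent))
            if p < alpha:
                return t
    return None
```

Testing at every position inflates the false-alarm rate, so a practical detector would need some correction or a pre-check before running the test; the drift pre-judgment module mentioned above presumably plays a related role, but its design is not described here.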
Keywords/Search Tags:Text Data Stream, Concept Drift, Statistical Test, Topic Detection, Data Stream Clustering