Research On Distributed Classification Methods For Short Text Data Streams

Posted on: 2021-02-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Hu
Full Text: PDF
GTID: 2428330614460388
Subject: Computer application technology
Abstract/Summary:
Short text data streams that emerge in real-world applications, such as Weibo posts, real-time bullet-screen comments (barrage), and real-time commentary, have two major characteristics. On the one hand, each text is very short and lacks sufficient contextual semantic information. On the other hand, short texts arrive quickly and in large volume, and the distribution of their class labels changes constantly. These characteristics cause problems such as information sparseness and concept drift in the classification of short text data streams, which makes it difficult to apply traditional text classification methods directly. How to process massive real-time short text data streams quickly and efficiently has therefore become an important and challenging task for data stream mining in real-world applications. This thesis focuses on distributed classification methods for short text data streams, and its main contributions are as follows.

(1) To address the problems of high-dimensional features, information sparseness, and concept drift, a distributed short text data stream classification method based on Word2vec is proposed. Firstly, an external corpus is used to construct the Word2vec model for the vectorization of short texts. Drawing on the rich information in the corpus, an extended word vector library is built by collecting rare words, which reduces the impact of ambiguous information. Secondly, a distributed LR ensemble model is proposed to classify massive real-time short text data streams. In particular, the classifier parameters are updated online, continuously and in real time, as the data stream arrives, and a time factor is introduced to adapt to concept drift. Thirdly, the distributed processing of the proposed method is implemented on the Apache Spark platform to improve the time performance of the classification method. Finally, experiments conducted on three real-world short text data streams show that, compared with the benchmark algorithm, the proposed method has a lower time cost and a higher classification accuracy.

(2) To further improve the classification accuracy and time performance on short text data streams, a deep learning based short text data stream classification method is proposed. Unlike the method above, the proposed method first adopts multi-granular short text expansion. At word granularity, a Word2vec model is constructed from an external corpus to obtain the correlations between words, which are used to expand each short text. At sentence granularity, a CNN is used to extract deep semantic information from short texts to enrich them further. Secondly, a distributed elastic neural network is proposed, which can expand the depth of the network model by itself according to the state of the current data stream. Meanwhile, a concept drift detector is designed to detect concept drift, and it dynamically adjusts the influence of historical information and input information on the final result in the network. Finally, the distributed processing of the proposed method is implemented on the Apache Spark platform, and experimental results show that the proposed method achieves a higher classification accuracy and a lower time cost than well-known short text data stream classification methods and the Word2vec-based method above.
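
The time-factor idea in contribution (1) can be illustrated with a minimal single-machine sketch. The abstract does not say how the time factor enters the model, so everything here is an assumption made for illustration: "LR" is read as logistic regression, one classifier is trained per data chunk, and each ensemble member's vote is weighted by exp(-decay * age). The names `OnlineLR`, `TimeDecayedEnsemble`, `decay`, and `max_members` are hypothetical, and the Spark distribution and Word2vec vectorization steps are omitted.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, binary label 0 or 1)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLR:
    """Binary logistic regression updated one sample at a time with SGD."""

    def __init__(self, dim: int, lr: float = 0.5) -> None:
        self.w = [0.0] * dim
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x: Sequence[float]) -> float:
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def partial_fit(self, x: Sequence[float], y: int) -> None:
        # Gradient of the log loss with respect to the logit is (p - y).
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


class TimeDecayedEnsemble:
    """Ensemble of per-chunk classifiers; older members get smaller votes.

    Assumption: the weighting exp(-decay * age) is an illustrative choice,
    not the scheme described in the thesis.
    """

    def __init__(self, dim: int, decay: float = 1.0, max_members: int = 5) -> None:
        self.dim = dim
        self.decay = decay
        self.max_members = max_members
        self.members: List[OnlineLR] = []  # oldest first, newest last

    def update(self, chunk: Sequence[Sample], epochs: int = 20) -> None:
        # Train a fresh member on the newest chunk, then drop the oldest
        # members once the ensemble exceeds max_members.
        model = OnlineLR(self.dim)
        for _ in range(epochs):
            for x, y in chunk:
                model.partial_fit(x, y)
        self.members.append(model)
        self.members = self.members[-self.max_members:]

    def predict_proba(self, x: Sequence[float]) -> float:
        if not self.members:
            return 0.5
        n = len(self.members)
        # age 0 for the newest member, n - 1 for the oldest
        weights = [math.exp(-self.decay * (n - 1 - i)) for i in range(n)]
        total = sum(weights)
        votes = (w * m.predict_proba(x) for w, m in zip(weights, self.members))
        return sum(votes) / total


# Toy stream with an abrupt drift: the labels flip between the two chunks.
concept_a = [([1.0], 1), ([-1.0], 0)]
concept_b = [([1.0], 0), ([-1.0], 1)]

ens = TimeDecayedEnsemble(dim=1)
ens.update(concept_a)
p_before = ens.predict_proba([1.0])
ens.update(concept_b)
p_after = ens.predict_proba([1.0])
print(p_before, p_after)
```

In this toy stream the newest member outweighs the older one, so the ensemble's prediction for the same input follows the label flip. A larger `decay` forgets old concepts faster, at the cost of discarding history that may still be useful when the stream is stable.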
Keywords/Search Tags: Data stream classification, short text, concept drift, distributed processing, deep learning