Research On Topic Detection And Tracking Technology Based On Spark

Posted on: 2019-05-21
Degree: Master
Type: Thesis
Country: China
Candidate: Q Xu
Full Text: PDF
GTID: 2348330569987666
Subject: Communication and Information System
Abstract/Summary:
With the development and popularization of the Internet, large amounts of information are produced online every day all over the world. To obtain information about hot topics and their subsequent development trends from this heavy and complicated Internet information in a timely manner, topic detection and tracking technology is used to recognize unknown topics and continuously track known topics in massive Internet data. With the explosive growth of Internet data, traditional topic detection and tracking technology hits a performance bottleneck when processing large volumes of data. To improve the accuracy and efficiency of topic detection and tracking on large-scale data, this thesis studies parallel methods for topic detection and tracking based on the big data processing framework Spark. The main contents of this thesis are as follows:

(1) A new method of parallel topic detection based on the Single-Pass clustering algorithm is proposed. The method has two main steps: text representation and text clustering. For text representation, it parallelizes text vectorization, uses sparse vectors to express text features in order to reduce memory usage and computation overhead, and uses position-based feature weights to highlight the thematic information of a text. For text clustering, it proposes a parallel Single-Pass clustering algorithm to improve computational efficiency, together with a scale change in the text similarity calculation to improve the accuracy of the algorithm. Combining these improvements to text representation and text clustering, the thesis presents the computational process of the parallel topic detection method and its Spark-based computational procedure. Experiments on manually annotated data and on various kinds of large-scale data show that the method has good accuracy and parallelization performance.

(2) A new method of parallel topic tracking based on frequent word sets is proposed. The method collects topic text sets from the information flow using the parallel topic detection method proposed in this thesis, and uses the FP-Growth algorithm to mine the frequent word set of each topic text set as the topic representation. The similarity between the existing data and the tracking data then determines which topic each text belongs to. In this method, a word set represents multiple texts in the same topic, which greatly reduces the similarity computation overhead. EMD (Earth Mover's Distance) and the Word2vec word vector model are used to compute the cosine similarity between word sets, improving the accuracy of word set similarity comparison. The method is implemented on Spark. Experiments on the relevant corpus show that the method is accurate and efficient for topic tracking.
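
The abstract gives no implementation details, so the following is only a minimal sequential sketch of the Single-Pass clustering idea from item (1), written in plain Python rather than Spark. It assumes whitespace tokenization, cosine similarity, and a simple position-based boost for title terms; the names `to_sparse_vector`, `single_pass`, `title_weight`, and `threshold` are hypothetical, and the thesis's parallelization and its scale change in the similarity calculation are not reproduced here.

```python
import math
from collections import Counter


def to_sparse_vector(title, body, title_weight=2.0):
    """Build a sparse term-weight vector as a dict of non-zero entries.

    Assumption: terms in the title get an extra position-based weight
    (title_weight) on top of the plain term count from the body.
    """
    weights = Counter()
    for term in body.lower().split():
        weights[term] += 1.0
    for term in title.lower().split():
        weights[term] += title_weight
    return dict(weights)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_pass(vectors, threshold=0.3):
    """Sequential Single-Pass clustering.

    Each text is compared with every existing cluster centroid. It joins
    the most similar cluster if the similarity reaches the threshold,
    otherwise it starts a new cluster (a newly detected topic).
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            # The centroid is kept as a sum of member vectors; cosine
            # similarity is scale-invariant, so this matches the mean.
            for term, w in vec.items():
                best["centroid"][term] = best["centroid"].get(term, 0.0) + w
        else:
            clusters.append({"centroid": dict(vec), "members": [idx]})
    return clusters


# Toy (title, body) pairs, invented purely for illustration.
docs = [
    ("spark cluster computing", "spark runs parallel jobs on a cluster"),
    ("football match result", "the team won the football match"),
    ("spark parallel jobs", "parallel computing with spark on a cluster"),
]
vectors = [to_sparse_vector(title, body) for title, body in docs]
topics = [c["members"] for c in single_pass(vectors, threshold=0.3)]
print(topics)
```

On this toy input the first and third texts fall into one topic and the second forms its own. Because each text must be compared against all current clusters, the order of arrival affects the result, which is one reason a parallel variant needs care when merging clusters built on different partitions.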
Keywords/Search Tags: Topic detection and tracking, Parallelization, Frequent word set, Single-Pass clustering algorithm, Spark