
Research On Microblog Data Preprocessing And Topic Detection

Posted on: 2015-02-19  Degree: Master  Type: Thesis
Country: China  Candidate: Y Li  Full Text: PDF
GTID: 2268330428980089  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, microblogging, as a new form of online media, plays an increasingly important role in people's everyday online activities such as accessing, delivering, and retrieving information. Compared with traditional media data, microblog data is characterized by short text, real-time forwarding and commenting, and fast topic propagation, which makes it a new object of study. Microblog topic detection, which focuses on how to manage and classify large volumes of microblog data, has become one of the hot spots of current microblog research. This thesis studies three aspects: microblog data collection, short-text data preprocessing, and microblog topic detection. By collecting relevant microblog data and exploiting both the short-text and the structural features of microblogs, it studies microblog data preprocessing and topic detection methods that build on traditional topic detection methods. The main contributions are as follows:

Collecting relevant microblog data through the microblog open API. For data collection, this thesis introduces web crawler technology and a data access program based on the microblog open platform, and analyzes the advantages and disadvantages of the two collection methods through a data access experiment. The results show that the open platform is the more effective way to access data, so the experimental data in this thesis is obtained by calling the microblog API programmatically.

Proposing a new feature extension method for microblog short text. For short-text preprocessing, this thesis proposes an effective way to expand the feature representation of the data: increasing the number of text features by means of multilingual machine translation. Additional knowledge obtained from other languages is used to enrich the short-text features, and matrix factorization is then used to integrate the features and reduce their dimensionality, which alleviates the short-text mining problem to a certain extent.

Improving the Single-Pass incremental clustering algorithm for microblog topic detection. Combining the short-text and structural features of microblogs, this thesis improves the Single-Pass incremental clustering algorithm used in traditional topic detection. By adopting a strategy with a maximum and a minimum similarity threshold, together with the forwarding and commenting relationships between microblog posts and the friendship relationships between microblog users, it proposes the MB-SinglePass microblog topic detection algorithm, which achieves better detection results in the experiments.
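The feature extension step described in the abstract can be illustrated with a small sketch. The abstract does not say which translation system or which matrix factorization the thesis uses, so everything here is an assumption: translations are taken as already available (no translation API is called), tokens are assumed to be whitespace-separated, and non-negative matrix factorization with multiplicative updates stands in for the unspecified factorization. The names expand_features, build_matrix, and nmf are hypothetical.

```python
import random
from collections import Counter

def expand_features(views):
    """Merge several language views of one post into one feature bag.

    views: dict mapping a language code to the post text in that language
    (assumed to be pre-translated and whitespace-tokenized).
    """
    feats = Counter()
    for lang, text in views.items():
        for tok in text.lower().split():
            feats[f"{lang}:{tok}"] += 1  # prefix keeps languages apart
    return feats

def build_matrix(docs):
    """Return a docs x features count matrix and its sorted vocabulary."""
    feats = [expand_features(d) for d in docs]
    vocab = sorted(set().union(*feats))
    return [[f[t] for t in vocab] for f in feats], vocab

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative v (n x m) into w (n x k) and h (k x m).

    Uses multiplicative updates for the squared-error objective.
    Each row of w is a k-dimensional representation of one post.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h
```

In this sketch, each post contributes features from every language view, so two short posts that share few words in the source language can still overlap through their translations. The rows of w then serve as the reduced representation passed to later steps.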
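The two-threshold strategy mentioned in the abstract can be sketched as a Single-Pass clustering loop. This is not the thesis's MB-SinglePass algorithm, whose details the abstract does not give. It is a minimal illustration under stated assumptions: a post joins its most similar topic outright when the similarity reaches the maximum threshold, starts a new topic when it falls below the minimum threshold, and in between joins only if it is linked to the topic by a forward/comment or friendship relation. The post fields (id, text, user, parent, friends) and the threshold values are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(x * x for x in a.values()))
    nb = sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(posts, t_max=0.6, t_min=0.2):
    """Cluster posts in arrival order; thresholds are illustrative only.

    Each post is a dict with hypothetical keys:
      id, text, user, parent (id of the forwarded or commented post, or None),
      friends (iterable of user names the author is connected to).
    """
    clusters = []  # each: {"centroid": Counter, "ids": set, "users": set}
    for p in posts:
        vec = Counter(p["text"].lower().split())
        best, best_sim = None, 0.0
        for c in clusters:
            s = cosine(vec, c["centroid"])
            if s > best_sim:
                best, best_sim = c, s
        # Structural link: the post forwards/comments on a member of the
        # topic, or its author is a friend of an author in the topic.
        related = best is not None and (
            p.get("parent") in best["ids"]
            or bool(best["users"] & set(p.get("friends", ())))
        )
        if best is not None and (
            best_sim >= t_max or (best_sim >= t_min and related)
        ):
            best["centroid"].update(vec)  # summed counts; cosine ignores scale
            best["ids"].add(p["id"])
            best["users"].add(p["user"])
        else:
            clusters.append(
                {"centroid": Counter(vec), "ids": {p["id"]}, "users": {p["user"]}}
            )
    return clusters
```

The middle band is where the structural information matters in this sketch: a short reply that shares only a word or two with the original post would fall below a single strict threshold, but the forward/comment link lets it stay in the same topic.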
Keywords/Search Tags: short text preprocessing, microblog data, topic detection, data collection