Research On Information Filtering Method Of Text Pollution In Twitter

Posted on: 2018-05-22 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Li | Full Text: PDF
GTID: 2348330512487987 | Subject: Communication and Information System
Abstract/Summary:
With the rapid development of communication and information technology, the Internet industry has flourished thanks to its real-time, prompt delivery of information. Because online media spread information quickly and at low cost, users increasingly favor this emerging industry. At the same time, social media, represented by Twitter, have become a major channel of information dissemination and play an increasingly important role in many industries worldwide. How to efficiently extract useful information from the massive volume of social media content has therefore attracted the attention of researchers. Although Twitter provides researchers with a source of textual information, tweets pose processing difficulties that ordinary texts do not. On the one hand, observation shows that most tweets are meaningless noise or redundant information, such as chat between users and retweets. On the other hand, social networks are open, informal media whose users write freely; the published text is affected by various factors, which results in a large number of non-standard words. Based on statistics and analysis of the original tweets, this thesis proposes an effective solution to text pollution in Twitter from two aspects: non-standard words and junk tweets. The main contributions of this thesis are as follows:

(1) By combining traditional spelling error correction with a word vector model, a scheme is proposed that handles non-standard words using semantic information. The shortest edit distance method has difficulty with non-standard words that differ greatly in form from their standard counterparts, so this thesis uses the semantic information of a word to normalize it. In addition, this thesis designs and implements a method for identifying non-standard words, and speeds up word normalization with some auxiliary tools, which greatly reduce the set of words whose semantic relevance must be compared. The normalization experiment on words in tweets shows that error correction based on the cosine distance between word vectors has some practical significance.

(2) By combining word vectors with a convolutional neural network, a relatively complete and effective scheme for filtering junk tweets in Twitter is formed. In exploring the application of convolutional neural networks to tweet filtering, this thesis replaces the pooling layer commonly used in convolutional neural networks with a flattening layer, which reduces the information loss caused by pooling, because tweets carry fewer text features than images. In actual tests on the tweet training set from the research project, this method improves the classifier to a certain extent, and it achieves better results without much parameter tuning.
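The normalization idea in contribution (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy vectors, the `LEXICON` vocabulary, the edit-distance threshold of 1, and the subsequence-based candidate pruning (standing in for the unspecified "auxiliary tools") are all assumptions made for illustration. A real system would presumably use word vectors trained on a tweet corpus.

```python
import math

# Hypothetical toy word vectors; real ones would be trained on tweets.
VECTORS = {
    "tomorrow": [0.90, 0.10, 0.05],
    "tmrw":     [0.88, 0.12, 0.07],
    "tonight":  [0.70, 0.05, 0.70],
    "2nite":    [0.68, 0.07, 0.72],
    "today":    [0.60, 0.50, 0.10],
    "tower":    [0.05, 0.20, 0.95],
    "you":      [0.10, 0.90, 0.10],
    "u":        [0.12, 0.88, 0.09],
}
# Hypothetical standard vocabulary; words outside it count as non-standard.
LEXICON = sorted(["tomorrow", "tonight", "today", "tower", "you"])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def is_subsequence(short, long):
    """True if the letters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def normalize(word):
    """Map a non-standard word to a lexicon word, or return it unchanged."""
    if word in LEXICON:
        return word
    # Stage 1 (traditional correction): accept a lexicon word one edit away.
    close = [w for w in LEXICON if edit_distance(word, w) <= 1]
    if close:
        return min(close, key=lambda w: (edit_distance(word, w), w))
    if word not in VECTORS:
        return word  # no semantic information available
    # Stage 2 (semantic): prune candidates (assumed heuristic), then rank
    # the survivors by cosine similarity of their word vectors.
    candidates = [w for w in LEXICON if is_subsequence(word, w)] or LEXICON
    return max(candidates, key=lambda w: cosine(VECTORS[word], VECTORS[w]))


print(normalize("tomorow"))  # tomorrow (one edit away)
print(normalize("tmrw"))     # tomorrow (edit distance 4, resolved semantically)
print(normalize("2nite"))    # tonight
```

In this sketch the pruning step only shrinks the set of words whose vectors are compared; when no candidate survives, the full lexicon is ranked instead.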
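The difference between pooling and flattening described in contribution (2) can be seen in a small, dependency-free sketch. The tweet embeddings, the two kernels, and the helper names below are hypothetical, and the classifier layers that would follow are omitted; this only shows what each layer keeps from the convolution output, not the thesis's actual network.

```python
# Hypothetical tweet: 5 tokens, each a 2-dimensional word vector.
TWEET = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# Two hypothetical convolution kernels, each spanning 2 tokens.
KERNELS = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]


def conv1d(embeddings, kernel):
    """Valid 1-D convolution of a (tokens x dim) matrix with a (width x dim) kernel."""
    width = len(kernel)
    outputs = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        outputs.append(sum(
            weight * value
            for kernel_row, token in zip(kernel, window)
            for weight, value in zip(kernel_row, token)
        ))
    return outputs


def relu(values):
    return [max(0.0, v) for v in values]


def feature_maps(embeddings, kernels):
    """One activated feature map per kernel."""
    return [relu(conv1d(embeddings, k)) for k in kernels]


def max_pool(maps):
    """Global max pooling: one value per kernel, positions are discarded."""
    return [max(m) for m in maps]


def flatten(maps):
    """Flattening: every activation of every kernel is kept."""
    return [v for m in maps for v in m]


maps = feature_maps(TWEET, KERNELS)
print(maps)            # [[2.0, 1.0, 1.0, 0.0], [0.0, 2.0, 1.0, 1.0]]
print(max_pool(maps))  # [2.0, 2.0]
print(flatten(maps))   # [2.0, 1.0, 1.0, 0.0, 0.0, 2.0, 1.0, 1.0]
```

Here pooling returns the same value for both kernels even though they fired at different positions, while flattening preserves that difference. One consequence of flattening is that the output length depends on the number of tokens, so tweets would need to be padded or truncated to a fixed length before a dense classification layer.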
Keywords/Search Tags: Twitter, Pollution information, Filter, Word vector, Convolution neural network