Research On Information Filtering Method Of Text Pollution In Twitter

Posted on:2018-05-22

Degree:Master

Type:Thesis

Country:China

Candidate:Y L Li

GTID:2348330512487987

Subject:Communication and Information System

Abstract/Summary:

With the rapid development of communication and information technology, the Internet industry has flourished thanks to its real-time, immediate nature. Because online media spread information quickly and at low cost, users increasingly favor them. As a major channel of online information dissemination, social media, represented by Twitter, plays an increasingly important role in many industries worldwide. How to efficiently extract useful information from the massive volume of social media content has therefore attracted the attention of researchers.

Although Twitter provides researchers with a source of textual information, tweets pose processing difficulties that ordinary texts do not. On the one hand, observation of tweets shows that most of them are meaningless noise or redundant information, such as chat between users and retweets. On the other hand, social networks are open, informal media whose users write freely. The published text is affected by many factors, resulting in a large number of non-standard words. Based on statistics and analysis of original tweets, this thesis proposes an effective solution to text pollution in Twitter from two aspects: non-standard words and junk tweets. The main contributions of this thesis are as follows:

(1) By combining traditional spelling-correction techniques with a word vector model, a scheme for handling non-standard words using semantic information is proposed. The shortest edit distance method has difficulty with non-standard words that differ greatly in form from their standard counterparts, so this thesis uses the semantic information of a word to normalize it. In addition, this thesis designs and implements a method for identifying non-standard words, and speeds up word normalization by using some auxiliary tools that greatly reduce the range of words whose semantic relevance must be compared. The normalization experiment on words in tweets shows that correcting errors by the cosine distance between word vectors has some practical significance.

(2) By combining word vectors with a convolutional neural network, a relatively complete and effective scheme for filtering junk tweets in Twitter is formed. In exploring the application of convolutional neural networks to tweet filtering, this thesis replaces the pooling layer commonly used in convolutional neural networks with a flattening layer, which reduces the information lost to pooling, because tweets carry fewer text features than images do. In actual tests on the tweet training set from the project research, this method improves the classifier to a certain extent, and it achieves good results without much parameter tuning.
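The idea behind contribution (1) can be illustrated with a minimal sketch. It is not the thesis's implementation: the vocabulary, the three-dimensional vectors, and the function names (`nearest_by_edit_distance`, `normalize`) are hypothetical, and non-standard words are identified here by a plain dictionary lookup, which is only an assumption. The candidate-narrowing step is omitted because the abstract does not name the auxiliary tools it uses.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# Hypothetical toy data. In practice the vectors would come from a word
# vector model trained on tweets, so that non-standard spellings also
# receive vectors from the contexts they appear in.
STANDARD_WORDS = ["great", "more", "tomorrow"]
TOY_VECTORS = {
    "tomorrow": [0.9, 0.1, 0.0],
    "more":     [0.0, 0.2, 0.9],
    "great":    [0.1, 0.9, 0.1],
    "2mrw":     [0.8, 0.2, 0.1],  # assumed to occur in "tomorrow"-like contexts
}

def nearest_by_edit_distance(token):
    """Baseline: pick the standard word with the smallest edit distance."""
    return min(STANDARD_WORDS, key=lambda w: edit_distance(token, w))

def normalize(token):
    """Map a non-standard token to the semantically closest standard word."""
    if token in STANDARD_WORDS or token not in TOY_VECTORS:
        return token  # already standard, or no vector available
    return max(STANDARD_WORDS,
               key=lambda w: cosine_similarity(TOY_VECTORS[token], TOY_VECTORS[w]))

print(nearest_by_edit_distance("2mrw"))  # edit distance alone prefers "more"
print(normalize("2mrw"))                 # cosine similarity prefers "tomorrow"
```

With these toy values, "2mrw" is three edits from "more" but five from "tomorrow", so the edit-distance baseline picks the wrong word, while the vector comparison recovers the intended one.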
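The difference between pooling and flattening in contribution (2) can be sketched in plain Python. This is an illustration under assumptions, not the thesis's network: the embedded tweet, the two filters, and all function names are made up, max-over-time pooling stands in for "the pooling layer commonly used", and the fully connected classification layer that would follow is left out.

```python
def conv1d_relu(embedded, kernel):
    """Valid 1-D convolution of one filter over token positions, then ReLU.

    embedded: list of token vectors (seq_len x dim)
    kernel:   list of weight vectors (width x dim)
    """
    width, dim = len(kernel), len(kernel[0])
    out = []
    for start in range(len(embedded) - width + 1):
        s = sum(embedded[start + k][d] * kernel[k][d]
                for k in range(width) for d in range(dim))
        out.append(max(0.0, s))
    return out

def global_max_pool(feature_maps):
    """Max-over-time pooling: keep one value per filter."""
    return [max(fm) for fm in feature_maps]

def flatten(feature_maps):
    """Flattening: keep every activation of every filter."""
    return [v for fm in feature_maps for v in fm]

# Hypothetical tweet of 5 tokens with 2-dimensional word vectors.
tweet = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
# Two hypothetical filters, each spanning 2 tokens.
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]

maps = [conv1d_relu(tweet, f) for f in filters]
pooled = global_max_pool(maps)   # 2 features: position information is lost
flat = flatten(maps)             # 8 features: every position is preserved

print(maps)
print(pooled)
print(flat)
```

Pooling reduces each four-value feature map to a single number, while flattening passes all eight activations on to the classifier. For inputs as short as tweets the flattened vector stays small, which is consistent with the abstract's argument that pooling discards more than it saves.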

Keywords/Search Tags:

Twitter, Pollution information, Filter, Word vector, Convolution neural network

Related items

1	Research On Word Vector-based Sentiment Classification
2	Semantic Similarity Measurement Of Short Text By Convolutional Neural Network Based On Multi-Dimensional Attention On Word Vector
3	Chinese Sentiment Analysis With Convolution Neural Network
4	Intelligent Classification Of Social Network Accounts
5	A Study Of Word Vector Extraction Based On Neural Network
6	Research On Sentiment Analysis Based On Convolution Neural Network Using Part-of-speech
7	Research And Implementation Of Text Sentiment Analysis System Based On Neural Network Model
8	Research On Chinese Word Sense Disambiguation Model Based On Bidirectional Recurrent Neural Network
9	A Study On Language Models Based On Neural Networks
10	Research On Linguistic Steganalysis Based On Word Embedding