A Study For Classifying Short Text In Social Media

Posted on: 2019-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wen
GTID: 2348330569487730
Subject: Communication and Information System
Abstract/Summary:
With the rapid development of Web 2.0 and mobile Internet technology, people can share what they see and feel on social media anytime and anywhere, leading to an explosion of text data in social networks. Beyond ordinary users, journalists, official agencies, political leaders, and others also post messages on social media, creating a wealth of valuable information. However, the short length of the texts, the irregularity of their format and content, and the large amount of spam pose severe challenges to text classification methods for social media. Traditional classification methods not only produce highly sparse, high-dimensional text feature vectors (the curse of dimensionality), but also lose word-order information and carry noisy words and sentences, which weakens the expressive power of the semantic feature vectors. To overcome these drawbacks, this thesis first filters out non-news-event spam based on external features. For the more standardized news-event messages, a deep learning model then automatically extracts the semantic features of the text and uses them for multi-class classification at the topic level. The main work and contributions of this thesis are as follows:

1. A short text classification method based on external features for social media. To address the massive noise and spam in social media short texts, this thesis takes Twitter as the research object and extracts 16 external features from the format of the tweets, covering word count, sentence patterns, sentiment tendency, special words, special characters, and so on. These external features separate news-event tweets from non-news-event tweets well, and they effectively reduce the dimensionality and sparseness of traditional text feature vectors. Because the external features are mutually independent and have varied value types, and because ensemble models improve the generalization performance of base classifiers, this thesis selects a random forest for the binary classification of tweets, filtering out non-news-event spam tweets.

2. A short text classification method based on deep learning for social media. This thesis applies the deep learning model C-LSTM to classify news-event tweets, which have sparse semantic information and diverse topics, by topic. Because a deep learning model performs feature extraction and classification together, C-LSTM learns through training to extract the semantic, word-order, and n-gram features of social media short texts automatically, which avoids cumbersome manual feature engineering. Through special structures such as input gates and forget gates, it also automatically "forgets" the noise in news-event tweets and directly captures the keywords and sentences related to the topic or sentiment, achieving multi-class classification of news-event tweets at the topic level.

To verify the effectiveness of the proposed methods, 2,400 labeled tweets are used as the training set for tweet classification based on external features. Under cross-validation, the classification performance of this method is about 13% higher than that of the traditional TF-IDF model and about 3% higher than that of the mainstream method. For the semantics-based tweet classification method, four published social network text data sets are used to evaluate the experimental results. Under cross-validation, the classification accuracy of the C-LSTM model is 3.51% higher than that of CNN and 7.28% higher than that of the traditional word2vec weighted text vector construction method.
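The external-feature step of the first contribution can be illustrated with a minimal Python sketch. The abstract does not list the 16 features, so the nine features below, the regular expressions, the FIRST_PERSON word list, and the function names are all hypothetical stand-ins chosen to match the named categories (word count, sentence patterns, special words, special characters); sentiment tendency is left out because it would require a lexicon the abstract does not describe.

```python
import re

# Illustrative patterns only: the thesis's actual 16 features are not
# given in the abstract, so everything here is an assumption.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"[A-Za-z']+")
NUMBER_RE = re.compile(r"\d+")
FIRST_PERSON = {"i", "me", "my", "we", "our"}  # hypothetical "special words"

FEATURE_ORDER = [
    "word_count", "url_count", "mention_count", "hashtag_count",
    "number_count", "question_marks", "exclamation_marks",
    "uppercase_words", "first_person_words",
]


def external_features(tweet):
    """Return format-level features of a tweet as a dict of counts."""
    url_count = len(URL_RE.findall(tweet))
    # Remove URLs first so that '#' or '@' inside a link is not counted.
    text = URL_RE.sub(" ", tweet)
    mention_count = len(MENTION_RE.findall(text))
    hashtag_count = len(HASHTAG_RE.findall(text))
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    words = WORD_RE.findall(text)
    return {
        "word_count": len(words),
        "url_count": url_count,
        "mention_count": mention_count,
        "hashtag_count": hashtag_count,
        "number_count": len(NUMBER_RE.findall(text)),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        # Fully capitalized words longer than one letter, e.g. "BREAKING".
        "uppercase_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
        "first_person_words": sum(1 for w in words if w.lower() in FIRST_PERSON),
    }


def to_vector(features):
    """Flatten the feature dict into a fixed-order, dense list."""
    return [features[name] for name in FEATURE_ORDER]


if __name__ == "__main__":
    news = "BREAKING: Flood hits city centre, 200 evacuated http://t.co/abc #flood"
    chat = "omg I love my new phone!!! @anna"
    for tweet in (news, chat):
        print(to_vector(external_features(tweet)))
```

Each tweet becomes a short dense vector rather than a sparse vocabulary-sized one, which is the property the abstract credits with reducing dimensionality and sparseness. Such vectors could then be passed to any random forest implementation for the binary news-event versus non-news-event decision; the abstract does not say which implementation the thesis used.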
Keywords/Search Tags: social media, external features, deep learning, short text classification