Study On Short Text Data Mining Based On Social Media

Posted on: 2019-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: N N Du
GTID: 2348330566464287
Subject: Engineering
Abstract/Summary:
With the rapid development of internet technology, social media has become a leading force on the modern internet thanks to its convenience. Platforms such as Weibo, WeChat, Twitter and Facebook are now an important channel for communicating with others and for obtaining and spreading news. Finding the content that people want within social media data is therefore a practical task. However, social media text differs from traditional text, so traditional data mining techniques perform poorly on it. Against this background, this dissertation studies data mining for social media and, based on the characteristics of social media text, investigates two key text mining techniques.

First, this dissertation proposes a feature extraction method for social media information. Extracting and analyzing the social attributes of interest from social media is currently an active research topic. Building an overall profile of a user through feature extraction provides evidence for analyzing the user's interests accurately and quickly. One of the most direct approaches to feature extraction is keyword extraction. Many keyword extraction methods have been published, but they are not well suited to social media short text, which is brief and non-standard in format. This dissertation proposes an improved feature extraction method based on Word2vec and TextRank and applies it to social media text. The Word2vec model maps the text to an abstract word vector space, and the traditional TextRank algorithm is improved by using the semantic features between words, word frequency, and the directional relation between words. The algorithm is also used to generate user labels. Experimental results on Sina Weibo data show that the proposed W-TextRank algorithm improves accuracy, recall and F-measure by 30%, 15% and 20%, respectively, over traditional algorithms, and reduces running time by nearly 30% compared with the traditional TextRank algorithm.

Second, this dissertation proposes a word-vector-based classification method for social media short text. Such text is short, noisy, irregular and sparse in features, so traditional classification algorithms struggle to achieve satisfactory results. In addition, text representations based on the traditional bag-of-words model cannot represent sentences adequately. These factors make social media text difficult to study. Starting from word feature representation, this dissertation studies social media short text classification based on the Word2vec model and the Convolutional Neural Network (CNN) model. Because neither Word2vec nor CNN considers word order and position, the word vectors trained by Word2vec are further combined with word order and position information, yielding the w-Word2vec (WW) and seq-Word2vec (SW) algorithms. The resulting word vectors, which carry word order and position information, are then fed into the CNN model for training. Experiments show that the proposed SW-CNN and WW-CNN algorithms improve accuracy by 2.7% and 3.3%, respectively, over the traditional CNN algorithm on multi-label classification of social media short text.

The work in this dissertation addresses shortcomings of feature extraction and text classification on social media short text and provides a reference method for analyzing users' interests and habits. The proposed methods therefore have both theoretical significance and practical value.
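
The abstract describes its first contribution as TextRank improved with semantic features between words, word frequency, and the directional relation between words. The sketch below illustrates that general idea only; it is not the thesis's W-TextRank. The edge-weighting rule (accumulated cosine similarity per ordered co-occurrence), the function names, and the tiny hand-written vectors standing in for trained Word2vec embeddings are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_textrank(tokens, vectors, window=2, damping=0.85, iters=50):
    """Rank words with TextRank over a directed, similarity-weighted graph.

    Assumed weighting (not taken from the thesis): every time word a appears
    before word b within `window` tokens, max(cosine(a, b), 0) is added to
    the edge a -> b. Repeated co-occurrences accumulate, so frequent pairs
    get heavier edges, and the edge direction follows word order.
    """
    weights = defaultdict(float)  # (src, dst) -> accumulated edge weight
    for i, src in enumerate(tokens):
        for dst in tokens[i + 1 : i + 1 + window]:
            if src != dst and src in vectors and dst in vectors:
                sim = cosine(vectors[src], vectors[dst])
                weights[(src, dst)] += max(sim, 0.0)

    nodes = sorted({word for edge in weights for word in edge})
    out_sum = defaultdict(float)  # total outgoing weight per node
    for (src, _), w in weights.items():
        out_sum[src] += w

    scores = {n: 1.0 for n in nodes}
    for _ in range(iters):
        new_scores = {}
        for n in nodes:
            incoming = sum(
                w / out_sum[src] * scores[src]
                for (src, dst), w in weights.items()
                if dst == n and out_sum[src] > 0
            )
            new_scores[n] = (1 - damping) + damping * incoming
        scores = new_scores
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: 2-d vectors stand in for trained Word2vec embeddings.
toy_vectors = {
    "weibo": [0.9, 0.1],
    "user": [0.8, 0.3],
    "interest": [0.4, 0.9],
    "label": [0.5, 0.8],
}
toy_tokens = ["weibo", "user", "interest", "weibo", "user", "label"]
for word, score in weighted_textrank(toy_tokens, toy_vectors):
    print(f"{word}\t{score:.3f}")
```

In practice the vectors would come from a Word2vec model trained on the target corpus, and the top-ranked words could serve as candidate keywords or user labels.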
Keywords/Search Tags: social media, short text, feature extraction, text classification