Research And Application Of Short Text Analysis Technology For Micro-blog

Posted on: 2015-11-21; Degree: Master; Type: Thesis
Country: China; Candidate: Z Fang; Full Text: PDF
GTID: 2308330473451553; Subject: Software engineering
Abstract/Summary:
With the popularity of micro-blogs, a large amount of text data has accumulated on the Internet. Most of this data is text generated by micro-blog users, and it implicitly reflects their interests. In-depth analysis can uncover this implicit information, which can then serve as input for other applications such as personalized recommendation. The purpose of this thesis is to mine feature words from micro-blog short texts and then use them to identify users' interests. The thesis mainly covers the following:

1. A recognition method for new micro-blog words. Many new network words are not recorded in the dictionary, so a method for recognizing new micro-blog words is proposed. First, a preprocessing scheme for short texts is given according to the special form of micro-blog posts. By defining how the specific symbols “【】” and “##” are handled, the strings they enclose are extracted as candidate words. Then, after dictionary filtering and adjacent-string filtering, mutual information is calculated and the candidates with higher mutual information are taken as new words. The resulting new-word dictionary helps improve Chinese word segmentation.

2. A feature extraction method for micro-blog short texts. Based on the form of micro-blog posts, a feature extraction method that combines SVM, clustering and LDA is presented, taking into account word frequency, text sparsity and latent semantics. First, the data set is clustered with the K-Means++ method; the approximate number of topics is then determined, and the data set is regrouped according to the clustering results. After that, SVM is used to represent the texts, LDA to model them, and the resulting probability distributions to extract feature words.

3. A recognition method for micro-blog users' interests. A dictionary-based recognition method is built on the results of feature word extraction. From a user's feature words, the weight of every topic dictionary is calculated, and the topics whose weight is greater than a given threshold are chosen as the final description of user behavior.

4. A micro-blog user interest recognition system. To apply the methods above to real micro-blog data, a simple micro-blog user interest mining system is designed to demonstrate the results intuitively. The system is divided into three layers: a data acquisition layer, a data analysis layer and an application layer. Data acquisition relies on the Sina API and on crawlers from open-source search engine software. The data analysis layer gathers the three analytical methods proposed in this thesis and applies them to personal micro-blog data. The application layer uses text visualization to present the results more intuitively.
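
The new-word step in item 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the marker pattern, the character-level probability estimate, the rule of scoring a candidate by its weakest internal split, and the names `extract_candidates`, `pmi` and `select_new_words` are all hypothetical. Adjacent-string filtering is omitted.

```python
import math
import re

# Assumed marker pattern: a string enclosed by 【】 or by a pair of # signs.
MARKER_PATTERN = re.compile(r"【([^【】]+)】|#([^#]+)#")


def extract_candidates(post):
    """Return the strings enclosed by 【】 or ## in one post."""
    return [a or b for a, b in MARKER_PATTERN.findall(post)]


def pmi(left, right, corpus):
    """Pointwise mutual information of two adjacent strings.

    Probabilities are estimated as substring counts divided by the total
    number of characters in the corpus (an illustrative estimate).
    """
    total = sum(len(post) for post in corpus)
    if total == 0:
        return float("-inf")

    def prob(s):
        return sum(post.count(s) for post in corpus) / total

    joint = prob(left + right)
    if joint == 0:
        return float("-inf")
    return math.log(joint / (prob(left) * prob(right)))


def select_new_words(corpus, known_words, threshold):
    """Keep marker-enclosed candidates that pass dictionary filtering
    and whose mutual information reaches the threshold."""
    scored = {}
    for post in corpus:
        for cand in extract_candidates(post):
            # Dictionary filtering: skip words that are already known.
            if cand in known_words or len(cand) < 2:
                continue
            # Assumption: the weakest internal split decides the score.
            scored[cand] = min(
                pmi(cand[:i], cand[i:], corpus) for i in range(1, len(cand))
            )
    return sorted(w for w, s in scored.items() if s >= threshold)
```

In this sketch the threshold is a free parameter; the abstract only states that candidates with higher mutual information are kept.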
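
The dictionary-based step in item 3 might look like the sketch below. The abstract does not say how topic weights are computed, so the weighting rule here (the share of a user's total feature-word weight that falls inside each topic dictionary), the default threshold and the name `recognize_interests` are assumptions made only for illustration.

```python
def recognize_interests(feature_words, topic_dictionaries, threshold=0.3):
    """Return the topics whose weight exceeds the threshold.

    feature_words: dict mapping a user's feature word to its weight.
    topic_dictionaries: dict mapping a topic name to a set of words.
    Assumed weighting: fraction of the user's total feature-word weight
    covered by the topic's dictionary.
    """
    total = sum(feature_words.values())
    if total == 0:
        return {}
    weights = {
        topic: sum(w for word, w in feature_words.items() if word in vocab) / total
        for topic, vocab in topic_dictionaries.items()
    }
    # Keep only topics strictly above the threshold, as in the abstract.
    return {topic: w for topic, w in weights.items() if w > threshold}


# Hypothetical example data.
user_words = {"football": 3.0, "league": 2.0, "recipe": 1.0, "camera": 4.0}
dictionaries = {
    "sports": {"football", "league", "goal"},
    "food": {"recipe", "noodles"},
    "photography": {"camera", "lens"},
}
interests = recognize_interests(user_words, dictionaries, threshold=0.3)
```

With this example data, "sports" and "photography" pass the threshold and "food" does not.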
Keywords/Search Tags: micro-blog, short text, topic model, feature extraction, Latent Dirichlet Allocation Model