Research And Application Of Short Text Analysis Technology For Micro-blog

Posted on:2015-11-21

Degree:Master

Type:Thesis

Country:China

Candidate:Z Fang

GTID:2308330473451553

Subject:Software engineering

Abstract/Summary:

With the popularity of micro-blogs, a large amount of text data now exists on the Internet. Most of this data is generated by micro-blog users and implicitly reflects their interests. In-depth analysis can uncover this implicit information, which can then serve as input for other applications, such as personalized recommendation for users. The purpose of this thesis is to mine feature words from micro-blog short texts and then use them to identify users' interests. The thesis mainly covers the following:

1. A recognition method for new micro-blog words. Many new network words are not recorded in the dictionary, so a new-word recognition method for micro-blogs is proposed. First, a short-text preprocessing scheme is given according to the special form of micro-blog posts. By defining the roles of the specific symbols “【】” and “##”, the strings they mark can be extracted as candidate words. After dictionary filtering and adjacent-string filtering, mutual information is calculated, and the candidates with higher mutual information are taken as new words. The resulting new-word dictionary helps improve Chinese word segmentation.

2. A feature extraction method for micro-blog short texts. Based on the micro-blog form, a feature extraction method that combines SVM, clustering and LDA is presented, taking into account word frequency, text sparsity and latent semantics. First, the data set is clustered with the K-Means++ method; the approximate number of topics is then determined, and the data set is regrouped according to the clustering results. After that, SVM is used to represent the texts, LDA to model them, and the resulting probability distributions to extract feature words.

3. A recognition method for micro-blog users' interests. A dictionary-based recognition method is built on the results of feature word extraction. Given a user's feature words, the weight of every topic dictionary is calculated, and each topic whose weight is greater than the given threshold is chosen as the final description of the user's behavior.

4. A micro-blog user interest recognition system. To apply the methods above to actual micro-blog data, a simple user interest mining system is designed to demonstrate the results intuitively. The system is divided into three layers: a data acquisition layer, a data analysis layer and an application layer. Data acquisition uses the Sina API and the crawler of an open-source search engine. The data analysis layer combines the three analytical methods proposed in this thesis and is used to analyze personal micro-blog data. The application layer presents the results more intuitively by means of text visualization.
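Two of the steps described above, mutual-information scoring of candidate words and threshold-based interest recognition, can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the use of pointwise mutual information over adjacent tokens, and the weight formula (the fraction of a user's feature words that appear in a topic dictionary) are all assumptions made for illustration, since the abstract does not give the exact formulas.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent token pair by pointwise mutual information.

    Assumption: PMI over adjacent tokens stands in for the thesis's
    mutual-information step, whose exact definition is not given.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # A high PMI means the pair co-occurs more often than chance.
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def recognize_interests(feature_words, topic_dictionaries, threshold):
    """Return the topics whose weight is greater than the threshold.

    Assumption: a topic's weight is the share of the user's feature
    words found in that topic's dictionary (hypothetical formula).
    """
    if not feature_words:
        return {}
    weights = {}
    for topic, vocab in topic_dictionaries.items():
        hits = sum(1 for word in feature_words if word in vocab)
        weights[topic] = hits / len(feature_words)
    return {t: w for t, w in weights.items() if w > threshold}
```

In use, pairs with the highest PMI scores would be the candidates kept as new words, and the topics returned by recognize_interests would form the user's interest description. Both the cut-off for "high" PMI and the threshold value would need to be tuned on real micro-blog data.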

Keywords/Search Tags:

micro-blog, short text, topic model, feature extraction, Latent Dirichlet Allocation Model

Related items

1	Based On Expending Feature Of LDA For Microblog Short Text Classification
2	Study Of Text Evolution Analysis And Prediction Based On Topic Model
3	Research And Implementation Of Distributed Topic Clustering Technology For Text Flow
4	Research And Application Of Text Classification Model Based On Topic Model
5	Research And Application Of Topic Evolution Model Based On LDA
6	News Topic Discovery Research Based On The LDA Model
7	Research On Text Mining Based On Topic Model
8	Research On Classification And Topic Evolution Of Blog Based On LDA
9	Chinese Text Classification Method Based On Improved Topic Model
10	Research On Classification Algorithm Of Scientific Papers Based On Topic Model