
Research On Short Text Classification

Posted on: 2016-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhang
Full Text: PDF
GTID: 2308330470468960
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of the Internet, social networks have become increasingly important in everyday life and are now one of the main channels through which people communicate and obtain information, with Facebook and Twitter abroad and Sina Weibo and Tencent Weibo in China. Most of the data on these platforms exists as text and is usually subject to a length limit, which makes research on short text highly relevant: extracting useful information from it in a timely way through text mining is particularly important. Traditional text mining covers tasks such as text categorization, text clustering and document summarization, and with the widespread use of text classification technology, short text classification has itself become an active research area.

This thesis surveys the main characteristics of short text, its principal application fields, the current state of research and the key technologies involved. Given the shortness, sparsity and sheer volume of short text, we combine the LDA (Latent Dirichlet Allocation) topic model with information gain (IG) feature selection to improve classification performance and efficiency.

The traditional text representation is the VSM (Vector Space Model), in which features are words or phrases and the document collection is represented as a document-term matrix. Because a short text contains only a few words, the probability that two documents share the same word is far lower than for long text, so traditional long-text classification methods cannot be applied directly. Moreover, the large amount of computation involved makes feature dimensionality reduction necessary for efficiency. LDA is a three-level hierarchical Bayesian model for unsupervised learning that can extract the semantic information hidden in documents without any external knowledge base, while the IG feature selection criterion considers both the presence and the absence of a feature, which makes it effective at filtering out uninformative terms such as stop words.

Based on these observations, Chapter 3 first uses IG to reduce the feature dimension, then performs topic modeling with LDA, and finally builds the classification model using the topics as features. The micro-F1 values of the comparative experiments show that short text classification performance is significantly improved.

The performance of the traditional information gain algorithm drops noticeably when classes and feature terms are unevenly distributed. Chapter 4 therefore first improves the information gain method and then combines it with the LDA model for text classification. Feature distribution uniformity and a relation-tree model are used to reduce the feature dimension within each class and to decrease feature redundancy, mitigating the negative effect of class imbalance on feature selection. A weighted inter-class dispersion is then used as an equilibrium factor to refine the information gain formula, further improving the accuracy of a feature's information gain value across classes and yielding a better feature subset. This improved selection is finally combined with LDA topic modeling for classification. The comparative experimental results show that short text classification performance is improved further still.
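For reference, the standard information gain score of a term t over classes c_1, ..., c_m, which accounts for both the presence of the term (t) and its absence (t-bar), is usually written as follows; the balanced variant proposed in Chapter 4 adds an inter-class dispersion weight on top of this baseline, whose exact form is not reproduced here:

IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
        + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
        + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})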
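The following is a minimal sketch of the Chapter 3 pipeline (feature selection, LDA topic features, then classification with micro-F1 evaluation). The library choices are illustrative assumptions, not the thesis's actual implementation: scikit-learn is used throughout, with mutual_info_classif standing in for information gain and LinearSVC standing in for the classifier.

# Sketch only: IG-style selection -> LDA topics -> classifier -> micro-F1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def classify_short_texts(docs, labels, k_terms=2000, n_topics=50):
    # Split the raw texts first so the test fold stays unseen during fitting.
    docs_tr, docs_te, y_tr, y_te = train_test_split(
        docs, labels, test_size=0.2, random_state=0)

    # VSM: document-term count matrix.
    vec = CountVectorizer(max_features=20000)
    T_tr = vec.fit_transform(docs_tr)
    T_te = vec.transform(docs_te)

    # Step 1: feature selection (mutual information as an IG stand-in)
    # to cut the vocabulary down to at most k_terms dimensions.
    sel = SelectKBest(mutual_info_classif, k=min(k_terms, T_tr.shape[1]))
    T_tr = sel.fit_transform(T_tr, y_tr)
    T_te = sel.transform(T_te)

    # Step 2: LDA maps the reduced term space to dense topic proportions,
    # which become the classification features.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    Z_tr = lda.fit_transform(T_tr)
    Z_te = lda.transform(T_te)

    # Step 3: train a linear classifier on topic features, report micro-F1.
    clf = LinearSVC().fit(Z_tr, y_tr)
    return f1_score(y_te, clf.predict(Z_te), average="micro")

Swapping in the IG formula above, or Chapter 4's balanced variant, would only change the scoring function passed to SelectKBest; the rest of the pipeline stays the same.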
Keywords/Search Tags: Short Text Classification, LDA Topic Model, Information Gain, Feature Selection, Redundant Feature