
Study On Short Text Classification Algorithms Based On Mutual Information

Posted on: 2013-03-28 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Wang | Full Text: PDF
GTID: 2298330467976172 | Subject: Computer software and theory
Abstract/Summary:
In the modern information society, the information explosion and all kinds of electronic communication technologies keep emerging alongside the rapid development of the Internet. Massive amounts of short text are pouring into people’s daily lives. Short texts take many forms, such as email, tweets, mobile short messages, news headlines, book and movie reviews, product descriptions and comments, and business information circulated within enterprises. Because these short texts emerge rapidly and are rich in meaning, interest in them keeps growing. Text processing technology helps people obtain the resources and information they need effectively, which in turn supports their work and study.

A short text is a text of very short length, usually within 100 words. It is concise and simplified but rich in meaning. Text similarity calculation is a basic and important task in text processing and a key technology in text classification. Many methods exist for computing text similarity, for example the TF-IDF method, similarity calculation based on semantic understanding, and LSI. However, these methods do not consider statistical information and semantic information together with their inherent connection, so they are not well suited to similarity calculation for short text. Traditional text classification algorithms focus on normal text, in other words long text. Many techniques are available for it, such as K-NN, Bayesian networks, maximum entropy, and SVM. These techniques are stable, efficient, and effective for long text classification, but they cannot cope with short text classification, because the text is short and its feature vector matrix is sparse. This poses a greater challenge for text processing.

Based on the above analysis, this thesis proposes a new short text similarity function that takes into account the characteristics of short text: short length, few feature words, and strong semantic connections. It not only accounts for the mutual connections between features but also preserves the precision of the similarity value. It expresses semantic connections through statistical information in order to measure the meaningful relation between texts. For short text classification, SVM is currently the best technique, but it cannot meet the high demands of short text classification, where the feature vector matrix is sparse. Analysis of the characteristics of short text shows that keywords have a crucial effect on classification, so this thesis proposes a keyword classification step to assist SVM in short text classification.

Finally, a series of experiments demonstrates the superiority of the short text similarity function based on mutual information and evaluates the KeyWord-SVM algorithm on various performance indicators. The experimental results show that both techniques perform well and stably, enabling quick and efficient short text processing.
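
The abstract does not give the similarity function itself, so the following is only a minimal sketch of the general idea: using co-occurrence statistics (here, normalized pointwise mutual information) to relate different words, so that two short texts with no words in common can still receive a nonzero similarity. The function names `build_counts`, `relatedness`, and `similarity`, the best-match averaging scheme, and the toy corpus are all assumptions made for illustration and are not the thesis's method.

```python
import math
from collections import Counter
from itertools import combinations


def build_counts(docs):
    """Count document frequencies of words and of unordered word pairs."""
    word_df, pair_df = Counter(), Counter()
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    return word_df, pair_df, len(docs)


def relatedness(a, b, word_df, pair_df, n_docs):
    """Normalized PMI of two words, clipped to [0, 1]; identical words score 1."""
    if a == b:
        return 1.0
    joint = pair_df[tuple(sorted((a, b)))]
    if joint == 0:
        return 0.0
    p_ab = joint / n_docs
    if p_ab == 1.0:  # both words occur in every document: NPMI is undefined
        return 0.0
    pmi = math.log(p_ab / ((word_df[a] / n_docs) * (word_df[b] / n_docs)))
    return max(0.0, pmi / -math.log(p_ab))


def similarity(x, y, word_df, pair_df, n_docs):
    """Symmetric average of best-match word relatedness between two token lists."""
    if not x or not y:
        return 0.0

    def directed(src, dst):
        best = (max(relatedness(s, d, word_df, pair_df, n_docs) for d in dst)
                for s in src)
        return sum(best) / len(src)

    return 0.5 * (directed(x, y) + directed(y, x))


# Hypothetical toy corpus of tokenized short texts (assumption, not thesis data).
corpus = [
    ["cheap", "flight", "ticket"],
    ["flight", "ticket", "deal"],
    ["airline", "ticket", "deal"],
    ["movie", "review", "actor"],
    ["movie", "actor", "award"],
    ["film", "review", "award"],
]
stats = build_counts(corpus)
print(similarity(["cheap", "flight"], ["airline", "deal"], *stats))
print(similarity(["cheap", "flight"], ["movie", "actor"], *stats))
```

In this sketch the first pair of texts shares no words, so a plain bag-of-words cosine would be zero, yet the pair still scores above the second pair because "flight" and "deal" co-occur in the corpus. A real system would estimate the statistics from a much larger collection.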
Keywords/Search Tags: short text classification, mutual information, keyword extraction, SVM, short text similarity