Study On Short Text Classification Algorithms Based On Mutual Information

Posted on:2013-03-28

Degree:Master

Type:Thesis

Country:China

Candidate:Y Wang

Full Text:PDF

GTID:2298330467976172

Subject:Computer software and theory

Abstract/Summary:

In the modern information society, the rapid development of the Internet has been accompanied by an information explosion and a steady stream of new electronic communication technologies. Massive amounts of short text are flooding into people's daily lives. Short texts take diverse forms, such as emails, tweets, mobile short messages, news headlines, book and movie reviews, product descriptions and comments, and business information circulated within enterprises. Because short texts are emerging rapidly and carry rich meaning, interest in them keeps growing. Text processing technology helps people obtain the resources and information they need effectively, and so supports their work and study.

A short text is a text of very short length, usually within 100 words. It is concise and simplified but rich in meaning. Text similarity calculation is a basic and important task in text processing and a key technology in text classification. Many methods for computing text similarity exist, for example TF-IDF, similarity calculation based on semantic understanding, and LSI. However, these methods do not consider statistical information and semantic information together with their inherent connection, so they are not well suited to similarity calculation for short text. Traditional text classification algorithms focus on normal text, that is, long text. Many such techniques exist, such as K-NN, Bayesian networks, maximum entropy, and SVM. They are stable, efficient, and effective for long text classification, but they cannot cope with short text classification, because the text is short and its feature vector matrix is sparse. This poses a greater challenge for text processing.

Based on the above analysis, this paper proposes a new short text similarity function that takes into account the characteristics of short text: short length, few feature words, and strong semantic connections. It not only takes the mutual connections between features into account but also ensures the precision of the similarity value. It expresses semantic connections through statistical information in order to measure the meaningful relations between texts. For short text classification, VSM is currently the best technology, but it cannot meet the high demands of short text classification, where the feature vector matrix is sparse. Analysis of the characteristics of short text shows that keywords have a crucial effect on classification. This paper therefore proposes a keyword-based classification step to assist SVM in short text classification.

Finally, a series of experiments demonstrates the superiority of the short text similarity function based on mutual information and evaluates the KeyWord-SVM algorithm on various performance indicators. The experimental results show that both techniques perform well and stably, and that they achieve fast and efficient short text processing.
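
The abstract does not give the similarity function itself, so the following is only an illustrative sketch of the general idea of measuring short text similarity through mutual information between words. It assumes normalized pointwise mutual information (NPMI) estimated from document-level co-occurrence counts, and a symmetric best-match average over the words of the two texts. The names `build_npmi` and `similarity`, the toy corpus, and the scoring rule are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter
from itertools import combinations


def build_npmi(corpus):
    """Return an npmi(a, b) function estimated from a list of token lists.

    Assumption: two words co-occur when they appear in the same short text.
    """
    n = len(corpus)
    word_df, pair_df = Counter(), Counter()
    for tokens in corpus:
        vocab = sorted(set(tokens))
        word_df.update(vocab)                    # document frequency per word
        pair_df.update(combinations(vocab, 2))   # document frequency per sorted pair

    def npmi(a, b):
        if a == b:
            return 1.0                           # identical words match fully
        joint = pair_df[(a, b) if a < b else (b, a)]
        if joint == 0:
            return 0.0                           # never seen together
        if joint == n:
            return 1.0                           # co-occur in every text
        p_ab = joint / n
        pmi = math.log(p_ab / ((word_df[a] / n) * (word_df[b] / n)))
        # Normalize to [-1, 1], then clip negative association to 0.
        return max(0.0, pmi / -math.log(p_ab))

    return npmi


def similarity(tokens_a, tokens_b, npmi):
    """Symmetric average of each word's best association with the other text."""
    if not tokens_a or not tokens_b:
        return 0.0

    def directed(src, dst):
        return sum(max(npmi(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(tokens_a, tokens_b) + directed(tokens_b, tokens_a))


# Toy corpus of tokenized short texts (made up for illustration).
corpus = [
    ["cheap", "phone", "deal"],
    ["phone", "battery", "deal"],
    ["movie", "review", "plot"],
    ["movie", "actor", "plot"],
    ["phone", "battery", "charger"],
    ["movie", "review", "actor"],
]
npmi = build_npmi(corpus)

related = similarity(["phone", "charger"], ["battery", "deal"], npmi)
unrelated = similarity(["phone", "charger"], ["movie", "plot"], npmi)
print(related, unrelated)
```

In this sketch the two texts `["phone", "charger"]` and `["battery", "deal"]` share no word, so a pure term-overlap measure would score them as unrelated, while the co-occurrence statistics still give them a positive similarity. That is the kind of statistical evidence of semantic connection the abstract refers to, under the assumptions stated above.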

Keywords/Search Tags:

short text classification, mutual information, keyword extraction, SVM, short text similarity

Related items

1	Research On Short Text Classification
2	Chinese News Text Classification Combining Keyword Extraction And Attention Mechanism
3	Research On Keyword Extraction Technology Oriented To Conversational Text
4	Forum Data Extraction Based On Similarity Calculation
5	Research On Short Text Classification Based On Feature Expansion
6	Title Classification Research Of Collected Documents Based On Subject Matching
7	Analysis Of Text Information Based On Deep Learning
8	Applications Of Hierarchical Keyword Extraction And Automated Text Classification In Bulletin Board System
9	Research On Semantic Similarity Based On Text Categorization
10	Chinese Keyword Extraction Method Based On Word Span And Its Application In Text Classification