
Short Texts Feature Extraction And Classification Techniques For Supporting Multi-level Semanteme

Posted on: 2015-07-20
Degree: Master
Type: Thesis
Country: China
Candidate: X G Jia
Full Text: PDF
GTID: 2308330482456053
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, large amounts of data are stored in the form of short texts, such as advertising slogans, paper titles, web comments, and Twitter messages. Mining, analyzing, and classifying these short texts is a hot topic in the data mining field, and many short text classification techniques have been proposed in recent years.
The existing methods only expand synonyms and near-synonyms into a large word set to guide short text classification. As a result, they introduce large numbers of irrelevant features, and they consider neither the level of the semantic relationships between words nor the combined semantics of words that appear together. This thesis therefore focuses on extracting multi-level semantic features from short texts and using them to guide short text classification.
The thesis first summarizes the existing text classification technologies, and then proposes a feature extraction and classification framework that supports the multi-level semanteme of short texts. The framework extracts multi-level semantic features from short texts and uses them to guide classification effectively. For word segmentation, the thesis proposes an improved segmentation method based on the principle of part-of-speech tagging, which preserves as many features of the short texts as possible. For feature expansion, the thesis expands the original words in short texts into concepts, instances, and attributes using a knowledge base named Probase, and then generates the feature dictionary by merging the intersections of the features within each class. Finally, an optimization algorithm based on a greedy strategy is designed to reduce the dimension of the feature dictionary that guides short text classification.
However, the Probase-based method expands only the latent semantics of single words and does not take the combined semantics among words into account.
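The expansion, dictionary-building, and reduction steps described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `TOY_KB` is a hypothetical stand-in for Probase (the real knowledge base is not queried here), "merging the intersection of the features in the same class" is read as taking the pairwise intersections within each class and then the union across classes, and the greedy reduction is a plain greedy set cover rather than the thesis's own optimization algorithm, which the abstract does not specify.

```python
from itertools import combinations

# Hypothetical stand-in for Probase: word -> related concepts/instances/attributes.
TOY_KB = {
    "apple": {"fruit", "company"},
    "banana": {"fruit"},
    "iphone": {"smartphone", "company"},
    "android": {"smartphone"},
}

def expand(tokens, kb=TOY_KB):
    """Return the original tokens plus every knowledge-base term linked to them."""
    features = set(tokens)
    for tok in tokens:
        features |= kb.get(tok, set())
    return features

def class_dictionary(expanded_texts):
    """Keep features shared by at least two texts of one class (pairwise intersections)."""
    shared = set()
    for a, b in combinations(expanded_texts, 2):
        shared |= a & b
    return shared

def greedy_reduce(dictionary, expanded_texts):
    """Greedy set cover: repeatedly pick the feature that covers the most uncovered texts."""
    # Only texts that contain at least one dictionary feature can ever be covered.
    uncovered = {i for i, feats in enumerate(expanded_texts) if feats & dictionary}
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(dictionary),
            key=lambda f: sum(1 for i in uncovered if f in expanded_texts[i]),
        )
        chosen.append(best)
        uncovered -= {i for i in uncovered if best in expanded_texts[i]}
    return chosen

# Toy training data: class label -> list of segmented short texts.
train = {
    "food": [["apple", "banana"], ["banana"]],
    "tech": [["iphone", "android"], ["android"]],
}
expanded = {label: [expand(t) for t in texts] for label, texts in train.items()}
dictionary = set().union(*(class_dictionary(e) for e in expanded.values()))
all_texts = [e for texts in expanded.values() for e in texts]
reduced = greedy_reduce(dictionary, all_texts)
```

In this toy run the ambiguous expansion "company" (shared by "apple" and "iphone" across classes) never enters the dictionary, because it is not shared within any single class. That is the kind of irrelevant feature the class-wise intersection is meant to filter out.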
To address the single-word limitation of the Probase-based expansion, this thesis adopts Latent Dirichlet Allocation (LDA) to generate topic features for short texts, and then classifies the short texts using these topic features. In addition, the thesis maps four levels of text semantic features into a single vector space that represents the multi-level semantic features, which greatly improves the accuracy of short text classification.
Finally, to verify the feasibility of the multi-level semantic feature extraction and classification techniques for short texts, a series of experiments is conducted on real data sets. The experimental results demonstrate the feasibility and soundness of the proposed methods.
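A minimal sketch of mapping several feature levels into one vector space is shown below. It rests on stated assumptions: the abstract does not name the four levels or the mapping, so the level names `"word"` and `"concept"`, the binary indicator encoding, and the simple concatenation are all illustrative choices, and the topic distribution is taken as a given input (for example, the output of an LDA model fitted elsewhere) rather than computed here.

```python
def build_index(vocabularies):
    """Assign each (level, feature) pair its own column, level by level."""
    index = {}
    for level, vocab in vocabularies.items():
        for feat in sorted(vocab):
            index[(level, feat)] = len(index)
    return index

def to_vector(level_features, topic_dist, index):
    """Concatenate binary per-level indicators with the topic distribution."""
    vec = [0.0] * len(index)
    for level, feats in level_features.items():
        for feat in feats:
            col = index.get((level, feat))
            if col is not None:  # features outside the dictionary are ignored
                vec[col] = 1.0
    return vec + list(topic_dist)

# Hypothetical per-level dictionaries (only two levels shown for brevity).
vocabularies = {"word": {"apple", "banana"}, "concept": {"fruit", "company"}}
index = build_index(vocabularies)

# One short text: its features at each level and an assumed 2-topic distribution.
text = {"word": {"apple"}, "concept": {"fruit", "company"}}
topics = [0.9, 0.1]
vector = to_vector(text, topics, index)
```

Keeping each level in its own block of columns means that a word and a concept with the same surface form do not collide, and the resulting fixed-length vectors can be passed to a standard classifier such as the SVM listed in the keywords.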
Keywords/Search Tags: short text, classification, multi-level semanteme, feature extraction, Probase, LDA, SVM