Research on Chinese Short-Text Classification

Posted on: 2012-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: Z Y Cui
GTID: 2178330332495146
Subject: Applied Mathematics
Abstract/Summary:
With the emergence of large volumes of short text on the Internet, such as search engine queries, e-mail, microblog posts, and comments, research on short texts has gradually attracted attention. Current text classification techniques are mostly designed for long texts. Although they perform well there, they are not necessarily applicable to short texts, which contain few words, exist in huge numbers, and mostly depend on the network. Domestically, short-text studies have focused on semantic extension, feature processing, and similar topics, and there has been no dedicated, in-depth, systematic study.

This thesis analyzes in detail the scope and characteristics of short text and the research on it, and introduces the current state of research and the key technologies involved. To address the sparse features of short text, and considering that traditional word segmentation loses important semantic information when the vocabulary is small, we use the "word" as the feature unit of short text and, drawing on the concept of co-occurrence, propose a feature extraction method based on word co-occurrence. The method augments traditional word frequency statistics with the amount of information that words in a text share through co-occurrence, so that the word features express the semantics of a short text more fully. Experiments show that the method significantly improves the efficiency of short-text classification.

Among the many classification algorithms, K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) have been shown to classify short text best. Because short texts are so numerous, we adopt the KNN classification algorithm and improve it. KNN must store all training texts and compare each test sample against them before classifying, which requires a large amount of computation, so we propose a KNN classification method based on positive and negative domains.
This method partitions each category of the training set into regions in advance, determining the category's center domain and approximate domain, and then uses the distance from the test sample to each category's center vector to find where the sample falls within each category. KNN is applied only to samples that fall in the approximate domain of a category, which narrows the scope of the KNN search and thus improves the speed and accuracy of classification. Meanwhile, to reduce the misclassification rate, a boundary parameter is set for samples in each category's boundary region and used to increase the category weight when category weights are judged, which raises the classification accuracy for samples in fuzzy boundary regions.
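
The abstract does not give the formula for the word co-occurrence features, so the following is only a minimal sketch of the general idea: a frequency weight augmented with a co-occurrence term. It assumes that the "word" unit is a single Chinese character, that co-occurrence means two characters appearing in the same training text, and that the two terms are combined linearly with a hypothetical mixing parameter `alpha`. The function names are illustrative and do not come from the thesis.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(texts):
    """For each unordered pair of distinct characters, count the texts containing both."""
    pair_counts = Counter()
    for text in texts:
        chars = sorted(set(text))  # sorted so each pair has one canonical key
        pair_counts.update(combinations(chars, 2))
    return pair_counts


def cooccurrence_features(text, pair_counts, n_texts, alpha=0.5):
    """Weight each character by its frequency plus a scaled co-occurrence score.

    The linear combination and `alpha` are assumptions made for illustration.
    """
    tf = Counter(text)
    chars = sorted(tf)
    # Normalise by corpus size and by the number of partner characters.
    denom = max(n_texts, 1) * max(len(chars) - 1, 1)
    weights = {}
    for c in chars:
        cooc = sum(pair_counts[tuple(sorted((c, o)))] for o in chars if o != c)
        weights[c] = tf[c] + alpha * cooc / denom
    return weights


# Example usage on three toy short texts.
corpus = ["手机电池", "手机屏幕", "电池充电"]
counts = cooccurrence_counts(corpus)
print(cooccurrence_features("手机", counts, len(corpus)))
```

Under these assumptions, characters that frequently appear together in the training texts receive a higher weight than their raw frequency alone would give them, which is one way to compensate for the sparsity of very short texts.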
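
The domain-restricted KNN search can be sketched as follows. This is a hedged illustration, not the thesis's implementation: it assumes Euclidean distance, defines each category's center-domain and approximate-domain radii as quantiles of the training distances to the center vector (`center_q` and `approx_q` are hypothetical parameters), and assumes that a sample lying in exactly one center domain is assigned to that category directly. The boundary-parameter weighting mentioned in the abstract is omitted because its definition is not given.

```python
import math
from collections import Counter, defaultdict


def fit_domains(X, y, center_q=0.5, approx_q=1.0):
    """Compute each category's center vector, center radius, and approximate radius."""
    by_label = defaultdict(list)
    for vec, label in zip(X, y):
        by_label[label].append(vec)
    domains = {}
    for label, vecs in by_label.items():
        center = [sum(col) / len(vecs) for col in zip(*vecs)]
        dists = sorted(math.dist(v, center) for v in vecs)
        r_center = dists[int(center_q * (len(dists) - 1))]
        r_approx = dists[int(approx_q * (len(dists) - 1))]
        domains[label] = (center, r_center, r_approx)
    return domains


def classify(x, X, y, domains, k=3):
    """Classify x, running KNN only over categories whose approximate domain contains x."""
    d_center = {label: math.dist(x, c) for label, (c, _, _) in domains.items()}
    # Assumption: a sample inside exactly one center domain needs no KNN search.
    in_center = [l for l, (_, rc, _) in domains.items() if d_center[l] <= rc]
    if len(in_center) == 1:
        return in_center[0]
    candidates = {l for l, (_, _, ra) in domains.items() if d_center[l] <= ra}
    if not candidates:
        candidates = set(domains)  # fall back to plain KNN over all categories
    neighbours = sorted(
        (math.dist(x, v), label) for v, label in zip(X, y) if label in candidates
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Example usage with two well-separated toy categories.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
domains = fit_domains(X, y)
print(classify((0.4, 0.6), X, y, domains))
```

The saving comes from the candidate filter: the neighbour search compares the test sample only against training samples from categories whose approximate domain contains it, rather than against the whole training set.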
Keywords/Search Tags: Short text, Classification, Word co-occurrence, Approximate domain