
Research On Chinese Short Text Subject Classification

Posted on: 2015-01-09; Degree: Master; Type: Thesis
Country: China; Candidate: H T Li; Full Text: PDF
GTID: 2268330428477218; Subject: Computer software and theory
Abstract/Summary:
To classify short texts such as search-engine query strings and image-content descriptions, the common practice is to extend the features of the text and then apply a traditional machine-learning classifier, thereby improving classification quality. However, this approach has several shortcomings: the performance overhead is large and the degree of parallelization is low; the classification accuracy is not high; rapidly updated information is difficult to handle; and model training lacks a large, accurately labeled corpus. To address these problems, this dissertation uses a rule-based topic classification method, which effectively overcomes the shortcomings of the traditional methods. The main research work is as follows:
1) An analysis of the grammar systems and processing methods used in syntactic analysis shows that statistical dependency parsing is the current mainstream technique. This dissertation adopts it as the basis for syntactic analysis and improves the word segmentation module and the named entity annotation module so that these modules meet the needs of topic classification. On this basis, the dissertation introduces the concept of the semantic block, together with decision rules for identifying semantic blocks and for determining the topic of a semantic block.
2) The dissertation analyzes techniques related to taxonomy and subject indexing. Taking full account of the need for topic combination, and building on the integration of classification and subject indexing, it creates a two-level subject classification table for the open-domain Internet, and then uses a mixed method of inverse filtering + TF-IDF + manual judgement to create a high-precision topic dictionary.
The word segmentation results are labeled using the topic dictionary; the topic of a short text can then be determined directly from the rules that the topic-labeled results and the syntactic analysis results satisfy.
3) Based on the above analysis, this dissertation implements a topic mining system based on search logs. The system implements both the rule-based topic classification and a model-based classification grounded in statistical theory. In a practical application, the two methods are compared in four respects: performance, degree of parallelism, adaptability to data updates, and the need for a correctly labeled corpus. The comparison further demonstrates the validity of the proposed method.
Keywords/Search Tags: syntactic analysis, topic dictionary, the topic of question object, classification rule
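The dictionary-based pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function names, the toy token lists, and the majority-vote rule are all assumptions made for illustration. In particular, the majority vote stands in for the syntax-driven semantic block rules, and the inverse filtering and manual judgement steps are not modeled; only the TF-IDF ranking of candidate terms and the dictionary lookup are shown.

```python
import math
from collections import Counter


def tfidf_candidates(topic_docs, top_k=3):
    """Rank candidate dictionary terms for each topic by TF-IDF.

    topic_docs maps a topic name to a list of already segmented tokens;
    all text of one topic is treated as a single document (an assumption).
    """
    n_topics = len(topic_docs)
    # Document frequency: the number of topics in which each term occurs.
    df = Counter()
    for tokens in topic_docs.values():
        df.update(set(tokens))

    ranked = {}
    for topic, tokens in topic_docs.items():
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_topics / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, then alphabetically to break ties.
        best = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        # Terms that occur in every topic score 0 and are dropped.
        ranked[topic] = [term for term, score in best[:top_k] if score > 0]
    return ranked


def build_dictionary(ranked):
    """Invert the ranked candidates into a term -> topic lookup table."""
    return {term: topic for topic, terms in ranked.items() for term in terms}


def classify(tokens, topic_dictionary):
    """Label tokens via the dictionary and return the most frequent topic.

    Returns None when no token is found in the dictionary. The majority
    vote is a hypothetical stand-in for the syntactic decision rules.
    """
    votes = Counter(
        topic_dictionary[t] for t in tokens if t in topic_dictionary
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy, hand-made token lists (hypothetical data, not from the search logs).
topic_docs = {
    "music": ["song", "album", "singer", "download", "song"],
    "travel": ["hotel", "flight", "ticket", "download", "hotel"],
}
ranked = tfidf_candidates(topic_docs)
dictionary = build_dictionary(ranked)
print(classify(["cheap", "hotel", "ticket"], dictionary))  # travel
```

Because "download" occurs under both toy topics, its inverse document frequency is zero and it never enters the dictionary, which is the behavior one would want from a term that does not discriminate between topics. A query with no dictionary hits returns None rather than a guessed topic.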