Chinese Text Feature Extraction And Classification Based On The Semantics Association

Posted on: 2013-07-05
Degree: Master
Type: Thesis
Country: China
Candidate: K Xu
GTID: 2248330374997882
Subject: Computer application technology
Abstract/Summary:
With the rapid development of network technology, enormous amounts of data and information spread quickly over the Internet, and paperless office systems and electronic documents are increasingly used to conduct business. Faced with such volumes of data, there is an urgent need for methods that process this information efficiently and accurately. Text classification offers a suitable approach to this problem.

This thesis reviews the development of text classification in China and abroad, with particular attention to the classification of Chinese texts. Existing text classification techniques have achieved relatively good results on English texts. However, because of the distinctive characteristics of Chinese, they cannot be applied directly to Chinese texts. A gap in previous research is that semantic relationships have been ignored. Through comprehensive analysis and summary, this thesis concludes that semantic factors should be applied throughout the entire classification process for Chinese texts. After the target text is preprocessed, the semantic associations inherent in natural language are used to construct connections among words. In addition, an enhanced TFIDF algorithm is proposed that redefines the weighting formula so that low-probability words that are nevertheless important for classification are not lost. The improved algorithm extracts more accurate feature words for the subsequent classification stage. Building on the improved feature extraction algorithm, the thesis takes the traditional KNN classification algorithm, which is widely used and easy to implement, and improves its ability to handle semantic analysis.
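As a point of reference for the weighting discussed above, the following is a minimal sketch of the standard TF-IDF baseline in Python. The abstract does not give the enhanced weighting formula, so this shows only the conventional form that the thesis sets out to improve. The function names `tfidf` and `top_features` are hypothetical, and the input is assumed to be text that has already been segmented into words.

```python
import math
from collections import Counter


def tfidf(docs):
    """Standard TF-IDF baseline (not the enhanced formula from the thesis).

    docs: list of documents, each a list of already-segmented tokens.
    Returns one {term: weight} dict per document.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # Relative term frequency times log inverse document frequency.
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def top_features(doc_weights, k):
    """Return the k highest-weighted terms of one document."""
    ranked = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]
```

Under this baseline a term that occurs in every document receives weight zero, and a term that is rare within a document receives a low weight even when it is highly indicative of a category. This is the kind of low-probability but important word that the abstract says the enhanced formula is meant to retain.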
The new algorithm takes the feature words that were extracted and grouped during feature extraction. The words in each group are semantically related to one another and belong to the same category, so each group is used to build a class feature set. The similarity between the test text and each class feature set is then calculated. This eliminates the need to compute the similarity between the test text and every training text, which greatly reduces computation. The class feature sets are then ranked by their similarity to the test text to select the likely categories, which narrows the number of candidate categories that must be decided among, and the Chinese text is finally classified. Experimental results show that the enhanced algorithm significantly improves the accuracy and recall of text classification and reduces running time, which demonstrates its effectiveness.

The results of this study demonstrate the importance of semantic factors in feature extraction and classification algorithms. By exploiting the intrinsic semantic relationships among words in feature extraction and incorporating semantics into the χ² statistic formula, the proposed algorithm improves extraction accuracy and greatly reduces computation compared with the traditional KNN classification algorithm. The research in this thesis is therefore of practical value.
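The classification step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation from the thesis: cosine similarity is assumed as the similarity measure, each class feature set is assumed to be a sparse `{term: weight}` vector, and the final decision is assumed to be a k-nearest-neighbour vote restricted to the shortlisted categories. The names `cosine`, `rank_categories`, `classify`, `shortlist` and `k` are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_categories(test_vec, class_features):
    """Rank categories by similarity between the test text and each
    class feature set (one comparison per category, not per training text)."""
    scores = [(c, cosine(test_vec, f)) for c, f in class_features.items()]
    return sorted(scores, key=lambda cs: cs[1], reverse=True)


def classify(test_vec, class_features, training, shortlist=2, k=3):
    """Shortlist the most similar categories, then vote among the k nearest
    training texts that belong to those categories (assumed final step).

    training: list of (vector, label) pairs.
    """
    candidates = {c for c, _ in rank_categories(test_vec, class_features)[:shortlist]}
    # Only training texts from shortlisted categories are compared.
    scored = [
        (cosine(test_vec, vec), label)
        for vec, label in training
        if label in candidates
    ]
    neighbours = sorted(scored, key=lambda sl: sl[0], reverse=True)[:k]
    if not neighbours:
        return None
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

With `C` categories and `N` training texts, the ranking step costs `C` similarity computations instead of `N`, and the vote only touches training texts in the shortlisted categories, which is the source of the reduced computation that the abstract reports.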
Keywords/Search Tags: text classification, feature extraction, semantic association, information gain, TFIDF, χ² statistics, KNN classification