
Research And Implementation Of Chinese Text Categorization System Based On Semantic Similarity

Posted on: 2008-05-02 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Zhang | Full Text: PDF
GTID: 2178360242972551 | Subject: Computer Science and Technology
Abstract/Summary:
Text categorization is the process of analyzing the objects to be classified, extracting their features, comparing those features with the predefined characteristics of each category in a classification system, and then assigning each object to the most similar category and giving it the corresponding category number. It is the foundation and core of text mining technology and a basic task in data mining. Feature extraction and text representation are the key technologies in text classification. Traditional text classification systems generally assume that the words in an article are linearly independent and that the feature dimensions of the vector space model are mutually orthogonal. In fact, however, words in context are related in various ways, such as synonymy, similarity, and conjunction. Using these relationships and the similarity between words, keywords are mapped into a concept space, and the concepts that represent specific words are used to classify text. Many highly similar words are therefore merged into a single concept, and a polysemous word is mapped to different concepts in different contexts. This method improves the cohesion of features, overcomes the limitations of keyword-based classification, mitigates the curse of dimensionality, and improves precision. To loosen the coupling among the parts of the feature extraction model, a database is introduced into the system. Based on this model, the statistics and calculations related to categories, articles, and words are performed efficiently.
Furthermore, the system can flexibly swap the algorithms used in the feature extraction model and run experiments to compare them. In this paper, the "Hownet" and "TongYiCiLin" dictionaries are introduced into the semantic model and used to map features from the keyword space into the concept space, realizing a text classification system based on concept similarity. In building the semantic model, a multi-character hash indexing mechanism, chosen according to the characteristics of the two dictionaries, is used to store them in the computer, which saves space and also improves precision. In the semantic processing of articles, a Support Vector Machine (SVM) is used for learning. For training, about 2000 Chinese texts in 10 classes were collected, and nearly 1000 texts were used to test the classifier. The training and categorization tests show good results for this system.
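
The keyword-to-concept mapping described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the concept IDs, the toy synonym groups, and the helper names `build_word_index` and `to_concept_counts` are all hypothetical, and a plain dictionary stands in for the hash-indexed Hownet and TongYiCiLin data, whose formats the abstract does not give.

```python
from collections import Counter

# Hypothetical toy synonym dictionary: concept ID -> words sharing that concept.
# Real TongYiCiLin/Hownet entries and their formats are not shown in the source.
CONCEPT_GROUPS = {
    "C_COMPUTER": ["电脑", "计算机"],
    "C_HAPPY": ["高兴", "快乐", "开心"],
}

def build_word_index(groups):
    """Invert the concept groups into a word -> concept ID hash map."""
    index = {}
    for concept, words in groups.items():
        for word in words:
            index[word] = concept
    return index

def to_concept_counts(tokens, word_index):
    """Count concepts; words absent from the dictionary remain their own feature."""
    return Counter(word_index.get(tok, tok) for tok in tokens)

WORD_INDEX = build_word_index(CONCEPT_GROUPS)
doc = ["电脑", "计算机", "快乐", "学习"]  # assumed to be word-segmented already
concept_counts = to_concept_counts(doc, WORD_INDEX)
```

Here four keywords collapse into three features, because the two synonyms for "computer" share one concept. The sketch assigns each word to a single concept, so it does not cover the context-dependent mapping of polysemous words that the abstract mentions. The resulting counts could then serve as the feature vector passed to an SVM.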
Keywords/Search Tags:Text Categorization, Support Vector Machine (SVM), Feature Extraction, Hownet, TongYiCiLin