Research And Implementation Of Chinese Text Categorization System Based On Semantic Similarity

Posted on:2008-05-02

Degree:Master

Type:Thesis

Country:China

Candidate:Z Zhang

Full Text:PDF

GTID:2178360242972551

Subject:Computer Science and Technology

Abstract/Summary:

Text categorization analyzes the objects to be classified, extracts their features, compares them with the characteristics predefined in the classification system, assigns each object to the category it most resembles, and gives the corresponding category number. It is the foundation and core of text mining technology and also a basic task in the area of data mining.

Feature extraction and text representation are the key technologies in text classification. Traditional text classification systems generally assume that the words in an article are linearly independent and that the feature dimensions of the vector space model are mutually orthogonal. In fact, words in context are related in various ways, such as synonymy, similarity, and conjunction. Using these relationships and the similarity between words, keywords are mapped into a concept space, and the concepts that represent specific words are used to classify text. Many highly similar words are thus merged into a single concept, and a polysemous word is mapped to different concepts in different contexts. This improves the cohesion of the features, overcomes the limitations of keyword-based classification, mitigates the curse of dimensionality, and improves precision.

To loosen the coupling among the parts of the feature extraction model, a database is introduced into the system. On this basis, many statistics and calculations over categories, articles, and words are performed efficiently. Furthermore, the system can flexibly swap the algorithms in the feature extraction model and run experiments to compare them.

In this paper, the "Hownet" and "TongYiCiLin" dictionaries are introduced into the semantic model and used to map features from the keyword space into the concept space, realizing a text classification system based on concept similarity. In building the semantic model, a multi-character hash indexing mechanism, chosen according to the characteristics of the two dictionaries, is used to construct them in the computer, which optimizes space and also improves precision. In the semantic processing of articles, a Support Vector Machine (SVM) is used for learning. About 2000 Chinese texts in 10 classes were collected for training, and nearly 1000 texts were used to test the classifier. The training and categorization tests show good results for this system.
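
The keyword-to-concept mapping described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the thesaurus entries, the concept IDs, and the function names `keyword_features` and `concept_features` are all hypothetical, English tokens stand in for segmented Chinese words, and context-dependent handling of polysemous words is left out. It only shows how merging synonyms into one concept reduces the number of feature dimensions.

```python
from collections import Counter

# Hypothetical toy thesaurus: word -> concept ID. These entries and IDs are
# invented for illustration and are not taken from Hownet or TongYiCiLin.
WORD_TO_CONCEPT = {
    "car": "C_VEHICLE",
    "automobile": "C_VEHICLE",
    "vehicle": "C_VEHICLE",
    "doctor": "C_PHYSICIAN",
    "physician": "C_PHYSICIAN",
}


def keyword_features(tokens):
    """Bag-of-words counts: one dimension per distinct keyword."""
    return Counter(tokens)


def concept_features(tokens, word_to_concept=WORD_TO_CONCEPT):
    """Bag-of-concepts counts: synonyms share one dimension.

    Words missing from the thesaurus keep their own dimension.
    """
    return Counter(word_to_concept.get(tok, tok) for tok in tokens)


# An already-tokenized document (a real Chinese pipeline would need a
# word segmentation step first, which is not shown here).
tokens = ["car", "automobile", "doctor", "physician", "hospital", "car"]
kw = keyword_features(tokens)
cp = concept_features(tokens)
print(len(kw), "keyword dimensions:", dict(kw))
print(len(cp), "concept dimensions:", dict(cp))
```

In this toy document, five keyword dimensions collapse to three concept dimensions while the total term count is unchanged. Concept count vectors of this kind could then be passed to a classifier such as an SVM.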

Keywords/Search Tags:

Text Categorization, Support Vector Machine (SVM), Feature Extraction, Hownet, TongYiCiLin

Related items

1	Modeling And Implementation Of Chinese Text Categorization System Based On SVM
2	A Study On Text Categorization Based On Machine Learning
3	The Research And Implementation Of Chinese Text Categorization
4	Implementation Of Chinese Text Categorization System Based On SVM
5	Study On Text Categorization Method Based On Support Vector Machine
6	The Research On Text Categorization Algorithm Based On Support Vector Machine
7	Research On Chinese Text Categorization Based On Support Vector Machine
8	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
9	Support Vector Machine Application In Text Categorization
10	Application For Web Text Categorization Based On Support Vector Machine