
The Research On Conducting Chemical Domain Text Classifier Based On Hownet

Posted on: 2008-05-23
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Tang
Full Text: PDF
GTID: 2178360215479877
Subject: Computer applications
Abstract/Summary:
As a branch of machine learning, text classification has received sustained attention. Domain text classification, however, has received much less, even though demand for it is growing steadily. In response to this demand, this thesis studies the problems of domain text classification using text from the chemical domain. It first presents the Vector Space Model (VSM) representation of text and the steps of the algorithm: preprocessing, feature selection, weight calculation, and text similarity computation. It then introduces several common text classification algorithms, such as the Bayesian classifier, K-Nearest Neighbor, and Support Vector Machine, together with the evaluation measures for text classification algorithms. Hownet is a common-sense semantic knowledge base built on two principal notions: the "concept" and the "primitive". A concept is a description of lexical semantics; a primitive is the smallest unit of meaning used to describe a concept. For domain text classification, this thesis constructs a prototype domain knowledge base using the architecture of Hownet, and calculates the semantic similarity between words from the semantic tags of their concepts in Hownet. The thesis also studies how domain text differs from general text and gives it special treatment at each stage of classification. During text preprocessing, the constructed domain knowledge base is used for word segmentation, so that domain terms are not split into single words and do not lose their original meaning. During feature selection, the CHI values computed for domain words are low because of their low term frequency.
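The effect of low term frequency on CHI scores can be illustrated with a minimal sketch. It uses the standard chi-square statistic over a 2x2 term/class contingency table; the function name and the document counts are invented for illustration and are not taken from the thesis.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical corpus: 1000 documents, 100 of them in the target class.
# A rare domain term occurs in only 2 documents, both in the class.
rare_domain_term = chi_square(a=2, b=0, c=98, d=900)
# A frequent term occurs in 50 class documents and 10 others.
frequent_term = chi_square(a=50, b=10, c=50, d=890)

print(rare_domain_term, frequent_term)
```

Although the rare term occurs only inside the class, its score is far below that of the frequent term, so a ranking by CHI alone would tend to cut it.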
Therefore, the values of domain words are amplified so that they are not filtered out after sorting. The weight of domain words also receives special consideration in the weight calculation: given the characteristics of text in the chemical industry, chemical standards, chemical domain words, and general words are weighted in certain proportions so that the terms most significant for classification stand out. A series of comparative experiments verifies that these modifications improve text classification. To address the large vocabulary and the curse of dimensionality, this thesis proposes a new dimension reduction method based on the constructed domain knowledge base. The original feature set is divided into subsets by computing the semantic similarity of feature words, so that words within the same subset are more similar to one another than to words in other subsets. The weights of all feature items in a subset are then summed, and the subset is treated as a single independent feature. This emphasizes the classification significance of each subset and reduces the dimension of the comparison, which improves both the precision and the performance of text classification.
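The dimension reduction idea can be sketched as follows. This is only an illustration under stated assumptions: the greedy grouping rule, the threshold of 0.8, the function names, and the toy similarity table (standing in for a Hownet-based measure) are all hypothetical, and the thesis may partition the feature set differently.

```python
def group_features(terms, similarity, threshold=0.8):
    """Greedy grouping: a term joins the first group whose first member is
    at least `threshold` similar to it; otherwise it starts a new group."""
    groups = []
    for term in terms:
        for group in groups:
            if similarity(term, group[0]) >= threshold:
                group.append(term)
                break
        else:
            groups.append([term])
    return groups


def merge_weights(doc_weights, groups):
    """Sum the weights of all terms in each group into one merged feature."""
    return [sum(doc_weights.get(t, 0.0) for t in g) for g in groups]


# Toy similarity values, invented for illustration only.
TOY_SIM = {
    frozenset(["ethanol", "alcohol"]): 0.9,
    frozenset(["reactor", "vessel"]): 0.85,
}


def toy_similarity(x, y):
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset([x, y]), 0.1)


terms = ["ethanol", "reactor", "alcohol", "vessel", "price"]
groups = group_features(terms, toy_similarity)

# Term weights of one hypothetical document (e.g. from TF-IDF).
doc = {"ethanol": 0.4, "alcohol": 0.3, "vessel": 0.2, "price": 0.1}
reduced = merge_weights(doc, groups)

print(groups)
print(reduced)
```

Here five term features collapse into three merged features, and the document vector shrinks accordingly. A real system would replace `toy_similarity` with a word similarity computed from the knowledge base.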
Keywords/Search Tags: Text classification, Hownet, Domain knowledge base, Feature selection, K-Nearest Neighbor algorithm