Automatic Classification Based On The Concept Of The Text

Posted on: 2003-07-06
Degree: Master
Type: Thesis
Country: China
Candidate: W F Su
GTID: 2208360092471188
Subject: Computer applications
Abstract/Summary:
With the rapid growth of the Internet, a flood of information surges toward us, and managing all the information we have gathered has become an urgent problem. Text Categorization (TC) is an important and commonly used method for dealing with this problem. This paper proposes a new concept-based module for the automatic categorization of natural language text. The module takes How-Net as its main source of knowledge and uses the concepts of words as the basis of text categorization. The concepts of words are reduced to sememes, and TC is performed in the Classifiable Sememe Vector Space (CSVS).

The TC module can be summarized as follows. The TC system is divided into two parts: a training part and a categorization part. Sememes are divided into classifiable sememes and unclassifiable sememes. Keywords are extracted from the text after it has been preprocessed. The keywords are disambiguated according to their parts of speech and context. The concepts of the keywords are then reduced to sememes according to their definitions in How-Net. After all unclassifiable sememes are removed, the text is represented as a vector in the CSVS. Similar texts form a cluster in the CSVS. A new text is represented as a vector in the same way, and its k nearest neighbors are found among the vectors of the training texts. The category that occurs most often among those k texts is taken to be the category of the new text. Experiments show that the recall and precision of this TC module are better than those of TC modules based on keywords. The module also takes less computation time and working space.

This paper puts forward new ideas in three ways.
1. Sememes are divided into classifiable and unclassifiable sememes. We also propose the principle and method for obtaining classifiable sememes. In this way, we can capture the most important domain attributes of a concept.
2. Although earlier papers use concepts to represent a text, they represent those concepts by synonyms. Reducing a concept to sememes represents the nature of the concept more accurately and the relevance between concepts more naturally. As a result, the main idea of a text is represented more accurately by sememes.
3. Word disambiguation is applied to text categorization for the first time. A new disambiguation algorithm is put forward in this paper.
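The categorization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the word-to-sememe table `WORD_SEMEMES`, the set `CLASSIFIABLE`, and the training texts are invented toy data (a real system would read definitions from How-Net), cosine similarity is assumed as the nearness measure because the abstract does not name one, and the disambiguation step is omitted.

```python
from collections import Counter
import math

# Hypothetical word -> sememe definitions (toy data, not taken from How-Net).
WORD_SEMEMES = {
    "doctor": ["human", "medical", "cure"],
    "hospital": ["institution", "medical", "cure"],
    "nurse": ["human", "medical"],
    "striker": ["human", "sport", "compete"],
    "stadium": ["facility", "sport"],
    "match": ["fact", "sport", "compete"],
}

# Hypothetical classifiable sememes: those assumed to carry domain information.
CLASSIFIABLE = {"medical", "cure", "sport", "compete"}


def to_vector(keywords):
    """Reduce keywords to sememes and count only the classifiable ones."""
    counts = Counter()
    for word in keywords:
        for sememe in WORD_SEMEMES.get(word, []):
            if sememe in CLASSIFIABLE:
                counts[sememe] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse sememe-count vectors."""
    dot = sum(u[s] * v[s] for s in u if s in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(keywords, training, k=3):
    """Return the most frequent category among the k nearest training texts."""
    query = to_vector(keywords)
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set: (sememe vector, category) pairs.
TRAINING = [
    (to_vector(["doctor", "hospital"]), "health"),
    (to_vector(["nurse", "hospital"]), "health"),
    (to_vector(["striker", "match"]), "sports"),
    (to_vector(["stadium", "match"]), "sports"),
]

if __name__ == "__main__":
    print(classify(["nurse", "doctor"], TRAINING, k=3))
    print(classify(["stadium", "striker"], TRAINING, k=3))
```

Because unclassifiable sememes such as "human" are dropped before vectors are built, the vector space has one dimension per classifiable sememe rather than one per keyword, which is consistent with the abstract's remark that the module takes less computation time and working space.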
Keywords/Search Tags: text categorization, text representation, kNN, How-Net, recall, precision, sememe, classifiable sememe, vector space, vector