A Class Core Extraction Method For Text Categorization

Posted on: 2011-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: J X Zhang
GTID: 2178360305490589
Subject: Computer application technology
Abstract/Summary:
With the exponential growth of electronic documents, automatic text categorization is attracting increasing attention. By automatically assigning a document to the appropriate class according to its content, it helps people organize electronic documents and find valuable information more quickly and efficiently. In recent years, research on automatic text categorization has developed rapidly, and many new techniques and methods have emerged. However, these techniques and methods still face many difficulties when applied at large scale, and many issues remain worth studying.

From the perspective of research route, text categorization approaches can be divided into two kinds: the empirical approach and the rationalist approach. A typical representative of the former is text categorization based on machine learning, currently the mainstream method, while the latter is represented by concept-based text categorization. After comprehensively analyzing the strengths and weaknesses of both, and inspired by the cognitive process of manual categorization, we propose a new text categorization method called Class Core Extraction (CCE). The method can be viewed as an organic combination of the two routes: it uses the rationalist approach to build the categorization framework and the empirical approach to acquire categorization knowledge automatically by machine.

The main idea of class core extraction is as follows. In natural language, terms are used to express concepts. If the terms that carry categorization information can be found, and a unique set of such terms (a class core) is constructed for each category, then a computer can also categorize texts automatically according to their content, using the class cores as a guide. The dissertation designs two extraction methods: the experience method and the center of a circle method.
An indicator called term contribution power is defined. It is the sole criterion for admitting terms into class cores. It reflects a term's ability to discriminate between categories and the degree to which the term carries class information, and it is a comprehensive measure of the term's frequency distribution both within a category and across categories. Term contribution power is not only the criterion for selecting class core terms but also a form of knowledge retained for subsequent categorization. We designed a categorization algorithm based on class cores, called the lottery algorithm, which is in essence a special set operation.

Compared with the traditional categorization model, the class core extraction model is more concise. In an experimental system built for this method, we compared it with four feature selection methods and two classical categorization algorithms. The results show that the method performs stably and has a large advantage in categorization speed, striking a good balance between these two main indicators.
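
The abstract does not give the formula for term contribution power or the details of the lottery algorithm, so the following is only an illustrative sketch of the general idea, not the thesis's method. It assumes a stand-in score (a term's relative frequency within a class multiplied by the share of its occurrences that fall in that class) and a simple voting rule (sum the retained scores of the terms a document shares with each class core). The function names, the score, and the `threshold` parameter are all hypothetical.

```python
from collections import Counter, defaultdict


def contribution_power(term, cls, term_counts, class_totals):
    """Hypothetical stand-in score, NOT the thesis's definition.

    Combines frequency within the class with concentration across classes.
    """
    in_class = term_counts[cls][term]
    if in_class == 0:
        return 0.0
    overall = sum(counts[term] for counts in term_counts.values())
    within = in_class / class_totals[cls]   # frequency within the class
    across = in_class / overall             # share of occurrences in this class
    return within * across


def extract_class_cores(labeled_docs, threshold=0.1):
    """Build one weighted term set (class core) per category.

    labeled_docs: iterable of (class_label, list_of_tokens) pairs.
    threshold: assumed cut-off on the stand-in score.
    """
    term_counts = defaultdict(Counter)
    for cls, tokens in labeled_docs:
        term_counts[cls].update(tokens)
    class_totals = {cls: sum(c.values()) for cls, c in term_counts.items()}

    cores = {}
    for cls, counts in term_counts.items():
        scored = {
            t: contribution_power(t, cls, term_counts, class_totals)
            for t in counts
        }
        # Keep the score with each term so it can be reused at classification time.
        cores[cls] = {t: s for t, s in scored.items() if s >= threshold}
    return cores


def classify_by_core_votes(tokens, cores):
    """Assumed voting rule: intersect the document's term set with each
    class core and sum the retained scores; the highest total wins."""
    doc_terms = set(tokens)
    votes = {
        cls: sum(core[t] for t in core.keys() & doc_terms)
        for cls, core in cores.items()
    }
    best = max(votes, key=votes.get)
    return best if votes[best] > 0 else None  # None when no core term matches
```

In this sketch, classification reduces to set intersections plus a sum over stored scores, which is one way to read the abstract's remark that the algorithm is in essence a set operation and that contribution power is kept as knowledge for later categorization.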
Keywords/Search Tags: text categorization, class core extraction, term contribution power, lottery algorithm