Study Of Concept-based Text Classification Algorithm

Posted on: 2011-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Li
GTID: 2178360308963590
Subject: Computer software and theory
Abstract/Summary:
As information technology advances and the Internet grows ever more popular, humanity is undertaking a huge project in the history of information: moving existing real-world information, such as newspapers, periodicals, books, and patent documents, onto the network, so that the network as a whole becomes an unprecedented, enormous database. How to quickly find and obtain the required information in this vast information space has become one of the most fundamental problems of the new information age. Text classification automatically assigns texts to categories according to their content, which helps people better understand and mine texts and improves the quality of information services. It plays an important role in many text mining and information retrieval systems. The study of text classification has therefore become a very important topic in data mining and one of the most important research directions in information processing.
The main contribution of this thesis is a new text classification algorithm for professional fields, called the Text Classification Algorithm Based on WordNet Concepts. The primary problem in text classification is how to turn text data into mathematical data. Today, most text classification algorithms use the Vector Space Model to represent text. This method, however, takes the single word as the feature item and so ignores the semantic links between natural language words. As a result, synonymy and polysemy in texts go unhandled, which severely reduces the accuracy of text information processing, and the representation becomes high-dimensional and sparse. These problems have greatly affected the speed of text classification.
Several methods have been tried to solve these problems, such as adjusting vector space weights and reducing dimensionality, but each has its own drawbacks: weight adjustment is not very effective and yields only a very limited gain in classification performance, while dimensionality reduction solves the high-dimensional sparsity problem only at a large cost.
This thesis applies natural language processing techniques and results and builds on WordNet, an English knowledge system, to establish a concept-based text representation model. Concepts are taken as the text features, and concept and concept distance are introduced into the vector space model from a semantic and conceptual point of view. Experimental results show that the new algorithm handles synonymy and polysemy better while improving the accuracy and speed of text classification.
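The contrast between word features and concept features can be illustrated with a minimal sketch. It is not the thesis's algorithm: the `TOY_CONCEPTS` table is a hypothetical, hand-written stand-in for a WordNet lookup, the function names are invented for illustration, and neither polysemy handling nor concept distance is modeled. It only shows how mapping synonyms to a shared concept changes the vector space representation.

```python
import math
from collections import Counter

# ASSUMPTION: a tiny hand-made word -> concept table standing in for WordNet.
# The concept IDs are hypothetical and not real WordNet synset names.
TOY_CONCEPTS = {
    "car": "C_motor_vehicle",
    "automobile": "C_motor_vehicle",
    "engine": "C_engine",
    "motor": "C_engine",
}


def word_vector(text):
    """Classic vector space model: one dimension per distinct word."""
    return Counter(text.lower().split())


def concept_vector(text):
    """Concept-based variant: synonyms collapse onto one concept dimension.

    Words missing from the table are kept unchanged as their own feature.
    """
    return Counter(TOY_CONCEPTS.get(w, w) for w in text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_a = "car engine"
doc_b = "automobile motor"

# With word features the two documents share no dimension at all.
word_sim = cosine(word_vector(doc_a), word_vector(doc_b))

# With concept features they occupy the same two dimensions.
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))

# Size of the feature space needed to represent both documents.
word_dims = len(set(word_vector(doc_a)) | set(word_vector(doc_b)))
concept_dims = len(set(concept_vector(doc_a)) | set(concept_vector(doc_b)))

print(word_sim, concept_sim, word_dims, concept_dims)
```

In this toy case the word-based similarity is zero although the two documents mean the same thing, while the concept-based similarity is close to one and the feature space shrinks from four dimensions to two. This is the synonym and sparsity effect the abstract describes, shown on deliberately artificial data.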
Keywords/Search Tags:Text Classification, WordNet, Concept, Software Design