
Research And Implementation On Automatic Construction System For Text Categorization Corpus

Posted on: 2010-01-08; Degree: Master; Type: Thesis
Country: China; Candidate: W Wu
GTID: 2178360275451806; Subject: Computer application technology
Large-scale corpora contain a wealth of linguistic phenomena and reflect the general patterns of language use. They have attracted interest from information technology researchers and linguists in many countries and have become a hot topic in natural language processing. In particular, with the development of statistics- and rule-based methods, corpora have become the core and foundation of this research. At present, however, Chinese corpora are scarce, and Chinese corpora for text categorization are especially so. Since text categorization has itself become the core and foundation of large-scale data processing applications, the lag in corpus research now obstructs the development of information technology.

At present, corpora are built by organizing experts from various fields to select, from a large body of material, texts that satisfy the requirements of the corpus. This process consumes a great deal of human and material resources, and the result depends on the level of the participating experts, which makes it somewhat subjective. At the same time, as natural language processing develops, specialized and vertical-domain corpora remain extremely rare. To reduce cost and human involvement and to shorten the time needed to build a corpus, this thesis, based on an analysis of existing corpora, proposes and implements an algorithm for automatically constructing a corpus for Chinese text categorization. The main contributions are:

1. Designed and implemented a system for automatically constructing a Chinese text classification corpus. The system automatically crawls Web pages from the Internet, processes them, extracts the main content, obtains the core words, and controls the scale of the corpus.
2. Proposed and implemented an algorithm for automatically recognizing and unifying Web page encodings. The algorithm recognizes the encoding of each downloaded Web page and converts all pages to a single, manageable encoding. The module can easily be reused in other Web data processing programs.
3. Analyzed the structure of the downloaded pages and implemented a method for extracting the body text. The method processes each page and extracts its topic-related information.
4. Proposed the concept of category core words and implemented an algorithm for obtaining them. After the core words are ranked by importance, they are used together with the category name to extend the scale of the corpus.

Experiments show that the system can construct a text categorization corpus automatically and that the corpus it constructs performs well with various classifiers. The system therefore has practical value.
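The encoding recognition and unification step in contribution 2 can be illustrated with a minimal Python sketch. The abstract does not describe the thesis's actual algorithm, so everything here is an assumption made for illustration: the function names `detect_encoding` and `unify_to_utf8`, the candidate encoding list, the choice of UTF-8 as the unified target, and the strategy of trusting a declared `charset` first and falling back to trial decoding.

```python
import re

# Candidate encodings tried in order (an illustrative assumption,
# not the list used in the thesis).
CANDIDATES = ("utf-8", "gb18030", "big5")

# Matches a charset declaration inside a <meta> tag in the raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def detect_encoding(raw: bytes) -> str:
    """Guess a page's encoding: declared charset first, then trial decoding."""
    match = META_CHARSET.search(raw[:4096])
    if match:
        declared = match.group(1).decode("ascii").lower()
        try:
            raw.decode(declared)  # accept the declaration only if it decodes
            return declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declaration; fall through
    for encoding in CANDIDATES:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "latin-1"  # last resort: decodes any byte sequence


def unify_to_utf8(raw: bytes) -> bytes:
    """Re-encode a downloaded page as UTF-8, whatever its source encoding."""
    return raw.decode(detect_encoding(raw)).encode("utf-8")


if __name__ == "__main__":
    html = '<html><head><meta charset="gb2312"></head><body>语料库</body></html>'
    page = html.encode("gb2312")
    print(detect_encoding(page))
    print(unify_to_utf8(page).decode("utf-8"))
```

Trial decoding is order-sensitive: strict UTF-8 is tried first because it rejects most non-UTF-8 byte sequences, while the later candidates accept a wider range of input and can therefore mislabel a page. A production system would likely add statistical detection on top of this.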
Keywords: Text categorization, corpus, Web data mining