Research And Improvement On Automatic Construction System For Text Categorization Corpus

Posted on: 2012-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Li
GTID: 2178330335952451
Subject: Computer Science and Technology
Abstract/Summary:
Corpora and Natural Language Processing (NLP) complement each other: the corpus is the foundation of many NLP applications that use statistical language models, and corpus construction and application is an important topic in NLP. China has made a number of important achievements in the construction and application of Chinese corpora, and some of these research results have been applied to Chinese text categorization. However, as information processing technology develops, more and more NLP applications require large, highly specialized Chinese text classification corpora, and traditional methods of building computer corpora cannot fully meet these requirements for timeliness and specialization. The construction of Chinese text classification corpora has therefore become an important research issue.

In this thesis, an automatic construction system for Chinese text classification corpora is studied theoretically and improved. The main research work is as follows:

1. Studied the theory and technology of computer corpora and analyzed an automatic construction system for Chinese text classification corpora, including its design rationale, implementation procedure, and methodology, and then proposed optimization ideas based on a detailed analysis of the prototype system.
2. Proposed and implemented an approach for extracting the content information of a web page using its density features. The method first parses the web page and partitions it into textual blocks, then calculates the values of specific density features for each block, and finally uses the C4.5 decision tree algorithm to build a classification model for textual blocks. With this classifier, the content information of a web page can be extracted by identifying its content textual blocks.
3. Introduced the technology related to web page de-duplication, briefly described representative de-duplication approaches with a focus on their distinguishing characteristics, and then presented an improved web page de-duplication approach based on the Shingling algorithm. The improved method first extracts the web page's content information as a text document, then represents the text document as a set of unique contiguous subsequences of notional words, and finally roughly classifies the text documents using the ratio of the sets' element counts, which avoids needless similarity computation and improves performance.
4. Implemented the optimization ideas by applying the content information extraction method and the improved web page de-duplication approach to the automatic construction system for Chinese text classification corpora.

Experiments show that the Chinese text classification corpus constructed by the improved automatic construction system is more accurate and performs well when applied to text categorization.
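
The de-duplication step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the shingle width `w=3`, the threshold `0.8`, and the use of Jaccard similarity as the final measure are all assumptions made for illustration, and the input is assumed to be a list of notional (content) words that has already been segmented. The pre-filter relies on the fact that the Jaccard similarity of two sets can never exceed the ratio of the smaller set's size to the larger set's size, so pairs whose size ratio falls below the threshold can be skipped without computing the intersection.

```python
from typing import FrozenSet, Sequence, Tuple

Shingle = Tuple[str, ...]


def shingles(words: Sequence[str], w: int = 3) -> FrozenSet[Shingle]:
    """Return the set of unique contiguous w-word subsequences.

    `words` is assumed to be already segmented and reduced to notional words.
    A document shorter than w words yields a single shingle (or none if empty).
    """
    if not words:
        return frozenset()
    if len(words) < w:
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + w]) for i in range(len(words) - w + 1))


def size_ratio(a: FrozenSet[Shingle], b: FrozenSet[Shingle]) -> float:
    """Ratio of the smaller set's size to the larger set's size (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


def jaccard(a: FrozenSet[Shingle], b: FrozenSet[Shingle]) -> float:
    """Jaccard similarity |A & B| / |A | B| (0.0 if both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def is_near_duplicate(a: FrozenSet[Shingle], b: FrozenSet[Shingle],
                      threshold: float = 0.8) -> bool:
    """Decide whether two shingle sets are near-duplicates.

    Since jaccard(a, b) <= size_ratio(a, b), a pair whose size ratio is
    below the threshold cannot reach it, so the costlier set intersection
    and union are skipped for such pairs.
    """
    if size_ratio(a, b) < threshold:
        return False
    return jaccard(a, b) >= threshold
```

For example, a six-word document and a three-word document produce four shingles and one shingle respectively at `w=3`, so their size ratio is 0.25 and the pair is rejected before any similarity computation takes place.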
Keywords/Search Tags:Content Extraction, Webpage De-duplication, Corpus, Web Data Mining