Research And Improvement On Automatic Construction System For Text Categorization Corpus

Posted on:2012-01-21

Degree:Master

Type:Thesis

Country:China

Candidate:Y Z Li

Full Text:PDF

GTID:2178330335952451

Subject:Computer Science and Technology

Abstract/Summary:

Corpora and Natural Language Processing (NLP) are complementary: a corpus is the basis of many NLP applications that use statistical language models, and corpus construction and application is one of the important topics in NLP. China has produced a number of important results in the construction and application of Chinese corpora, and some of these results have been applied to Chinese text categorization. However, as information processing technology develops, more and more NLP applications require large, highly specialized Chinese text classification corpora, and traditional methods of building computer corpora cannot fully meet the requirements for timeliness and specialization. The construction of Chinese text classification corpora has therefore become an important research issue.

This thesis studies an automatic construction system for Chinese text classification corpora and improves it. The main research work is as follows:

1. Studied the theory and technology of computer corpora and analyzed an automatic construction system for Chinese text classification corpora, including its design rationale, implementation procedure, and methodology, and then proposed several optimizations based on a detailed analysis of the prototype system.

2. Proposed and implemented an approach for extracting the content information of a webpage using its density features. The method first parses the webpage and partitions it into textual blocks, then calculates specific density features for each block, and finally uses the C4.5 decision tree algorithm to build a classification model for textual blocks. With this classifier, the content of a webpage can be extracted by identifying which of its textual blocks are content blocks.

3. Introduced the technology of webpage de-duplication, briefly described representative webpage de-duplication approaches with a focus on their distinguishing characteristics, and then presented an improved webpage de-duplication approach based on the Shingling algorithm. The improved method first extracts the content of the webpage as a text document, then represents the text document as a set of unique contiguous subsequences of notional words, and finally coarsely classifies the text documents by the ratio of the numbers of elements in their sets, which avoids needless similarity computations and improves performance.

4. Implemented the optimizations by applying the content extraction method and the improved webpage de-duplication approach to the automatic construction system for Chinese text classification corpora.

Experiments show that the Chinese text classification corpus built by the improved automatic construction system is more accurate and performs well when used for text categorization.
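
The block-partitioning and density-feature steps of the content extraction method can be illustrated with a minimal Python sketch. The abstract does not say which density features the thesis uses, so the link density used here (the share of a block's non-whitespace characters that sit inside anchor tags) is an assumption, as are the names `BlockSplitter`, `extract_blocks`, and `is_content` and the two thresholds. The hand-written rule in `is_content` only stands in for the trained C4.5 decision tree described in the abstract; it is not that model.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed, illustrative set).
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3"}
# Tags whose text is never page content.
SKIP_TAGS = {"script", "style"}


def _visible_len(text):
    """Number of non-whitespace characters in text."""
    return len("".join(text.split()))


class BlockSplitter(HTMLParser):
    """Splits a page into textual blocks and records a link density per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._in_skip = 0

    def flush(self):
        """Close the current block and store it if it holds any text."""
        text = " ".join("".join(self._parts).split())
        chars = _visible_len(text)
        if chars:
            self.blocks.append({
                "text": text,
                "chars": chars,
                "link_density": self._link_chars / chars,
            })
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1
        elif tag in SKIP_TAGS:
            self._in_skip += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in SKIP_TAGS and self._in_skip:
            self._in_skip -= 1

    def handle_data(self, data):
        if self._in_skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += _visible_len(data)


def extract_blocks(html):
    """Return the textual blocks of an HTML string with their features."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


def is_content(block, max_link_density=0.5, min_chars=20):
    """Hand-written stand-in for the trained classifier (assumed thresholds)."""
    return block["link_density"] < max_link_density and block["chars"] >= min_chars


page = (
    '<div><a href="/a">Home</a> <a href="/b">News</a></div>'
    "<p>This is the main article text of the page.</p>"
)
content = [b["text"] for b in extract_blocks(page) if is_content(b)]
```

In a setup closer to the one the abstract describes, the feature dictionaries would be labeled by hand and used to train a decision tree, and the learned tree would replace `is_content`.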
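
The shingle-based de-duplication with a set-size pre-filter can be sketched as follows. This is one plausible reading of the abstract rather than the thesis's actual implementation: it assumes that similarity is measured as the Jaccard resemblance of two shingle sets, and it relies on the fact that this resemblance can never exceed the ratio of the smaller set size to the larger one, so pairs whose size ratio is already below the threshold can be skipped without computing the intersection. The function names, the shingle length `k=3`, and the threshold `0.8` are illustrative assumptions, and word segmentation and notional-word filtering are assumed to have been done upstream.

```python
def shingles(words, k=3):
    """Set of unique contiguous k-word subsequences of a word list."""
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_duplicate(a, b, threshold=0.8):
    """Decide whether two shingle sets are near-duplicates.

    The size-ratio check is an upper bound on the Jaccard resemblance,
    so a pair that fails it cannot reach the threshold and the more
    expensive set intersection is skipped.
    """
    small, large = sorted((len(a), len(b)))
    if large == 0:
        return True  # both documents are empty
    if small / large < threshold:
        return False  # pre-filter: resemblance <= small / large < threshold
    return resemblance(a, b) >= threshold


doc_a = shingles("a b c d e f".split())
doc_b = shingles("a b c d e f".split())
doc_c = shingles("a b c d e f g h i j k l".split())
doc_d = shingles("a b c x e f".split())

same = is_duplicate(doc_a, doc_b)       # identical documents
too_long = is_duplicate(doc_a, doc_c)   # rejected by the size-ratio pre-filter
different = is_duplicate(doc_a, doc_d)  # same size, low resemblance
```

Because the pre-filter only needs the two set sizes, documents could also be grouped by shingle count first, so that full comparisons are attempted only between documents of similar size.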

Keywords/Search Tags:

Content Extraction, Webpage De-duplication, Corpus, Web Data Mining

Related items

1	Webpage Text Extraction And Bilingual Website Detection Based On Multi-feature Fusion
2	Web-oriented Multilingual Parallel Sentence Pairs Mining Techniques
3	Research And Implementation Of Bilingual Corpus Mining On The Internet
4	Framework For Domain-oriented Webpage Content Extraction And Semantic Label Generation
5	Webpage Data Automatic Extraction Technology
6	Research On Web Filtering Method Of People Information
7	Research Of Data De-duplication Based On Mobile Terminals
8	Research On The Method Of Constructing Chinese And Vietnamese Comparable Corpus Based On
9	Design And Implementation Of Content-based Webpage Collection And Classification System
10	On The Design And Implementation Of Automatic Webpage Classification Algorithm