Research And Implementation Of Chinese Training Corpora Automatic Acquisition Method Based On Google Web API

Posted on:2009-06-11

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Zhang

Full Text:PDF

GTID:2178360245954995

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, online information resources have become an important source of knowledge in daily life, study, and work. How to process the vast amount of Web information has therefore become an important research subject in the field of information processing, and automatic classification of web pages is an effective approach. However, most researchers have drawn their conclusions from tests on training corpora they built themselves. The quality of a categorization algorithm's results depends on the training corpus, and improving the quality of the training corpus lets the classifier achieve better classification performance.

In this dissertation, the author focuses on methods for the automatic acquisition of training corpora. Although there has been little research on training corpora, the topic deserves study for the following reasons:

First, if the construction of training corpora, and in particular the acquisition of training samples, can be automated, corpora can be built faster, the workload is reduced, and the overall classification process is sped up.

Second, users can quickly build the training corpora they need, which reduces manual involvement in the entire classification process.

Finally, combining the resulting training corpus with existing classification algorithms that already perform well can further improve classification accuracy.

For the automatic acquisition of training corpora, the main contributions of this dissertation are as follows:

(1) In contrast to the traditional manual collection of training samples, a method based on the Google Web API is proposed for collecting training samples, so that samples can be collected quickly and the workload reduced.

(2) The traditional structure of the training corpus is reorganized. Categories in traditional training corpora are flat (parallel); this dissertation develops a hierarchical class structure and refines all categories of the training corpus as far as possible. That is, each parent category has one or more sub-categories, whose names are used as queries; the Google Web API is used to collect Web resources, which serve as samples to train the categories at every level.

(3) Analysis shows that relevant phrases can be used to collect more training samples. Applying the proposed methods repeatedly yields high-quality training samples, which further improves classification performance. Experiments show that classification accuracy can be improved by using a training corpus built in this way.

In short, the main subject of this dissertation is the automatic acquisition of training corpora. Finally, the shortcomings of the research are summarized and directions for future research are discussed.
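
The hierarchical collection step described in item (2) can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the category names, the `collect_samples` and `parent_samples` helpers, and the `fake_search` stand-in are all hypothetical, and the actual search call is injected as a function rather than tied to any real search API.

```python
from typing import Callable, Dict, List

# Hypothetical two-level hierarchy: parent category -> sub-category names.
HIERARCHY: Dict[str, List[str]] = {
    "Sports": ["Football", "Basketball"],
    "Finance": ["Stocks", "Banking"],
}

# A search function takes (query, max_results) and returns document texts.
SearchFn = Callable[[str, int], List[str]]


def collect_samples(
    hierarchy: Dict[str, List[str]], search: SearchFn, per_query: int = 10
) -> Dict[str, Dict[str, List[str]]]:
    """Use each sub-category name as a query and store the returned
    documents as that sub-category's training samples."""
    corpus: Dict[str, Dict[str, List[str]]] = {}
    for parent, children in hierarchy.items():
        corpus[parent] = {}
        for child in children:
            corpus[parent][child] = search(child, per_query)
    return corpus


def parent_samples(corpus: Dict[str, Dict[str, List[str]]], parent: str) -> List[str]:
    """Pool the samples of all sub-categories to obtain samples for the parent level."""
    return [doc for docs in corpus[parent].values() for doc in docs]


def fake_search(query: str, limit: int) -> List[str]:
    """Assumed stand-in for a real web search client; returns placeholder text."""
    return [f"{query} document {i}" for i in range(limit)]


corpus = collect_samples(HIERARCHY, fake_search, per_query=2)
print(parent_samples(corpus, "Sports"))
```

Passing the search function as a parameter keeps the traversal logic independent of any particular search service, so a real client could be substituted for `fake_search` without changing the rest of the sketch.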

Keywords/Search Tags:

Automatic Classification, Automatic Acquisition of Training Corpora, Google Web API

Related items

1	Automatic acquisition of lexical semantic knowledge from large corpora: The identification of semantically related words, markedness, polarity, and antonymy
2	Acquiring Commonsense Corpora From Large Scale Web Corpora
3	Study On The Theory & Practice Of Automatic Indexing Of WWW Science And Technology Information Resources
4	Research On Web Acquisition And Automatic Classification Of Massive Text Information
5	Study On Automatic Classification Method For Remotely Sensed Imagery By Incorporating Spatial-Spectral Features
6	Sentiment Classification Of Micro-blogs Corpus Based On Automatic Annotation Training Set
7	Automatic Content Labeling System For Broadcast News
8	Research On Key Technologies Of Automatic Classification And Assignment Of Software Bug Reports
9	Design And Development On Classification Training Corpus Management System
10	Research Of Automatic Targets Acquisition Based On Interacting Multiple Model