Research And Implementation Of Chinese Training Corpora Automatic Acquisition Method Based On Google Web API

Posted on: 2009-06-11
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Zhang
GTID: 2178360245954995
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, online information resources have become an important source of knowledge in daily life, study, and work. How to handle the vast amount of Web information has therefore become an important research subject in the field of information processing, and automatic classification of web pages is one effective approach. However, most researchers have drawn their conclusions from tests on training corpora they built themselves. The quality of a categorization algorithm depends on its training corpora, and improving the quality of the training corpora can give the classifier better classification performance.

This dissertation focuses on methods for the automatic acquisition of training corpora. Although there has been little research on training corpora, the topic deserves study for the following reasons:

Firstly, if the construction of training corpora, and in particular the acquisition of training samples, can be automated, corpora can be established faster, the workload is reduced, and classification is sped up.

Secondly, users can quickly establish the training corpora they need, which reduces manual participation in the entire classification process.

Finally, combining the established training corpus with existing classification algorithms that already classify well can further improve classification accuracy.

For the automatic acquisition of training corpora, the main work of this dissertation is as follows:

(1) In contrast to the traditional manual way of obtaining training samples, a method based on the Google Web API is proposed for collecting them. Training samples can thus be collected quickly and the workload reduced.

(2) The traditional structure of training corpora is reconstructed. Categories in traditional training corpora are parallel (flat); this dissertation develops a hierarchical class structure and, as far as possible, improves all categories of the training corpora. That is, each parent category has one or several sub-categories whose names are used as queries; the Google Web API is used to collect network resources, which then serve as samples for training categories at all levels.

(3) Analysis shows that relevant phrases can be used to collect more training samples. High-quality training samples can be obtained by applying the proposed methods repeatedly, which further improves classification performance. Experiments indicate that classification accuracy can be improved by using a training corpus established in this way.

In short, the main content of this dissertation is the automatic acquisition of training corpora. Finally, the shortcomings of the research are summarized and prospects for future research are discussed.
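
The hierarchical collection step in (2) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the `search` callable is a hypothetical stand-in for a Google Web API query, the names `collect_corpus` and `fake_search` are invented for illustration, and the idea that a parent category's samples are the union of its sub-categories' samples is an assumption about how "all levels of categories" are trained.

```python
from typing import Callable, Dict, List

# Hypothetical search interface (an assumption, not the Google Web API itself):
# takes a query string and a result limit, returns a list of document texts.
SearchFn = Callable[[str, int], List[str]]


def collect_corpus(
    hierarchy: Dict[str, List[str]],
    search: SearchFn,
    per_query: int = 10,
) -> Dict[str, List[str]]:
    """Collect training samples for every category in a two-level hierarchy.

    Each sub-category name is used as the search query. The parent category
    receives the union of its sub-categories' samples (assumed here).
    """
    corpus: Dict[str, List[str]] = {}
    for parent, children in hierarchy.items():
        parent_samples: List[str] = []
        for child in children:
            samples = search(child, per_query)
            corpus[f"{parent}/{child}"] = samples
            parent_samples.extend(samples)
        corpus[parent] = parent_samples
    return corpus


def fake_search(query: str, limit: int) -> List[str]:
    """Offline stand-in for a real search call, so the sketch runs as-is."""
    return [f"{query} document {i}" for i in range(limit)]


# Illustrative category names only; they do not come from the dissertation.
hierarchy = {"Sports": ["Football", "Basketball"], "Science": ["Physics"]}
corpus = collect_corpus(hierarchy, fake_search, per_query=2)
```

Passing the search function as a parameter keeps the collection logic independent of any particular search service, so a real API client could be substituted for `fake_search` without changing `collect_corpus`.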
Keywords/Search Tags: Automatic Classification, Automatic Acquisition of Training Corpora, Google Web API