Research Into Testing Method Of The Large-scale Corpus Segmentation Quality

Posted on: 2005-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: L P Song
GTID: 2168360122988669
Subject: Computer application technology
Abstract/Summary:
Building large-scale, high-quality corpora has become a top priority in natural language processing, as corpus analysis and automatic knowledge acquisition gain increasing recognition. However, there has been little research on methods for testing corpus quality. Existing methods for testing the segmentation of a large-scale corpus have two shortcomings: (a) it is difficult to estimate the population variance accurately; (b) the sample size required to test the segmentation is too large. To address these problems, we propose a testing method based on clustering, which sorts the corpus samples into groups by clustering them. The main work of this thesis has four parts: (a) analyzing the sampling methods for testing large-scale corpus segmentation and using the ISODATA clustering method to sort the corpus; (b) presenting the mode configuration of the corpus; (c) analyzing similarity measures for samples and adopting a new measure that takes into account both the distance between sample vectors and the linear correlation between the components of the sample vectors; (d) presenting an evaluation function for the result. The clustering method offers the following advantages: (a) the sample size needed to test the segmentation of a large-scale corpus can be reduced; (b) testing precision can be improved, and the population variance can be estimated more accurately.
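The link between grouping similar samples and needing fewer of them can be illustrated with the textbook stratified-sampling estimator, in which each cluster is treated as one stratum. The sketch below is not taken from the thesis, whose formulas are not given in this abstract. The function name `stratified_estimate`, the per-sample error rates, and the stratum sizes are all hypothetical, and the finite population correction is ignored for simplicity.

```python
from statistics import mean, variance


def stratified_estimate(samples_per_stratum, stratum_sizes):
    """Estimate a population mean and the variance of that estimate.

    samples_per_stratum: one list of sampled values (for example,
        per-sentence segmentation error rates) for each stratum.
        Each list needs at least two values.
    stratum_sizes: number of population units in each stratum.

    Assumption: strata are the clusters; the finite population
    correction is ignored.
    """
    total = sum(stratum_sizes)
    estimate = 0.0
    estimate_variance = 0.0
    for sample, size in zip(samples_per_stratum, stratum_sizes):
        weight = size / total
        # Weighted stratum mean contributes to the overall estimate.
        estimate += weight * mean(sample)
        # Within-stratum sample variance, scaled by weight^2 / n.
        estimate_variance += weight ** 2 * variance(sample) / len(sample)
    return estimate, estimate_variance


# Hypothetical data: two clusters with low within-cluster spread.
est, var = stratified_estimate([[0.00, 0.02], [0.10, 0.12]], [300, 100])
print(est, var)
```

Because only the within-stratum variances enter the result, strata whose members are similar to one another give a small variance for the estimate even when each stratum contributes few samples.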
Keywords/Search Tags: quality evaluation of the corpus segmentation, mode configuration of sample, hierarchical sampling, sample clustering, evaluation function