Research On Segmentation Consistency Checking Technology Of The Large-scale Chinese Corpus

Posted on: 2006-08-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
GTID: 2168360155956979
Subject: Computer application technology
Abstract/Summary:
In Chinese natural language processing, much research focuses on acquiring knowledge automatically from large-scale corpora of real text, so building large, high-quality corpora has become a primary task. At present, however, corpus construction still relies on manual checking, and unavoidable carelessness and mistakes cause the same word to be segmented inconsistently in the same linguistic context. These inconsistencies not only lower the precision of the corpus but also propagate errors to later processing steps that use it. Segmentation consistency must therefore be checked and corrected during corpus processing to guarantee corpus quality, and it is an important criterion for evaluating the quality of a segmented corpus.

This thesis addresses segmentation consistency in large-scale corpora. It first applies a rule-based consistency correction method and a support vector machine (SVM) based method to the test corpus separately, and then tests again with the two methods combined. The combined method automatically checks and corrects inconsistently segmented words in the corpus. The experiments reach the expected goal and show that the combined method deals with the segmentation consistency problem better. The main work is as follows:
1. Study and analyze the phenomena and causes of segmentation inconsistency in large-scale corpora, measure their proportions, and determine the research object of the thesis;
2. Propose a structured representation of corpus samples that uses the main factors affecting segmentation precision as the feature vector of each sample;
3. Extract segmentation examples from manually checked, correct corpus material, and compute the necessary experimental data from these examples, according to support vector...
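The kind of inconsistency described above, where the same character string is kept as one word in some sentences and split into several words in others, can be illustrated with a minimal sketch. This is not the thesis's rule-based or SVM method, whose rules and features the abstract does not give; it only collects candidate strings that a later checking step would have to judge in context. The function name `find_inconsistent_candidates`, the `max_parts` parameter, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def find_inconsistent_candidates(corpus, max_parts=3):
    """Return strings that occur both as one token and split across tokens.

    corpus: list of sentences, each a list of already segmented word tokens.
    max_parts: hypothetical limit on how many adjacent tokens are joined.
    Result maps each candidate string to (count_as_one_token, count_as_split).
    """
    # How often each string appears as a single token.
    merged = Counter(tok for sent in corpus for tok in sent)

    # How often the same string appears as a run of adjacent tokens.
    split = Counter()
    for sent in corpus:
        for n in range(2, max_parts + 1):
            for i in range(len(sent) - n + 1):
                joined = "".join(sent[i:i + n])
                if joined in merged:
                    split[joined] += 1

    return {s: (merged[s], split[s]) for s in split}


# Toy corpus (assumed data): "才能" is one word in the first sentence
# and two words, "才" and "能", in the second.
corpus = [
    ["他", "才能", "出众"],
    ["他", "才", "能", "来"],
]
candidates = find_inconsistent_candidates(corpus)
print(candidates)
```

A candidate found this way is not necessarily an error: both segmentations in the toy corpus can be correct in their own contexts, which is why the abstract stresses the context environment and a classifier rather than string matching alone.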
Keywords/Search Tags:Segmentation Consistency Check, Auto-collation, Support Vector Machine, Environment of the Context