A Study On The Inconsistent Word Segmentation Of Middle Ancient Chinese Corpus

Posted on:2017-07-04

Degree:Master

Type:Thesis

Country:China

Candidate:X Y Wang

GTID:2355330491956246

Subject:Chinese Philology

Abstract/Summary:

Corpora built from large-scale authentic text play an increasingly important role in linguistic and lexical research, and constructing large, high-quality corpora is foundational work. However, because segmentation criteria in current corpus construction are not unified or are missing altogether, and because carelessness is unavoidable during manual segmentation, tagging, and verification, the same word is inevitably segmented inconsistently in the same linguistic context. This problem is especially prominent in the construction of the Middle Ancient Chinese Corpus, where it not only lowers segmentation accuracy but also introduces errors into later processing and use of the corpus. Segmentation consistency should therefore be one of the key criteria for measuring corpus quality. This thesis first briefly introduces the Middle Ancient Chinese Corpus. Then, to address the word segmentation problem in that corpus, we identify inconsistently segmented words through programmatic statistics, classify them from a linguistic point of view, and develop segmentation criteria. We combine multiple features to improve the accuracy and consistency of the segmentation results. By counting the inconsistent words in the corpus and examining their contexts, we identify the specific causes of the inconsistency, formulate specific segmentation standards for these words, and then resolve the inconsistencies in the experimental corpus through manual proofreading. Using CRF software, we introduce a variety of segmentation features, in particular a dictionary feature, to implement multi-feature word segmentation, and we determine the best features and template through comparative experiments.
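As an illustration of the "program statistics" step, the following is a minimal sketch of one way to flag candidate inconsistencies: a string is flagged when it appears somewhere as a single word and elsewhere as a run of adjacent shorter words. This is an assumption about how such a check could work, not the program used in the thesis; the function name `find_inconsistency_candidates`, the `max_parts` parameter, and the two toy sentences are all hypothetical.

```python
from collections import defaultdict


def find_inconsistency_candidates(sentences, max_parts=3):
    """Map each string that occurs both as one word and as several
    adjacent words to the set of segmentations observed for it.

    sentences: list of already segmented sentences (lists of words).
    max_parts: largest number of adjacent words joined into one candidate.
    """
    # Every multi-character token that appears as a single word somewhere.
    whole_words = {w for sent in sentences for w in sent if len(w) > 1}

    candidates = defaultdict(set)
    for sent in sentences:
        for start in range(len(sent)):
            for n in range(2, max_parts + 1):
                parts = sent[start:start + n]
                if len(parts) < n:
                    break  # ran past the end of the sentence
                joined = "".join(parts)
                if joined in whole_words:
                    # Record both the split form and the whole-word form.
                    candidates[joined].add(tuple(parts))
                    candidates[joined].add((joined,))
    return dict(candidates)


# Toy data invented for illustration only.
toy_corpus = [
    ["如是", "我", "聞"],
    ["如", "是", "我", "聞"],
]
flagged = find_inconsistency_candidates(toy_corpus)
```

Candidates flagged this way would still need the manual check described in the abstract, since a string can legitimately be one word in one context and two words in another.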
The experiments met their goals: the closed test achieved more than 99% accuracy and the open test more than 90% accuracy, showing that a method based on a Middle Ancient Chinese segmentation standard and a combination of statistics and a dictionary can better solve the word segmentation problem. The main work is as follows:
1. We extracted more than 280,000 manually processed words from the Middle Ancient Chinese Corpus and used a purpose-written program to sort out the word blocks that might contain segmentation inconsistencies. We then classified these word blocks by manual checking, identified the genuine segmentation inconsistencies and their contexts, and thereby determined the research object of this thesis.
2. We analyzed the distinctive causes of inconsistency in Chinese word segmentation from a linguistic perspective, classified the inconsistencies by language structure, and explained why they occur. On this basis, we formulated specific segmentation standards for these words.
3. Following the standard, we revised the 280,000 manually processed words used as CRF training data, in order to resolve segmentation inconsistency and increase segmentation accuracy. During the experiments, we introduced character type, tone, vowel, initial consonant, and word-dictionary mark as four kinds of features in the CRF segmentation software and measured their contribution to the segmentation results experimentally. We then built different segmentation templates and selected the best one by experiment, which fixed the CRF segmentation template and features.
4. Using the prepared training corpus and the selected template and features, we ran a comparative experiment. The results met our expectations: the closed and open tests achieved more than 99% and 90% accuracy respectively.
We then analyzed the results. On the basis of statistics, we made a detailed study of segmentation inconsistency in Middle Ancient Chinese, formulated specific segmentation standards, and proposed a multi-feature CRF segmentation strategy that achieved good results. Both the open and closed tests on the Middle Ancient Chinese Corpus gave satisfactory results, indicating that this method can effectively improve the quality of word segmentation for Middle Ancient Chinese.
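To make the multi-feature idea concrete, here is a hedged sketch of how per-character training rows with a dictionary feature might be prepared for a CRF tagger. It assumes a B/M/E/S labelling scheme and a forward-maximum-matching dictionary mark; neither is confirmed by the abstract, which names neither the tag set nor the CRF toolkit. The tone, vowel, and initial-consonant features are left out because they would need a pronunciation table that is not given here. All function names, the punctuation set, and the toy lexicon are hypothetical.

```python
PUNCTUATION = set("，。！？；：、")


def bmes_tags(words):
    """Label each character as Begin, Middle, End of a word, or Single."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def dict_marks(text, lexicon, max_len=4):
    """Mark characters covered by a lexicon entry, using forward maximum
    matching; characters outside any match get "O"."""
    marks = []
    i = 0
    while i < len(text):
        # Try the longest window first, down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in lexicon:
                marks.extend(["B"] + ["M"] * (n - 2) + ["E"])
                i += n
                break
        else:
            marks.append("O")
            i += 1
    return marks


def char_type(ch):
    """Coarse character class: punctuation, CJK ideograph, or other."""
    if ch in PUNCTUATION:
        return "PUNC"
    if "\u4e00" <= ch <= "\u9fff":
        return "HAN"
    return "OTHER"


def feature_rows(words, lexicon):
    """Return (character, type, dictionary mark, label) for each character."""
    text = "".join(words)
    rows = zip(text, map(char_type, text), dict_marks(text, lexicon), bmes_tags(words))
    return list(rows)


# Toy sentence and lexicon invented for illustration only.
rows = feature_rows(["如是", "我", "聞", "。"], lexicon={"如是"})
```

Each row could then be written out as one line of whitespace-separated columns, with the label last, which is a layout several CRF toolkits read; the exact format and feature template depend on the toolkit chosen.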

Keywords/Search Tags:

Segmentation Inconsistency, Middle Ancient Chinese, CRF, Dictionary, Statistics

Related items

1	Experimental Study On The Fusion Of Dictionary Segmentation And Model Word Segmentation In Chinese
2	Ancient Chinese Character Image Segmentation Based On IVHFS And IDFA
3	Research On The Integrated Processing Technology Of Sentence Segmentation And Lexical Analysis Of Ancient Texts Based On Deep Learning
4	Research On Automatic Texts Segmentation And Word Segmentation For Ancient Chinese Texts
5	The Research On Tibetan Automatic Word Segmentation Technology
6	An Analysis Of New Words And Metaphors In The New Age Russian - Chinese Dictionary
7	The Research On The Use Of Chinese Dictionary By Overseas-students From The Middle-asia In Xinjiang
8	The Quantitative Analysis Of The Addition And The Deletion Of The 5th Edition Of "Modern Chinese Dictionary"
9	Research On Image Segmentation For Virtual Color Restoration Of Ancient Murals
10	A Contrastive Study Of The 5th And 6th Edition’s Homonyms Of The Modern Chinese Dictionary