A Study On Recognition And Extraction Method Of Contemporary Chinese Basic Vocabulary Based On Dynamic Circuit Corpus

Posted on: 2008-10-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X B Zhao
Full Text: PDF
GTID: 1115360215481081
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
From the perspective of a national language system, vocabulary is the essential carrier of linguistic information and the most active and vibrant element of the system. Without vocabulary, neither phonetics nor grammar can function. No language is static; it changes constantly in use, so every language is dynamic and living and can be regarded as a "language ecosystem". Changes in social life are always reflected in everyday vocabulary, which makes vocabulary the fastest-changing of the three elements of a language.

The words of the basic vocabulary are those used in daily life that are stable and resistant to change. They generally have a strong capacity for forming new words and serve as the basis for derivation. Basic vocabulary is the core of the whole vocabulary of a language and plays an important role in language teaching, dictionary compilation, and information processing. However, the three classical criteria for basic vocabulary ("general usage in a nation, solid status in history, strong capacity of forming new words") are broad and imprecise, so it is hard to delimit basic vocabulary accurately. Moreover, because progress in computational linguistics has been limited, most studies of basic vocabulary rely on examples given by linguists, and little quantitative research has been done so far.

To determine basic vocabulary quantitatively, the paper gives clear definitions of CBVE (Contemporary Chinese Basic Vocabulary for Language Engineering) and CCWE (Contemporary Chinese Common-used Words for Language Engineering), both drawn from the DCC (Dynamic Circulation Corpus).
On this basis, the task of this paper is to study a method for the automatic recognition and extraction of CBVE.

The object of study is the vocabulary appearing in six mainstream newspapers from 2002 to 2006: 'People's Daily', 'Beijing Youth Daily', 'Beijing Evening News', 'Legal Daily', 'Global Times', and 'Yangcheng Evening News'. All of these newspapers are drawn from the DCC at the National Language Resource Monitoring and Studying Center. The paper first puts forward a formula for calculating the general usage of words and then uses it to obtain CCWE. Within CCWE, the statistical features of the CBVE identified by linguists are observed. A model for automatic identification and extraction is then built with a genetic algorithm that operates on the feature vectors of basic vocabulary. The aim is for this automatic method of word recognition and extraction to provide a quantitative approach to research on Contemporary Chinese Basic Vocabulary for Language Engineering.

Main contents of this paper:

Pretreatment of words: The study materials are converted from HTML files to TXT files.

Field classification of texts: Texts are classified into fields so that the general usage of words can be calculated. Ten fields are defined, such as politics, economy, and education.

Feature description and automatic extraction of CCWE: Following the definition of CCWE, the features of general usage are observed first, and the calculation formula is then fixed to implement the automatic extraction of CCWE.

Construction of the a priori set of CBVE: With the help of an a priori set of basic vocabulary compiled by linguists, the features of the CBVE set are used to set the principles of the model for automatic identification and extraction of CBVE.

Choosing the feature vector of CBVE: The feature vector is fixed according to the statistical features of CBVE, such as being commonly used, stable, and productive.

Setting up the training set of CBVE: The a priori set of CBVE is divided into categories by clustering, and manual work is then added to produce the training set. This is the fundamental step in the paper.

Training the model of automatic identification and extraction with a genetic algorithm: On the training set, a genetic algorithm is used to train the parameters of the model for automatic identification and extraction of CBVE. If the resulting set of CBVE is stable, the training is considered successful.

Contrast analysis of automatic identification and extraction for CBVE: To measure the performance of the model, contrast analysis is necessary.

Analysis and research of CCWE and CBVE: Some related research on CCWE and CBVE is carried out.

The research innovations and major contributions of this paper are as follows:

The real usage of vocabulary in mainstream newspapers is observed in a large-scale dynamic corpus. The corpus studied includes 632,255 texts, a word count of 247,257,749, and 8,750,105 different Chinese words.

A new method is put forward for the first time to study the features of CBVE quantitatively. It is therefore an important step in moving the study of basic vocabulary from a qualitative approach to a quantitative one.

A new method for calculating the general usage of vocabulary is given, which provides a new measure for studying the features of vocabulary.

Drawing on methods from the field of pattern recognition, the paper applies a genetic algorithm to the model of automatic recognition and extraction for CBVE. Because the genetic algorithm offers a broad search space, fast convergence, and strong robustness when searching over feature vectors, the final results are reported to be very good.
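
The genetic-algorithm training step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's model: the abstract does not give the feature definitions, the classifier form, or the algorithm settings. The three features (commonness, stability, productivity, each scaled to [0, 1]), the weighted-sum classifier with a threshold, the toy training data, and the names predict, fitness, and evolve are all hypothetical.

```python
import random

# Hypothetical training set: (commonness, stability, productivity) -> label.
# Label 1 = basic vocabulary, 0 = not. All values are invented for illustration.
TRAINING_SET = [
    ((0.95, 0.90, 0.85), 1),
    ((0.90, 0.85, 0.70), 1),
    ((0.80, 0.95, 0.75), 1),
    ((0.85, 0.80, 0.90), 1),
    ((0.90, 0.30, 0.20), 0),
    ((0.40, 0.85, 0.15), 0),
    ((0.30, 0.20, 0.60), 0),
    ((0.20, 0.10, 0.10), 0),
]


def predict(params, features):
    """Label a word as basic (1) if its weighted feature sum reaches the threshold."""
    *weights, threshold = params
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if score >= threshold else 0


def fitness(params, data=TRAINING_SET):
    """Fraction of training words that the parameters classify correctly."""
    return sum(predict(params, x) == y for x, y in data) / len(data)


def evolve(data=TRAINING_SET, pop_size=30, generations=40,
           mutation_rate=0.2, seed=0):
    """Search for weights and a threshold with a small elitist genetic algorithm."""
    rng = random.Random(seed)
    n_genes = len(data[0][0]) + 1  # one weight per feature plus a threshold
    population = [[rng.random() for _ in range(n_genes)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation

    def score(individual):
        return fitness(individual, data)

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        elite = population[:2]  # elitism: keep the two fittest unchanged
        children = []
        while len(children) < pop_size - len(elite):
            # Tournament selection: best of three random individuals.
            parent_a = max(rng.sample(population, 3), key=score)
            parent_b = max(rng.sample(population, 3), key=score)
            # Single-point crossover.
            cut = rng.randrange(1, n_genes)
            child = parent_a[:cut] + parent_b[cut:]
            # Gaussian mutation, clipped to [0, 1].
            child = [
                min(1.0, max(0.0, g + rng.gauss(0.0, 0.1)))
                if rng.random() < mutation_rate else g
                for g in child
            ]
            children.append(child)
        population = elite + children

    best = max(population, key=score)
    return best, history
```

Calling evolve() returns the best parameter vector found and the best fitness recorded in each generation. Because the two fittest individuals are carried over unchanged, that recorded fitness never decreases from one generation to the next.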
Keywords/Search Tags: Contemporary Chinese Basic Vocabulary for Language Engineering (CBVE), Contemporary Chinese Common-used Words for Language Engineering (CCWE), Dynamic Circulation Corpus (DCC), Genetic Algorithm, Quantitative Research, Pattern Recognition