
A Study On Extraction Method Of Contemporary Chinese Common-Used Words For Language Engineering Based On Dynamic Circulating Corpus

Posted on: 2009-10-12
Degree: Master
Type: Thesis
Country: China
Candidate: C N Tang
Full Text: PDF
GTID: 2178360245451591
Subject: Computer application technology
Abstract/Summary:
Human society is moving from an industrial society to an information society, and natural language, which people use to communicate, is the main carrier of information. Natural language processing studies how to make computers understand human language and how to build suitable systems. Common vocabulary is used heavily throughout a national language system, whether in Chinese language teaching, dictionary compilation, or computer information processing, so a clear definition of Chinese common vocabulary has far-reaching significance. Within a given period, the common vocabulary is a set that is both relatively closed and open, and both dynamic and relatively stable. Neither traditional statistical methods nor the experience of linguists can give an accurate definition of common vocabulary. This thesis therefore applies computer technology to the problem, extracting common vocabulary automatically from a Dynamic Circulating Corpus (DCC).

Using corpus data to study language has become an inevitable trend and a necessary means in linguistic research. This thesis is based on a DCC built from mainstream newspapers in China. Dynamism and circulation are the essential characteristics of DCC. The "dynamic" aspect embodies a rule of language change: "the diachronic contains the synchronic", and "the synchronic contains the diachronic". In other words, the corpus can describe the language both at a single point in time and across different times. The "circulation" aspect is reflected in the newspapers themselves, which have many columns, span diverse fields, and give the corpus broad coverage.

Main contents of this thesis:

1. Classification of the original corpus. The author designs a procedure that divides the corpus into 10 categories according to the newspaper column; the classification results appear in Table 4-3.
2. Format conversion of the original corpus. The original corpus is in HTML/HTM format and is converted into XML files, each labeled with its field category, media source, year, and month. Useless Web formatting is removed during conversion so that only the meaningful content is retained.
3. Segmentation and storage in the database. The author segments the text files into words, organized by field category, media source, and year and month, and stores the resulting word units in a database for further processing. The database software used in the experiments is SQL Server 7.0.
4. Checking. Using a self-developed manual proofreading system (written in Java), the author checks and corrects the inevitable errors introduced by the preceding steps, making the results more rigorous and more accurate.
5. Vocabulary statistics. The "frequency", "usage", and "circulation" of each word are calculated for each month. The software used in the experiments is Microsoft Excel 2003.
6. Extraction of the common vocabulary. The author sorts the vocabulary in descending order of its yearly "common vocabulary usage Ok" and extracts the common vocabulary from the top of the list; the extracted words cover 85-95% of all the words in the corpus.
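The statistics and extraction steps (5 and 6) can be illustrated with a minimal Python sketch. It is not the thesis's implementation, which used SQL Server and Excel, and the abstract does not give the formulas for "usage" or "circulation". The sketch therefore assumes that circulation is the fraction of categories in which a word occurs and that usage is relative frequency multiplied by circulation. The function names, the toy corpus, and the coverage threshold are all hypothetical.

```python
from collections import Counter


def word_statistics(words_by_category):
    """Compute (frequency, circulation, usage) per word.

    words_by_category maps a category name to a list of segmented words.
    ASSUMPTION: circulation = share of categories containing the word,
    usage = relative frequency * circulation. The source abstract does
    not state these formulas.
    """
    freq = Counter()    # total occurrences of each word
    spread = Counter()  # number of categories each word appears in
    for words in words_by_category.values():
        freq.update(words)
        spread.update(set(words))
    total = sum(freq.values())
    n_categories = len(words_by_category)
    stats = {}
    for word, f in freq.items():
        circulation = spread[word] / n_categories
        usage = (f / total) * circulation  # assumed formula
        stats[word] = (f, circulation, usage)
    return stats, total


def extract_common(stats, total, coverage=0.9):
    """Take words in descending order of usage until they cover the
    requested share of all word occurrences in the corpus."""
    # Ties in usage are broken by the word itself to keep output stable.
    ranked = sorted(stats.items(), key=lambda kv: (-kv[1][2], kv[0]))
    common, covered = [], 0
    for word, (f, _circulation, _usage) in ranked:
        if covered / total >= coverage:
            break
        common.append(word)
        covered += f
    return common


# Toy corpus: three hypothetical newspaper categories, already segmented.
corpus = {
    "news": ["的", "中国", "的", "经济"],
    "sports": ["的", "比赛", "中国"],
    "culture": ["的", "电影"],
}
stats, total = word_statistics(corpus)
common = extract_common(stats, total, coverage=0.6)
```

On a real corpus the `coverage` argument would be set somewhere in the 85-95% range reported in the abstract; the lower value here only keeps the toy example from selecting every word.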
Keywords/Search Tags:Dynamic Circulation Corpus, Common vocabulary, Natural language processing, Java