
Design And Implementation Of Parser Based On Grammar Function And Collocation

Posted on: 2017-05-20  Degree: Master  Type: Thesis
Country: China  Candidate: S Yan  Full Text: PDF
GTID: 2335330518480087  Subject: Information Science
Abstract/Summary:
"Humanities Computing" is an interdisciplinary research field that brings modern information technology into the traditional humanities and social sciences. With the rise of the "Internet+" concept and the continuing advance of digitization, "Humanities Computing" has become an important research topic in information science, linguistics, and Chinese information processing. In recent years, many new projects have emerged in the humanities computing field, such as the digitization of ancient books and the construction of ancient Chinese literature corpus resources. Information processing research on modern Chinese has already reached the discourse (chapter) level, whereas research on ancient Chinese mostly remains at the stage of word processing. Exploring vocabulary-level knowledge of ancient Chinese is therefore of considerable practical significance for improving existing ancient Chinese information processing systems.

The ancient literature in this thesis refers to ancient Chinese literature of the Pre-Qin period. The study draws on corpora, humanities computing, statistical models, complex networks, and related areas to mine vocabulary-level knowledge from this literature. The overall goal of the thesis is to carry out knowledge mining research on Pre-Qin Chinese at the vocabulary level, integrating the ideas of digital humanities. Using a variety of methods from information science and linguistics, the results can on the one hand help people explore the history and laws of the Chinese language, and on the other hand support ancient Chinese information processing and serve information and knowledge discovery. The study is based on an ancient Chinese corpus constructed from 25 representative Pre-Qin texts. The main contents fall into three parts.

The first part covers the annotation and construction of the ancient Chinese literature corpus. It introduces the basic situation of the 25 Pre-Qin texts, then introduces the ancient Chinese corpus, ancient Chinese word segmentation, ancient Chinese POS tagging, and named entity recognition in turn. Finally, it selects part of the literature from the corpus, compiles simple word distribution statistics, and analyzes the regularities they reveal.

The second part studies ancient Chinese word segmentation and part-of-speech model training using a method that combines the CRFs model with ancient Chinese language rules. It first introduces the CRF model, how the ancient Chinese language material is preprocessed, and how the feature template is selected. It then designs two comparative experiments on the ancient Chinese corpus based on the CRFs model, one with closed testing and one with open testing, and carries out further sub-experiments on the basis of these two sets. The F values (the harmonic mean of precision and recall) are about 99% in the closed test and about 90% in the open test, so both experiments achieve relatively good results.

The third part uses complex networks to study the profile of ancient Chinese vocabulary. It first introduces the concept of a language network and the common statistical indicators, then describes three common language networks. In the experiments, part of the ancient literature is selected to verify whether ancient Chinese literature follows a Zipf distribution. Relevant texts are then selected to build an ancient Chinese vocabulary network with the Pajek software, and the common statistical indicators of that network are analyzed to judge whether it has the small-world property.
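As an illustration of the F value reported for the segmentation experiments, the following is a minimal sketch of how precision, recall, and their harmonic mean are commonly computed for word segmentation. It is not taken from the thesis: the function names `to_spans` and `segmentation_prf` and the toy input are hypothetical, and the thesis may have scored its output differently.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold, predicted):
    """Return (precision, recall, F) for one segmented sentence.

    A predicted word counts as correct only if both of its boundaries
    match a word in the gold segmentation.
    """
    gold_spans, pred_spans = to_spans(gold), to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    # F is the harmonic mean of precision and recall.
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Toy example with placeholder characters (not real corpus data).
gold = ["ab", "c", "de"]
predicted = ["ab", "c", "d", "e"]
precision, recall, f_value = segmentation_prf(gold, predicted)
print(precision, recall, f_value)
```

In this toy case two of the four predicted words match the three gold words, so precision is 2/4 and recall is 2/3.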
Keywords/Search Tags: Ancient Chinese corpus, Humanities Computing, Word Segmentation, Conditional Random Fields, Complex Network, Zipf's Law