Research On Syntactic Knowledge Mining And Extraction Based On English-chinese Parallel Corpus

Posted on: 2013-02-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D B Wang
GTID: 1318330482452375
Subject: Information Science
Abstract/Summary:
With the development of natural language processing and text mining technology, there is a growing trend toward mining and extracting knowledge from unstructured documents. Following this trend, this research mines and extracts syntactic knowledge of words, simple phrases, and complex phrases from an English-Chinese parallel corpus constructed from the Internet. The knowledge obtained benefits research in informatics, such as knowledge base construction, knowledge services, information retrieval, and informetrics, and also helps solve problems in NLP, including ambiguity resolution, knowledge extraction, machine translation, and computer-aided translation. Using a variety of models, algorithms, methods, and corpora, the dissertation mines and extracts the syntactic knowledge of words and of simple and complex phrases.

The introduction describes the background, significance, innovations, overall process, and framework of the dissertation, as well as the resources used in all the experiments. The literature review summarizes related research on mining and extracting the syntactic knowledge of words and of simple and complex phrases.

General-purpose and specialized English-Chinese corpora are obtained from websites, and a corresponding corpus database is constructed. This part addresses a series of research questions: how to select the websites to crawl, compile the crawling glossary, fetch the web pages with a crawling tool, extract the English-Chinese parallel corpus from the crawled websites, filter the corpus, and remove duplicate words or phrases.

At the word level, using research methods and knowledge from informatics, the Lotka phenomenon is detected in the syntactic function distribution complexity of English and Chinese words. Based on the English-Chinese parallel corpus, the Penn Treebank, and the Tsinghua Treebank, this part statistically analyzes the English-Chinese syntactic function distribution and the syntactic function distribution complexity of English and Chinese words. The complexity value is then calculated, which leads to the discovery of the Lotka phenomenon.

At the level of simple phrases, a model based on conditional random fields is constructed for extracting the structural knowledge of English-Chinese prepositional-object phrases, and the extraction process is presented. This part gives a statistical account of the internal and external syntactic features of prepositional-object phrases and presents the pre-processing format of their training data. It also describes the self-features template and addition-features template in detail and compares the model's performance with that of a maximum entropy model.

At the level of complex phrases, the dissertation uses a clustering algorithm to mine category knowledge from the specialized English-Chinese parallel corpus, constructing a mining model based on word and part-of-speech features. The experimental results verify the performance of English-Chinese word features in category knowledge mining, and the reasons for the differences in the mining results are given. The performance of part-of-speech features is then explored by applying word and part-of-speech features to three part-of-speech sets: nouns; nouns and verbs; and nouns, verbs, and adjectives.
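
As an illustration of the word-level analysis, the classic form of Lotka's law in informetrics is y(x) = C / x^n, where y(x) is the number of items observed with value x. The abstract does not say how words and complexity values are mapped onto x and y, so the mapping, the function name `fit_lotka`, and the toy data below are all assumptions. This is a minimal sketch of how such a power law might be fitted, not the dissertation's procedure.

```python
import math


def fit_lotka(counts):
    """Fit y = C / x**n by least squares on log y = log C - n * log x.

    counts maps a positive value x to the number of items y observed
    with that value (hypothetical input format). Returns (C, n).
    """
    points = [(math.log(x), math.log(y))
              for x, y in counts.items() if x > 0 and y > 0]
    if len(points) < 2:
        raise ValueError("need at least two positive data points")
    k = len(points)
    mean_x = sum(px for px, _ in points) / k
    mean_y = sum(py for _, py in points) / k
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Toy data generated from y = 1000 / x**2 (not taken from the dissertation).
toy = {x: 1000 / x ** 2 for x in range(1, 6)}
C, n = fit_lotka(toy)
```

An exponent n close to 2 corresponds to the inverse-square form Lotka originally reported; real data would usually also need a goodness-of-fit check before a Lotka-type distribution is claimed.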
Keywords/Search Tags:English-Chinese parallel corpus, Treebank, Word's syntactic function distribution complexity, Lotka phenomenon, Phrases of prepositional object, Conditional random fields, Clustering algorithm, Word and part-of-speech feature