Research On Automatic-identification Of Punctuated Phrases In Chinese Complex Sentences

Posted on: 2009-03-25  Degree: Master  Type: Thesis
Country: China  Candidate: X J Yu  Full Text: PDF
GTID: 2178360245957396  Subject: Computer software and theory
Abstract/Summary:
Corpus linguistics developed rapidly in the 1990s. Its central aim is to study and automatically process language information by computer, using large-scale corpora of authentic text. The practical value of a raw corpus is limited, however. If linguistic knowledge is added to the raw corpus in advance, a computer can automatically acquire more useful information from it, so deeply processed corpora have become a research hotspot.

Chinese information processing (CIP) comprises character processing, word processing, sentence processing, and discourse processing. So far, most research results concern characters and words. Because of the difficulty of Chinese, there has been comparatively little research on sentence processing, especially on complex sentences, so the field should move gradually from the character-processing and word-processing phase to the sentence-processing phase.

Chinese sentences are classified by structure as simple sentences or complex sentences. A complex sentence usually carries more meaning than a simple sentence and often expresses logical relations between people and things. It is typically made up of clauses, each of which is structurally a simple sentence. As a grammatical unit, the complex sentence links the simple sentence and the discourse. It also has grammatical, semantic, and pragmatic attributes, which makes its study all the more important. The study of hierarchical relations is, to some degree, the core of complex-sentence research. Before studying the hierarchical structure and relations in a complex sentence, we must first clarify the structure of its clauses and determine which language segments are genuine sentences and which are not.

Drawing on relevant linguistic theory, we first build an obvious-formal-tag database and a formal rule database based on part of speech to recognize punctuated phrases; the identification rate reaches 86.1% in an open test. Second, we use a statistical method to recognize punctuated phrases, with an identification rate of 81.7%. This method computes a credibility value, that is, the likelihood that a language segment is a punctuated phrase. Finally, we further investigate the use of cluster analysis to identify punctuated phrases; with this method the identification rate reaches 89.3%.
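The abstract does not say how credibility is computed. As a rough illustration only, the sketch below assumes it is estimated as a relative frequency: for each segment feature, the share of labeled training segments with that feature that are punctuated phrases. The feature (a tuple of part-of-speech tags), the function names, the toy data, and the 0.5 threshold are all hypothetical and are not taken from the thesis.

```python
from collections import defaultdict


def estimate_credibility(labeled_segments):
    """Estimate credibility per feature as a relative frequency.

    labeled_segments: iterable of (feature, is_punctuated_phrase) pairs.
    The feature here is a hypothetical tuple of part-of-speech tags.
    """
    # feature -> [count labeled as punctuated phrase, total count]
    counts = defaultdict(lambda: [0, 0])
    for feature, is_phrase in labeled_segments:
        counts[feature][1] += 1
        if is_phrase:
            counts[feature][0] += 1
    return {f: phrase / total for f, (phrase, total) in counts.items()}


def is_punctuated_phrase(feature, credibility, threshold=0.5, default=0.0):
    """Label a segment as a punctuated phrase if its credibility meets the threshold.

    Unseen features fall back to `default` (an assumption of this sketch).
    """
    return credibility.get(feature, default) >= threshold


# Toy labeled data (invented for illustration, not from the thesis corpus).
training = [
    (("n",), True),
    (("n",), True),
    (("n",), False),
    (("n", "v"), False),
    (("n", "v"), False),
    (("p", "n"), True),
]

credibility = estimate_credibility(training)
```

In this sketch a segment is accepted when its estimated credibility meets the threshold; a real system would need smoothing for rare features and richer features than a bare tag sequence.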
Keywords/Search Tags: language segment, punctuated phrase, obvious-formal-tag database, formal rule database, credibility, cluster analysis