Research On Automatic-identification Of Punctuated Phrases In Chinese Complex Sentences

Posted on:2009-03-25

Degree:Master

Type:Thesis

Country:China

Candidate:X J Yu

Full Text:PDF

GTID:2178360245957396

Subject:Computer software and theory

Abstract/Summary:

Corpus linguistics developed rapidly in the 1990s. Its core aim is to study and automatically process language information by computer, working directly on large-scale corpora of authentic text. The practical value of a raw corpus is limited, however. If linguistic knowledge is added to the raw corpus in advance, the computer can acquire more useful information automatically, so deep corpus processing has become a research hotspot.

Chinese information processing (CIP) comprises character processing, word processing, sentence processing and discourse processing. So far, most research achievements concern the character and word levels. Because of the difficulty of Chinese, research on sentence processing, and on complex sentences in particular, is comparatively scarce, so the field needs to move gradually from the character- and word-processing phase to the sentence-processing phase.

Chinese sentences are classified by structure as single sentences or complex sentences. A complex sentence usually carries more meaning than a single sentence and often expresses logical relations between persons and things. It is usually made up of clauses, each of which is structurally a single sentence. As a grammatical entity, the complex sentence links the single sentence and the discourse. It also has grammatical, semantic and pragmatic attributes, which makes its study all the more important. The analysis of hierarchical relations is, to some degree, the core of complex-sentence research. Before studying the hierarchical structure and relations of a complex sentence, we must first clarify the structure of its clauses and determine which language segments are genuine sentences and which are not.

Drawing on relevant linguistic theory, we first build an obvious-formal-tag database and a formal rule database based on part of speech to recognize punctuated phrases; the identification rate reaches 86.1% in an open test. Second, we use a statistical method to recognize punctuated phrases, with an identification rate of 81.7%. This method computes a credibility value, the likelihood that a language segment is a punctuated phrase. Finally, we investigate the use of cluster analysis to identify punctuated phrases, which brings the identification rate to 89.3%.
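The abstract does not give the credibility formula, so the following is only a minimal sketch of one way the statistical step could look. It assumes that credibility is estimated as the relative frequency with which segments sharing a part-of-speech pattern were labeled as punctuated phrases in training data. The delimiter set, the function names, the 0.5 threshold and the toy tags are all hypothetical and are not taken from the thesis.

```python
import re
from collections import Counter

# Hypothetical delimiter set: Chinese and ASCII commas, semicolons, colons.
SEGMENT_DELIMITERS = r"[，,；;：:]"

def split_segments(sentence):
    """Split a sentence into language segments at intra-sentence punctuation."""
    body = sentence.rstrip("。.!?！？")  # drop sentence-final punctuation
    return [s.strip() for s in re.split(SEGMENT_DELIMITERS, body) if s.strip()]

def train_credibility(labeled):
    """Estimate credibility per POS pattern (assumed definition).

    labeled: iterable of (pos_tag_tuple, is_punctuated_phrase) pairs.
    Returns {pos_tag_tuple: fraction of training segments with that
    pattern that were labeled as punctuated phrases}.
    """
    total = Counter()
    positive = Counter()
    for tags, is_phrase in labeled:
        total[tags] += 1
        if is_phrase:
            positive[tags] += 1
    return {tags: positive[tags] / total[tags] for tags in total}

def identify(tags, credibility, threshold=0.5, default=0.0):
    """Label a segment a punctuated phrase if its credibility meets the threshold.

    Unseen POS patterns fall back to `default` (a hypothetical choice).
    """
    return credibility.get(tuple(tags), default) >= threshold

def identification_rate(predicted, gold):
    """Share of segments whose predicted label matches the gold label."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Toy usage with made-up POS patterns and labels.
training = [(("n",), True), (("n",), True), (("n",), False), (("v", "n"), False)]
cred = train_credibility(training)
segments = split_segments("因为下雨，所以比赛取消。")
```

A per-pattern relative frequency is the simplest estimator that fits the description "the likelihood that a language segment is a punctuated phrase"; a real system would likely smooth the counts and combine several features.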

Keywords/Search Tags:

language segment, punctuated phrase, obvious-formal-tag database, formal rule database, credibility, cluster analysis

Related items

1	Research On Formal Analysis For Security Of Database Management System
2	Design and formal specification of a data model and language for a database system for CAD applications
3	Design And Parse Of Formal Specification Language For Security Protocols
4	Research And Application Of Formal Language In Report Systems
5	Study On The Automatic Testing Method For The Completeness Of Formal Specifications Based On SOFL
6	Formal Study On Key Issues In Classification Rule Mining Based On Formal Concept Analysis
7	Research On Web Database Extraction Based On Formal Concept Analysis
8	Research Of Rapid Knowledge Reduction Algorithm Of Information System Based On Formal Vector
9	Research On Formal Concept Analysis Methods And Applications Based On Topology Of Attribute Set
10	Formal characterizations for active database specification, verification and maintainability