Research On Chinese Phrase Annotation And Calculation Based On Multi-level Corpus

Posted on: 2013-03-03  Degree: Doctor  Type: Dissertation
Country: China  Candidate: J Xie  Full Text: PDF
GTID: 1318330482952382  Subject: Information Science
Abstract/Summary:
With the explosion of textual information, efficient information retrieval has become a real challenge. Researchers have developed several intelligent information retrieval methods that improve search efficiency through automatic storage, indexing and information extraction, including classification, knowledge representation, ontologies and text mining. Some of these methods rely on domain knowledge supplied by domain experts, while others are carried out by computer using artificial intelligence. The first kind requires a huge amount of manpower, and its search results usually become available only after a lag. The second kind can easily produce a huge quantity of search results, but with large-scale redundancy, so external factors of the text, such as references between different texts, are introduced to optimize the results.

In this paper, we propose a new solution from a linguistic point of view that combines Zhu Dexi's phrase-based theory of Mandarin Chinese, Abney's theory of chunks, Chen Xiaohe's theory of Grammar Function Matching and Feng Zhiwei's grammatical analysis of terms. Based on the grammar functions of Chinese phrases, we can obtain the interrelationships among terms in academic literature from a linguistic perspective. This relationship differs from word similarity because it exists at the grammatical level of the Chinese phrase. The relationship between keywords and terms in papers can thus be obtained not only through co-occurrence but also through grammatical links.

Our analysis of the grammar functions of keywords and terms supports Zhu Dexi's phrase-based theory of Mandarin Chinese. According to his theory, a one-to-one mapping between parts of speech and the grammar functions of Chinese phrases is impossible: a given part of speech can correspond to several grammar functions. Furthermore, Zhu Dexi proposes that the construction of Chinese sentences and of Chinese phrases can be treated as the same procedure. We test this assumption on academic literature, and the results show that the semantic role of a given word differs with its position in a Chinese phrase, according to the grammar function of that phrase. In Chinese information retrieval, the same word occurring in different phrases therefore carries different meanings. This change cannot be observed through word co-occurrence, only through grammatical function analysis. Keyword phrases of high similarity can be handled in the same way, by extracting the main constituents of the phrases from the point of view of grammar function. The relationships between large baseNP keyword phrases can be extracted according to grammatical function. In most cases these relationships are directed, unlike the undirected links between co-occurring keywords. The directed links form complex networks in which most links between pairs of nodes are directed.

We obtain the linguistic features of Chinese phrases by observing large-scale Chinese Treebank corpora. Phrase tags differ between corpora because their annotation schemes differ. From a bibliometric point of view, however, we find that the distribution of phrases in the different corpora yields similar curves that do not show hypodispersion. The key problem in the grammatical function analysis of phrases is how to identify them in textual data. Different algorithms give different results under different annotation schemes, and the same algorithm may also give different results on different phrase types. In this paper, we compare the results in three comparative tests. The results show that the CRF model performs better at identifying phrase sequences, both across phrase types and across annotation schemes. Finally, we use the phrase knowledge learned from the large-scale corpora, based on Mandarin Chinese phrases, as linguistic features for CRF learning. With it we can readily extract grammar-function phrases from the CSSCI textual data and analyze the grammatical links between keywords and terms.
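
The contrast drawn in the abstract between undirected co-occurrence links and directed grammatical links can be illustrated with a minimal sketch. It is not the dissertation's method: the sample keywords, the (modifier, head) pairs and the function names `cooccurrence_edges` and `grammatical_edges` are all hypothetical, and the phrase analysis is assumed to be given rather than computed by a CRF model.

```python
from itertools import combinations


def cooccurrence_edges(keyword_lists):
    """Undirected edges: every unordered pair of keywords sharing a paper."""
    edges = set()
    for keywords in keyword_lists:
        for a, b in combinations(sorted(set(keywords)), 2):
            edges.add(frozenset((a, b)))
    return edges


def grammatical_edges(analysed_phrases):
    """Directed edges modifier -> head, read off an assumed phrase analysis."""
    return {(modifier, head) for modifier, head in analysed_phrases}


# Hypothetical keyword lists, one per paper.
keyword_lists = [
    ["conditional random field", "phrase identification"],
    ["phrase identification", "corpus"],
]

# Hypothetical (modifier, head) pairs, assumed to come from a prior
# grammar-function analysis of the keyword phrases.
analysed_phrases = [
    ("conditional random field", "phrase identification"),
    ("corpus", "phrase identification"),
]

undirected = cooccurrence_edges(keyword_lists)
directed = grammatical_edges(analysed_phrases)

for modifier, head in sorted(directed):
    print(f"{modifier} -> {head}")
```

In this toy data both graphs connect the same keyword pairs, but only the directed graph records which phrase modifies which, information that co-occurrence counts alone do not carry.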
Keywords/Search Tags:multi-level corpus, phrase-based theory, grammar function, chunk, conditional random field, phrase distribution, phrase identification, machine learning, CSSCI, term phrase annotation