
Research On Information Metric For Text

Posted on: 2014-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Bu
Full Text: PDF
GTID: 1268330422960353
Subject: Computer Science and Technology
Abstract/Summary:
A metric characterizes the relationship between objects. In natural language processing (NLP), research on information metrics for different linguistic units has essential research value and a wide range of applications.

Recently, the rapid development of Web 2.0 has posed great challenges to natural language processing. Classical NLP information metrics cannot handle complex, dynamic internet data and informally written web texts. For example, lexical similarity metrics based on a local dictionary are not suitable for processing new words that emerge on the internet. Sentence-level similarity metrics based on syntactic trees are not well suited to measuring the similarity between user queries and document titles, especially in Chinese. Moreover, classical metrics based on link analysis cannot make full use of the structural features of social collaborative data.

To deal with these challenges, we proposed new information metrics on four different information objects, listed as follows.

On the phrase level, we proposed a non-compositionality metric for n-grams, which is based on information distance and has a solid theoretical background. It can be used to measure the non-compositionality of a given n-gram (under certain contexts). Since this metric is approximately computed from frequency counts on the internet, it is robust and widely applicable, and it can be used for post-processing in question answering and for complex named entity recognition.

On the concept level, we proposed a new algorithm for measuring the semantic relatedness between concepts in a social collaborative encyclopedia (e.g. Wikipedia). Unlike classical metrics based on link analysis, our method took full advantage of the structural features of the encyclopedia. It can not only measure relatedness but also interpret the relatedness by using categories.

On the sentence level, we proposed a question similarity metric based on pattern sets. To utilize both function words and content words in questions, we built hard patterns on the former and soft patterns on the latter. The metric can model long-range dependencies between words without using syntactic trees and can be applied to question classification.

On the sentence relation level, we proposed a sentence relation similarity metric based on a kernel method, which maps sentence pairs onto a space of re-writing rules and uses the inner product on this space to represent similarity. The method can capture structural similarity between sentence pairs without using syntactic analysis tools and still achieves state-of-the-art accuracy on paraphrase identification and recognizing textual entailment.
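The abstract describes the phrase-level metric as an information distance approximated from frequency counts on the internet. As a rough illustration of that general idea only, the sketch below computes the well-known normalized (Google-style) distance between two terms from occurrence counts. It is not the dissertation's non-compositionality metric, whose definition is not given here. The function name `normalized_count_distance` and all counts in the example are hypothetical.

```python
import math


def normalized_count_distance(fx: float, fy: float, fxy: float, n: float) -> float:
    """Approximate a normalized information distance from frequency counts.

    fx, fy : number of pages (or documents) containing term x, term y
    fxy    : number of pages containing both terms
    n      : total number of pages indexed (assumed larger than fx and fy)

    This is the standard normalized-distance formula based on counts,
    shown only as an illustration; it is NOT the thesis's own metric.
    """
    if min(fx, fy, fxy) <= 0:
        # Terms that never co-occur are treated as infinitely far apart.
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Hypothetical counts, invented for illustration only.
d_same = normalized_count_distance(100, 100, 100, 10_000)      # terms always co-occur
d_far = normalized_count_distance(1_000, 100, 10, 1_000_000)   # terms rarely co-occur
```

Smaller values indicate that the two terms tend to appear together; counts of this kind would in practice come from a search engine or a large corpus index.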
Keywords/Search Tags: Natural Language Processing, Information Metric, Information Distance, Kernel Method