Research On Information Metric For Text

Posted on:2014-07-16

Degree:Doctor

Type:Dissertation

Country:China

Candidate:F Bu

Full Text:PDF

GTID:1268330422960353

Subject:Computer Science and Technology

Abstract/Summary:

Metrics characterize the relationships between objects. In natural language processing (NLP), research on information metrics for different linguistic units has essential research value and wide application.

Recently, the rapid development of Web 2.0 has posed great challenges to natural language processing. Classical NLP information metrics cannot handle complex, dynamic internet data and informally written web texts. For example, lexical similarity metrics based on local dictionaries are not suited to processing new words that emerge on the internet. Sentence-level similarity metrics based on syntactic trees are not reliable for measuring the similarity between user queries and document titles, especially in Chinese. Moreover, classical metrics based on link analysis cannot make full use of the structural features of social collaborative data.

To deal with these challenges, we proposed new information metrics for four different information objects, listed as follows.

On the phrase level, we proposed a non-compositionality metric for n-grams, based on information distance, which has a solid theoretical background. It can be used to measure the non-compositionality of a given n-gram (under certain contexts). Because this metric is approximately computed from frequency counts on the internet, it is robust and widely applicable, and can be used for post-processing in question answering and complex named entity recognition.

On the concept level, we proposed a new algorithm for measuring the semantic relatedness between concepts in a social collaborative encyclopedia (e.g., Wikipedia). Unlike classical metrics based on link analysis, our method takes full advantage of the structural features of the encyclopedia. It can not only measure relatedness but also interpret it by using categories.

On the sentence level, we proposed a question similarity metric based on pattern sets. To utilize function words and content words in questions, we built hard patterns and soft patterns on them, respectively. The metric can model long-range dependencies between words without using syntactic trees and can be applied to question classification.

On the sentence relation level, we proposed a sentence relation similarity metric based on a kernel method, which maps sentence pairs onto a space of rewriting rules and uses the inner product in this space to represent similarity. The method can capture structural similarity between sentence pairs without using syntactic analysis tools and still achieves state-of-the-art accuracy on paraphrase identification and recognizing textual entailment.
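
The abstract does not give the formula behind the phrase-level metric. As a rough illustration of how an information distance can be approximated from frequency counts, the sketch below computes the normalized Google distance, a well-known approximation of this kind. It is not the thesis's non-compositionality metric, and the function name, parameter names, and counts are hypothetical.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Approximate an information distance between two terms from counts.

    fx, fy : number of pages containing each term (hypothetical counts)
    fxy    : number of pages containing both terms
    n      : total number of indexed pages (an assumed normalizer)
    """
    # With no occurrences or no co-occurrence, treat the terms as maximally distant.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Terms that always co-occur are at distance 0; rarer co-occurrence
# pushes the distance up.
always_together = normalized_google_distance(1000, 1000, 1000, 10**6)
rarely_together = normalized_google_distance(1000, 1000, 10, 10**6)
```

A smaller value means the two terms tend to appear together. The sketch shows only the count-based approximation of the distance; how such a distance is turned into a non-compositionality score for an n-gram is not specified in the abstract.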

Keywords/Search Tags:

Natural Language Processing, Information Metric, Information Distance, Kernel Method

Related items

1	Research On Matrix-based 2D Distance Metric Learning And Spatial Euler Kernel With Applications
2	Research And Implementation Of Natural Language Information Hiding Algorithm Based On Abstract Embedding Unit
3	Research On The Method Of Prediction Of Audit Suspects Based On Natural Language Processing Technology Under The Background Of Informationization
4	Research On High Risk Information Processing Module Of Internet Public Opinion Based On Natural Language Processing
5	Research On Machine Learning For Natural Language Processing And Transmission
6	Natural Language Processing Aiming To The Core Texts Of Scientific Literature
7	Research And Application Of Natural Language Processing In Information Retrieval
8	Research On NLP Technologies And Application In Chinese Information Processing
9	Research On Natural Language Watermarking Based On Syntactic Transformations
10	Narrative Information Extraction with Non-Linear Natural Language Processing Pipeline