Research On Language Modeling Based Sentence Retrieval

Posted on: 2008-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: L Q Gao
Full Text: PDF
GTID: 2178360245997723
Subject: Computer Science and Technology
Abstract/Summary:
Information Retrieval (IR) has been one of the fastest-growing research fields in Computer Science. Sentence Retrieval (SR), treated as a special case of IR, is the technology of searching for and ranking sentences on the basis of detailed information. SR is widely used in Question Answering (QA), Document Summarization (DS), Machine Translation (MT), and other research fields. The research task originated from a writing-assistant system named the HIT IR Lab English-Chinese Bilingual Phrase Retrieving System (HIT IR Lab ECBPRS). This thesis presents the research issues related to this task. One of the difficulties that SR faces is the lack of sufficient word and context information. Sentences, which usually consist of no more than thirty words, carry little context compared with the documents in Text Retrieval. As a result, the Word Mismatching Problem between queries and sentences is more significant in SR. Making good use of the limited information available, such as word order and sentence structure, is a practical way to address problems like word mismatching in SR.

The research approaches the problem from both the query perspective and the sentence perspective. Queries express users' information needs. Ill-formed queries, which contain spelling mistakes, misused synonyms, or incorrect word forms, often cause retrieval to fail in SR because of the word mismatching they introduce. The Query Modification Model is proposed to correct input queries by performing several kinds of modifying operations on them. This differs from traditional methods such as query expansion and query term deletion, and it helps boost the precision of SR.

Before turning to sentence retrieval, the thesis proposes a language model based on word sense representation. It incorporates word senses into the statistical language models for IR, which helps match queries with documents that use synonyms.

In the SR part, a linear discriminative model (LDM) is used as the basic ranking function of the sentence retrieval model. The unigram model, distance-based methods, and the Word Sense Language Model are all combined into the LDM as feature functions. The LDM is superior to traditional generative methods such as HMM because it assumes less about the distribution of queries and documents and can be optimized with MAP. In addition, user feedback can change the LDM, which makes it suitable for personalized SR. The last part gives more details about the infrastructure and implementation of HIT IR Lab ECBPRS.

In short, research on problems in Sentence Retrieval is practical and meaningful, especially when based on language modeling. We believe that SR will play an important role in promoting the connection between IR and NLP in many applications.
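
The idea of ranking sentences with a linear combination of feature functions can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the Dirichlet-style smoothing, the add-one floor, the window size, and the hand-set weights are all assumptions made for illustration, and the ordered-pair feature is only a simple stand-in for the distance-based methods mentioned above. The word sense feature and the weight training step are omitted.

```python
import math
from collections import Counter


def make_unigram_feature(collection, mu=10.0):
    """Build a smoothed unigram query log-likelihood feature.

    The smoothing scheme (Dirichlet prior with an add-one floor on the
    collection model) is an assumed choice, not taken from the thesis.
    """
    coll = Counter(word for sentence in collection for word in sentence)
    total = sum(coll.values())
    vocab = len(coll)

    def feature(query, sentence):
        tf = Counter(sentence)
        score = 0.0
        for term in query:
            # Add-one floor keeps the probability positive for unseen terms.
            p_coll = (coll[term] + 1) / (total + vocab + 1)
            score += math.log((tf[term] + mu * p_coll) / (len(sentence) + mu))
        return score

    return feature


def ordered_pair_feature(query, sentence, window=3):
    """Hypothetical distance-based feature.

    Counts adjacent query term pairs that occur in the sentence in the
    same order, with the second term at most `window` words after the first.
    """
    hits = 0
    for first, second in zip(query, query[1:]):
        for i, word in enumerate(sentence):
            if word == first and second in sentence[i + 1 : i + 1 + window]:
                hits += 1
                break
    return float(hits)


def ldm_score(query, sentence, weights, features):
    """Linear model: weighted sum of feature function values."""
    return sum(w * f(query, sentence) for w, f in zip(weights, features))


def rank(query, sentences, weights, features):
    """Return the sentences sorted by descending linear score."""
    return sorted(
        sentences,
        key=lambda s: ldm_score(query, s, weights, features),
        reverse=True,
    )


# Toy data, invented for this example.
sentences = [
    "play an important role in retrieval".split(),
    "an important play in role".split(),
    "the weather is nice".split(),
]
query = "play an important role".split()

features = [make_unigram_feature(sentences), ordered_pair_feature]
weights = [1.0, 1.0]  # hand-set here; in practice they would be learned
ranked = rank(query, sentences, weights, features)
for sentence in ranked:
    print(" ".join(sentence))
```

Because the score is a weighted sum, a new feature function can be added without changing the ranking code, and adjusting the weights (for example from user feedback) changes the ranking directly.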
Keywords/Search Tags: sentence retrieval, word mismatching, word sense language model, query substitution, linear discriminative model