
Chinese Word Semantic Similarity Measure And Its Application In Cross-language Information Retrieval

Posted on: 2011-10-18
Degree: Master
Type: Thesis
Country: China
Candidate: L Peng
Full Text: PDF
GTID: 2208360305497946
Subject: Computer application technology
Abstract/Summary:
As an important task in the field of Natural Language Processing (NLP), Word Lexical Semantic Similarity Measurement (WLSSM) has long been a focus of study. Semantic similarity is itself an intermediate task: it is an indispensable layer of most NLP tasks and is widely used in word sense disambiguation, information retrieval and machine translation.
This thesis focuses on Chinese WLSSM algorithms and their application in Cross Language Information Retrieval (CLIR). It first reviews existing semantic similarity algorithms and then concentrates on a HowNet-based Chinese WLSSM pattern. The pattern uses the grammatical rules of the Knowledge Database Mark-up Language (KDML) to divide the similarity computation into three parts, calculates each part with the maximum matching algorithm, and adds the depth information of sememes to distinguish the different information content of sememes. Compared with some classic measures, the proposed method uses the organizational structure of HowNet to extract abundant semantic information and optimizes the sememe similarity algorithm, so that word pairs can be separated into different levels of semantic similarity. The experimental results are more consistent with human subjective judgments.
In addition, we attempt to apply the Chinese WLSSM pattern in CLIR, in the following aspects:
1. Query Translation: using semantic similarity to disambiguate keywords for the purpose of better translation.
2. Query Expansion: aiming for high relevance between the original and expanded queries, for better recall and precision.
This thesis also proposes some evaluation methods. The SENSEVAL-3 corpus is used to evaluate the performance of word sense disambiguation, and the TREC-9 CLIR corpus and topics are used to evaluate the performance of CLIR. The experimental data and results are relatively impartial, objective and comparable.
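The sememe-level computation described above can be illustrated with a small sketch. It is not the thesis's actual formula: the toy hierarchy `PARENT` is hypothetical rather than taken from HowNet, `sememe_sim` assumes a Wu-Palmer-style depth measure as a stand-in for the depth-aware sememe similarity, and `set_sim` uses a greedy best-first pairing as a simple approximation of maximum matching between two sememe sets.

```python
# Toy sememe hierarchy (child -> parent); hypothetical, not taken from HowNet.
PARENT = {
    "entity": None,
    "thing": "entity",
    "animate": "thing",
    "human": "animate",
    "animal": "animate",
    "artifact": "thing",
}


def ancestors(s):
    """Return the path from sememe s up to the root, s first."""
    path = []
    while s is not None:
        path.append(s)
        s = PARENT[s]
    return path


def depth(s):
    """Depth of s, counting the root as depth 1."""
    return len(ancestors(s))


def sememe_sim(a, b):
    """Assumed Wu-Palmer-style measure: 2 * depth(lcs) / (depth(a) + depth(b))."""
    on_path_a = set(ancestors(a))
    # Walking up from b, the first node also on a's path is the lowest common ancestor.
    lcs = next(s for s in ancestors(b) if s in on_path_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def set_sim(xs, ys):
    """Greedy best-first pairing of two sememe sets; unmatched sememes score 0."""
    if not xs or not ys:
        return 0.0
    pairs = sorted(((sememe_sim(x, y), x, y) for x in xs for y in ys), reverse=True)
    used_x, used_y, total = set(), set(), 0.0
    for score, x, y in pairs:
        if x not in used_x and y not in used_y:
            used_x.add(x)
            used_y.add(y)
            total += score
    return total / max(len(xs), len(ys))


print(sememe_sim("human", "animal"))      # deeper siblings
print(sememe_sim("animate", "artifact"))  # shallower siblings
```

In this toy tree the deeper sibling pair scores higher than the shallower one, which is the kind of distinction depth information is meant to provide.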
Besides, the original CLIR system is adopted and updated into a unified CLIR platform, which conveniently integrates all kinds of relevant algorithms in one system. The system design is modular and extensible.
In summary, this thesis proposes a new HowNet-based Chinese WLSSM pattern, based on a comprehensive analysis of mainstream semantic similarity algorithms, and introduces WLSSM into a CLIR system as a tentative application, in the hope of providing a reference for researchers in related fields.
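As a rough illustration of how a similarity measure could drive query translation, the sketch below picks, for each query term, the translation candidate that is most similar to the candidates of the other terms. Everything here is an assumption for illustration: the function `pick_translations`, the English-to-Chinese direction, the example terms, and the scores in `TOY_SIM` are hypothetical and are not taken from the thesis.

```python
# Hypothetical similarity scores between Chinese words, for illustration only.
TOY_SIM = {
    frozenset(["银行", "贷款"]): 0.6,  # bank (financial) / loan
    frozenset(["河岸", "贷款"]): 0.1,  # river bank / loan
}


def toy_sim(a, b):
    """Look up a symmetric toy similarity; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset([a, b]), 0.0)


def pick_translations(candidates, sim):
    """Choose, per query term, the candidate best supported by the other terms.

    candidates maps each source term to a non-empty list of translations.
    """
    chosen = {}
    for term, options in candidates.items():
        def support(option):
            # Sum, over every other term, the best similarity to any of its candidates.
            return sum(
                max(sim(option, other) for other in others)
                for t, others in candidates.items()
                if t != term
            )
        chosen[term] = max(options, key=support)
    return chosen


query = {"bank": ["银行", "河岸"], "loan": ["贷款"]}
print(pick_translations(query, toy_sim))
```

Passing the similarity function as a parameter keeps the selection logic independent of how word similarity is computed, so a HowNet-based measure could be substituted for `toy_sim`.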
Keywords/Search Tags:Chinese Word Lexical Semantic Similarity, Cross language Information Retrieval (CLIR), HowNet, Word Sense Disambiguation (WSD), Query Expansion