
Research On Lexical Semantic Similarity Measurement Based On Knowledge Integration

Posted on: 2017-04-18    Degree: Doctor    Type: Dissertation
Country: China    Candidate: Y Y Cai    Full Text: PDF
GTID: 1108330491451543    Subject: Software engineering
Abstract/Summary:
With the advent of the big data era, the huge mass of textual data provides valuable information but also poses many tough challenges. Words are the basic units of text, so lexical semantic similarity measurement plays an important role in mining associations between words and in enabling computers to understand sentences and documents accurately. In terms of the lexical semantic resources used, measurement methods for semantic similarity are mainly classified into knowledge-base-based and corpus-based. Knowledge bases provide lexical semantic descriptions and structured information; however, their construction and maintenance depend heavily on human expertise, and they suffer from low lexical coverage and poor extensibility. A corpus, by contrast, commonly contains a copious vocabulary, but it is hard to distill from it exactly the semantic features needed to represent words. To overcome the limitations of using a single resource for semantic similarity measurement, this dissertation focuses on the graphical structure of WordNet and low-dimensional word vector representations, and studies how to integrate the semantic knowledge derived from knowledge bases and corpora, in terms of an IC computational model, semantic-augmented word vectors, and the combinational optimization of measurement methods. The main contributions of this dissertation are as follows:

(1) It presents a concept semantic similarity measurement based on IC-weighted shortest path (CSSM-ICSP), which nonlinearly transfers the path distance between concepts into a semantic similarity score. The method exploits various structural properties of concepts, such as edge length, depth, and density, as well as the information content (IC) of concepts. Firstly, we build the Intrinsic IC Hybrid (IIH) model, which smooths concept density by a depth-related nonlinear function to remedy the neglect of concept depth in traditional IC computational models. Secondly, each edge between concepts is weighted by the difference between the IC values of its endpoints, which reflects the non-uniform strength of semantic relationships between concepts at different depths of the hierarchy. We then combine the IC-weighted path distance, depth overlap, and normalized path distance into a new computational model of path distance. In addition, we introduce a hybrid computation of intrinsic and statistical IC values into the similarity measurement, which realizes the integration of semantic knowledge from WordNet and a corpus; a minimal sketch of the idea follows this item. Experiments conducted on the public benchmark datasets M&C, R&G, WS-353, and WS-sim show that, compared against other WordNet-based measurement methods, the proposed method achieves a higher Pearson coefficient.
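The abstract does not give the IIH formula or the exact distance-to-similarity transfer function, so the following Python sketch substitutes the well-known Seco-style intrinsic IC and an exponential decay as stand-ins; the NLTK WordNet interface, the alpha parameter, and the helper names are illustrative assumptions, not the dissertation's implementation.

    # Hedged sketch of CSSM-ICSP's core idea (Python + NLTK WordNet).
    # Stand-ins: Seco-style intrinsic IC instead of the IIH model, and an
    # exponential decay as the nonlinear distance-to-similarity transfer.
    import math
    from nltk.corpus import wordnet as wn

    N_NOUNS = len(list(wn.all_synsets('n')))   # size of the noun taxonomy

    def intrinsic_ic(syn):
        # Seco et al.: concepts with fewer hyponyms are more informative.
        hypo = len(list(syn.closure(lambda s: s.hyponyms())))
        return 1.0 - math.log(hypo + 1) / math.log(N_NOUNS)

    def ic_weighted_length(syn, ancestor):
        # IC-weighted length of the cheapest upward path syn -> ancestor:
        # each edge costs the IC difference between its endpoints.
        best = float('inf')
        for path in syn.hypernym_paths():      # each path: root ... syn
            if ancestor in path:
                seg = path[path.index(ancestor):]
                cost = sum(abs(intrinsic_ic(a) - intrinsic_ic(b))
                           for a, b in zip(seg, seg[1:]))
                best = min(best, cost)
        return best

    def similarity(word1, word2, alpha=1.0):   # alpha: assumed decay rate
        s1 = wn.synsets(word1, 'n')[0]
        s2 = wn.synsets(word2, 'n')[0]
        lcs = s1.lowest_common_hypernyms(s2)[0]
        dist = ic_weighted_length(s1, lcs) + ic_weighted_length(s2, lcs)
        return math.exp(-alpha * dist)         # nonlinear transfer to (0, 1]

    print(similarity('car', 'automobile'))     # same synset -> 1.0
    print(similarity('car', 'forest'))         # distant concepts -> lower

Because edges are weighted by IC differences rather than counted uniformly, a hop between shallow, generic concepts costs more than a hop between deep, specific ones, which is the non-uniform edge strength the method exploits.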
(2) It presents a word semantic similarity measurement based on multiple semantic fusion (WSSM-MSF), which improves vector-space-based measurement by means of an effective lexical semantic representation. Considering that the semantic content of a document can be represented by composing the vectors of its sentences, phrases, or words, we build a multiple semantic fusion (MSF) model based on algebraic operations over the vectors of multiple semantic properties in WordNet, including synsets, glosses, hypernyms, and hyponyms. In this way, the MSF model generates concept vectors and semantic-augmented word vectors, and implements the integration of heterogeneous knowledge on the basis of semantic features. To avoid the problems of data sparsity and high-dimensional features, we use the neural-network-based continuous bag-of-words (CBOW) model to learn low-dimensional, dense, real-valued word embeddings from a large-scale corpus. Experimental results show that the semantic-augmented word vectors improve the expressive capability of the original word vectors and yield performance improvements in both word similarity evaluation and semantics-oriented Web service matching, in terms of accuracy, precision, and recall.

(3) It presents a word semantic similarity measurement based on differential evolution (WSSM-DE), which treats the optimal combination of various measurement methods as a stochastic optimization process in a solution space. This method represents each dimension of an individual in the DE algorithm as either a WordNet-based computational method or a computational method based on low-dimensional word vectors. The optimal weighting of each dimension and the optimum solution are then produced by a heuristic global search driven by individual differences, which realizes the integration of semantic knowledge from WordNet and a corpus; a minimal sketch follows this item. We also analyze the spaces in which the word vectors are distributed on the basis of changes in the weighting value of each dimension. According to the experimental results on word similarity evaluation, the proposed method outperforms methods based on supervised learning algorithms, including learning to rank (LTR) and regression, and it also improves the accuracy of measurement over any single type of semantic resource. In particular, combining the proposed method with semantic-augmented word vectors yields a significant improvement in measurement accuracy.
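The abstract does not state the fitness function or the DE settings, so the following Python sketch uses SciPy's stock differential evolution with Spearman correlation against gold ratings as the objective; the array shapes, the measure columns, and the random placeholder data are illustrative assumptions.

    # Hedged sketch of WSSM-DE: differential evolution searches for the
    # weights that best combine several similarity measures on a benchmark.
    import numpy as np
    from scipy.optimize import differential_evolution
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    scores = rng.random((353, 3))   # scores[i, j]: pair i under measure j
                                    # (e.g. WordNet path, IC, embedding cosine)
    gold = rng.random(353)          # human similarity ratings for each pair

    def neg_fitness(weights):
        combined = scores @ weights          # weighted sum of the measures
        rho, _ = spearmanr(combined, gold)   # rank correlation with gold
        return -rho                          # DE minimizes, so negate

    result = differential_evolution(neg_fitness,
                                    bounds=[(0.0, 1.0)] * scores.shape[1],
                                    seed=0)
    print('weights:', result.x, 'correlation:', -result.fun)

Each candidate weight vector is one individual; DE's mutation over the differences between individuals performs the heuristic global search described above, with no gradient of the correlation objective required.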
In summary, compared against existing methods, the three proposed semantic similarity computational methods focus on integrating semantic information derived from heterogeneous resources to improve lexical semantic similarity measurement. Their applicability depends on the type and scale of the semantic resources as well as on the type of evaluation task.

Keywords/Search Tags: Knowledge integration, Semantic similarity measurement, IC quantitative model, Semantic augmentation, Low-dimensional word embedding, Differential evolution