Research And Implementation Of Semantic Search Engine Based On Statistical Characteristics

Posted on: 2016-02-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Zhu
Full Text: PDF
GTID: 2298330467492979
Subject: Communication and Information Engineering
Abstract/Summary:
Resource search, which returns a set of appropriate resources based on the user's needs, is an indispensable key technology in resource management. Traditional resource search returns exactly matched resources using keyword matching. However, natural language contains synonyms and polysemous words, and the same concept can be expressed in many different ways, so traditional resource search has two problems. First, it is difficult for users to express their real information need with a keyword or keyword string. Second, the number of words two documents have in common is not a credible measure of their similarity.

In this paper, we mine the latent semantic features of words and documents from the perspective of statistical characteristics. For words, we propose extraction algorithms for Chinese synonyms; for documents, we introduce document semantic vectors and combine them with NBSVM-bi to improve accuracy in sentiment analysis. To address the information loss caused by bias in the user's query, we extract distributed embedding representations of words with the neural network language models CBOW and Skip-Gram, then combine them with a random forest classifier to extract Chinese synonyms. Expanding the query words with these synonyms overcomes the information loss caused by query bias. For document similarity calculation, we introduce a document vector to improve the CBOW and Skip-Gram models. The document vector is joined with the word vectors as input to train the neural network model, and the final document vector can be regarded as the semantic features of the document. Computing document similarity from these semantic features solves the low credibility of similarity computed by counting the words two documents have in common.

Finally, we implemented a semantic search engine for television programs based on this research on distributed embedding representations of words and semantic features of documents. We added a module that expands the query words with synonyms and rewrote the ranking module to use the documents' semantic vectors, thereby achieving semantic search.
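
The two modules described above, synonym expansion of the query and ranking by document semantic vectors, can be illustrated with a minimal sketch. It is not the thesis implementation: the three-dimensional vectors are made-up toy values standing in for trained CBOW/Skip-Gram and document vectors, a plain cosine threshold stands in for the random forest synonym classifier, and averaging word vectors to obtain a query vector is an assumption. All function and variable names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def expand_query(terms, word_vectors, threshold=0.9):
    """Add words whose embedding is close to a query term.

    Assumption: a cosine threshold replaces the random forest classifier
    that the abstract mentions for synonym extraction.
    """
    expanded = list(terms)
    for term in terms:
        if term not in word_vectors:
            continue
        for candidate, vec in word_vectors.items():
            if candidate not in expanded and cosine(word_vectors[term], vec) >= threshold:
                expanded.append(candidate)
    return expanded

def query_vector(terms, word_vectors):
    """Average the vectors of known terms (an assumed way to embed a query)."""
    known = [word_vectors[t] for t in terms if t in word_vectors]
    if not known:
        return []
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def rank_documents(q_vec, doc_vectors):
    """Sort documents by cosine similarity to the query vector, best first."""
    scored = [(doc_id, cosine(q_vec, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical values, not trained embeddings).
word_vectors = {
    "film": [0.90, 0.10, 0.00],
    "movie": [0.88, 0.12, 0.02],
    "news": [0.00, 0.20, 0.95],
}
doc_vectors = {
    "doc_movie_review": [0.80, 0.20, 0.10],
    "doc_news_bulletin": [0.10, 0.10, 0.90],
}

expanded_terms = expand_query(["film"], word_vectors)
ranking = rank_documents(query_vector(expanded_terms, word_vectors), doc_vectors)
print(expanded_terms)
print(ranking)
```

With these toy values, the query "film" is expanded with "movie" even though the two strings share no characters, and the movie review ranks above the news bulletin without any keyword overlap between query and document, which is the behavior keyword matching cannot provide.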
Keywords/Search Tags: neural network language model, distributed embedding representation, CBOW model, Skip-Gram model, semantic search