Research And Implementation Of Semantic Search Engine Based On Statistical Characteristics

Posted on:2016-02-28

Degree:Master

Type:Thesis

Country:China

Candidate:Y K Zhu

Full Text:PDF

GTID:2298330467492979

Subject:Communication and Information Engineering

Abstract/Summary:

Resource search, which returns an appropriate set of resources based on the user's needs, is an indispensable key technology in resource management. Traditional resource search returns exactly matching resources using keyword matching. However, natural language contains synonyms and polysemous words, and the same concept can be expressed in many different ways, so traditional resource search has two problems. First, it is difficult for users to express their real needs with a keyword or keyword string. Second, it is unreliable to judge the similarity of two documents by the number of words they have in common.

In this paper, we mine the latent semantic features of words and documents from the perspective of statistical characteristics. For words, we propose algorithms for extracting Chinese synonyms. For documents, we introduce document semantic vectors and, by combining them with NBSVM-bi, improve accuracy in sentiment analysis. To address the information loss caused by biased user queries, we extract distributed embedding representations of words using neural network language models (CBOW and Skip-Gram) and combine them with a random forest classifier to extract Chinese synonyms. Expanding the query words with these synonyms overcomes the information loss caused by biased user queries. For document similarity calculation, we introduce a document vector to improve the CBOW and Skip-Gram models. The document vector is combined with the word vectors as joint input to train the neural network model, and the final document vector can be regarded as the semantic features of the document. Computing document similarity from these semantic features addresses the low credibility of similarity scores obtained by counting the words two documents have in common.

Finally, we implemented a semantic search engine for television programs based on this research into distributed embedding representations of words and semantic features of documents. We added a module that expands query words with synonyms and rewrote the ranking module to use the documents' semantic vectors, thereby achieving semantic search.
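
The two search-side steps described above (expanding the query with synonyms, then ranking by document semantic vectors) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the toy vectors, the names `WORD_VECTORS`, `DOC_VECTORS`, `expand_query`, and `rank_documents`, and the 0.95 threshold are all hypothetical. The sketch also swaps the random forest synonym classifier for a plain cosine-similarity threshold, and it assumes word and document vectors live in the same space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings; real ones would come from a trained
# CBOW or Skip-Gram model and have far more dimensions.
WORD_VECTORS = {
    "film": [0.9, 0.1, 0.0],
    "movie": [0.85, 0.15, 0.05],
    "news": [0.1, 0.9, 0.1],
    "football": [0.0, 0.2, 0.95],
}

# Hypothetical document vectors, assumed to share the word-vector space.
DOC_VECTORS = {
    "movie_channel_guide": [0.8, 0.2, 0.1],
    "evening_news": [0.1, 0.95, 0.05],
    "sports_live": [0.05, 0.1, 0.9],
}

def expand_query(terms, word_vectors, threshold=0.95):
    """Append vocabulary words whose vector is close to any query term.

    Assumption: a cosine threshold stands in for the random forest
    synonym classifier mentioned in the abstract.
    """
    expanded = list(terms)
    for term in terms:
        if term not in word_vectors:
            continue
        for candidate, vec in word_vectors.items():
            if candidate in expanded:
                continue
            if cosine(word_vectors[term], vec) >= threshold:
                expanded.append(candidate)
    return expanded

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def rank_documents(terms, word_vectors, doc_vectors):
    """Rank documents by cosine similarity to the averaged query vector."""
    known = [word_vectors[t] for t in terms if t in word_vectors]
    if not known:
        return []
    query_vec = mean_vector(known)
    scored = [(doc_id, cosine(query_vec, vec))
              for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

expanded_terms = expand_query(["film"], WORD_VECTORS)
ranking = rank_documents(expanded_terms, WORD_VECTORS, DOC_VECTORS)
```

In this toy setup a query for "film" also picks up "movie", so a document that never contains the literal query word can still rank first if its semantic vector is close to the query.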

Keywords/Search Tags:

neural network language model, distributed embedding representation, CBOW model, Skip-Gram model, semantic search

Related items

1	Research On Query Optimization And Vectorization Technique In Document Retrieval
2	The Optimization And Implementation Of The Efficiency And Performance Of Chinese Language Model Based On Recurrent Neural Network
3	N-gram Language Model Based On Distributed System
4	Researching And Building Of The Mongolian Large Vocabulary Independent Continuous Speech Recognition System
5	Mongolian Language Model Based On Recurrent Neural Network
6	Research On Jointly Learning Word Embeddings And Latent Topics In Text
7	Research On Word2vec Algorithm Based On Context Distance
8	Application Research On Statistical Language Model Of Large Vocabulary Continuous Speech Recognition System
9	Mining Of Semantic Similar Items Based On Cross-Language Mapping
10	Research And Application Of Multilingual Text Embedding Model