
The Research And Implementation Of Microblog Retrieval System Based On Word Embeddings

Posted on: 2018-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: W T Xu
GTID: 2348330533455245
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet and information technology, microblogging plays an increasingly important role in daily life as a social networking platform where people publish and consume information. The volume of data on microblog platforms grows continuously. Finding information that meets users' needs accurately and efficiently within this data is an urgent problem in microblog retrieval. Most current microblog retrieval relies on traditional keyword matching. Keyword matching offers efficient computation and a mature term-weighting theory. However, it retrieves microblog documents by literal matching only, so it returns only documents that contain the query words. It cannot capture the contextual relationships within microblog documents, and it cannot retrieve documents that are relevant to the query but do not contain the query terms. This widens the gap between the retrieval results and the user's information need and leads to a poor search experience.

In recent years, word embedding techniques have emerged that learn the contextual semantic information of terms from large-scale text corpora. To address the problem above, this thesis proposes a microblog retrieval algorithm based on word embeddings, named MRA-E. The algorithm uses the contextual semantic relationship between the query and the microblog document to perform retrieval, and can thoroughly mine the microblog documents related to the query. The main ideas of MRA-E are: (1) using the Skip-gram model's ability to capture semantics to obtain word embeddings rich in semantic information; (2) obtaining vector representations of microblog documents and queries as weighted averages of word vectors; (3) because the microblog document set is very large, adopting a two-step retrieval strategy to reduce computational complexity and improve retrieval performance, which first pre-retrieves microblog documents with an improved BBF approximate K-nearest neighbor algorithm and then ranks them with a simplified WMD document distance algorithm.

Building on the MRA-E algorithm, this thesis designs and implements a microblog retrieval system. Obtaining high-quality word embeddings is a difficult point in building the system. To this end, a microblog data acquisition module is implemented specifically to collect microblog data from the Sina microblog website. To improve the quality of the input corpus for word embedding training, the microblog data preprocessing module applies hashtag processing, word segmentation, and stop word removal to the microblog data. In addition, because microblogs are short texts, a long-text Chinese Wikipedia corpus is added to the training corpus so that the word embeddings capture the contextual semantic relations between words more fully. Finally, experiments are conducted on the crawled microblog data. The results show that retrieval with MRA-E achieves better retrieval results than the traditional keyword-matching method, which verifies the feasibility of the proposed algorithm.
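The two-step idea in the abstract (averaged word vectors for a coarse pre-retrieval, then a word-level document distance for ranking) can be illustrated with a minimal sketch. This is not the thesis's implementation. The toy embeddings are made-up values, the uniform default weights are an assumption (the abstract does not say which weights are used), brute-force cosine search stands in for the improved BBF approximate K-nearest neighbor step, and a relaxed nearest-word distance stands in for the simplified WMD, whose exact form the abstract does not give. All function names are hypothetical.

```python
import math

# Toy 2-d embeddings (hypothetical values). A real system would load
# Skip-gram vectors trained on the microblog and Wikipedia corpus.
EMB = {
    "rain": [1.0, 0.0],
    "storm": [0.9, 0.1],
    "weather": [0.8, 0.2],
    "phone": [0.0, 1.0],
    "camera": [0.1, 0.9],
}
DIM = 2


def doc_vector(tokens, weights=None):
    """Weighted average of the word vectors of the known tokens."""
    weights = weights or {}
    total = [0.0] * DIM
    weight_sum = 0.0
    for t in tokens:
        if t in EMB:
            w = weights.get(t, 1.0)  # assumption: uniform weight by default
            total = [a + w * x for a, x in zip(total, EMB[t])]
            weight_sum += w
    return [a / weight_sum for a in total] if weight_sum else total


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (nu * nv) if nu and nv else 0.0


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def relaxed_distance(query, doc):
    """Mean distance from each query word to its nearest document word.

    A stand-in for the simplified WMD; smaller means more similar.
    """
    q = [EMB[t] for t in query if t in EMB]
    d = [EMB[t] for t in doc if t in EMB]
    if not q or not d:
        return math.inf
    return sum(min(euclid(a, b) for b in d) for a in q) / len(q)


def search(query, docs, k=2):
    """Return indices of the k pre-retrieved documents, best match first."""
    qv = doc_vector(query)
    # Step 1: pre-retrieve k candidates by document-vector similarity
    # (brute force here, standing in for approximate nearest neighbor search).
    by_similarity = sorted(
        range(len(docs)), key=lambda i: -cosine(qv, doc_vector(docs[i]))
    )
    candidates = by_similarity[:k]
    # Step 2: rank only the candidates with the costlier word-level distance.
    return sorted(candidates, key=lambda i: relaxed_distance(query, docs[i]))


DOCS = [["storm", "weather"], ["phone", "camera"], ["weather"]]
print(search(["rain"], DOCS, k=2))  # prints [0, 2]
```

None of the toy documents contains the query word "rain", yet the two weather-related documents are returned and the unrelated one is filtered out in the first step, which is the behavior keyword matching cannot provide.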
Keywords/Search Tags: Microblog Retrieval, Word Embeddings, Information Retrieval, MRA-E, Skip-gram