
Research On Query Optimization And Vectorization Technique In Document Retrieval

Posted on: 2019-08-30 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Z Y Xiong
GTID: 1368330623950404 | Subject: Computer Science and Technology
Abstract/Summary:
With the popularization of the internet and the rapid development of computing technology, processing massive volumes of network information has become an important research subject in big data processing. People not only publish and acquire information on the internet; more importantly, they extract data from it to benefit everyday life and to produce new economic and social gains. Extracting and using results from massive document collections is therefore attracting growing attention and has wide application prospects. The data queried in documents is divided into structured and unstructured types. Classical querying methods include the Boolean model based on set theory, the vector space model based on algebra, the probabilistic model based on probability and statistics, and the machine learning model based on statistics. Each of these models handles a user's query by computing a matching score for every document in the document set, ranking the documents by that score, and returning the ranked list as the query result. With the rapid growth of document information, classical document querying techniques are limited in the matching accuracy of their results as well as in querying efficiency and performance. New techniques must handle increasingly complex and large document data, so accurate and efficient querying techniques are expected to develop rapidly. Research on revised and optimized querying models, and on distributed vectorization techniques based on deep learning, is therefore of significant theoretical and practical value. The main work and achievements of this dissertation are as follows:

1. To address semantic loss, we propose a revised higher-order proximity scoring model for querying a document set, based on the Markov random field (MRF) and the Lkp model. Experimental results show that the revised model differs from the original model in its query score calculation, which demonstrates its advantage in higher-order proximity queries.

2. Because topic sentences generally appear near the beginning or the end of a document, we introduce an interval tree for computing the matching score of a query. Combining interval tree scoring with the ScoreComp and FreqComp models respectively yields a new document querying model. A comparative analysis of the two combinations shows that the ScoreComp model based on the interval tree captures the semantic relations between terms more sensitively.

3. To address the long training time of distributed word vectors, we establish an optimization strategy for querying based on the n-gram, CBOW, Skip-Gram, and hierarchical Softmax models, and propose an extended optimized model for constructing distributed word vectors. Experimental results show that the new optimized models, CBOW-OR and skipgram-OR, are more reasonable than the CBOW and Skip-Gram models in indirectly expressing the semantic relations between words.

4. To address blind learning in the construction of distributed paragraph vectors, we propose a hybrid method that combines the CBOW model with a deep learning CNN model. The paragraph vectors generated by the hybrid method are more reasonable than those obtained by the CBOW model alone in expressing paragraph topics.
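
The classical ranking step described in the abstract, in which each document receives a matching score against the query and the results are sorted by that score, can be illustrated with the vector space model. The following is a minimal sketch only, not the dissertation's method: the TF-IDF weighting, the smoothed IDF formula, the whitespace tokenizer, and the function names `tokenize`, `tfidf_vectors`, `cosine`, and `rank` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of document strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) so that no weight is zero or undefined.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by descending matching score."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    # Ties are broken by document index to keep the ordering deterministic.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    docs = [
        "word vector model for document retrieval",
        "interval tree scoring",
        "markov random field proximity model",
    ]
    for score, idx in rank("word vector", docs):
        print(f"{score:.3f}  {docs[idx]}")
```

A bag-of-words score of this kind ignores word order and term proximity, which is the kind of semantic loss that the proximity-based and distributed-vector models summarized above aim to reduce.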
Keywords/Search Tags:Document information querying, Distributed word vector, Distributed paragraph vector, CBOW model, Skip-Gram model