Design And Implementation Of News Long Text Retrieval Method

Posted on: 2022-11-26
Degree: Master
Type: Thesis
Country: China
Candidate: S T Liu
Full Text: PDF
GTID: 2518306764477234
Subject: Journalism and Media
Abstract/Summary:
In this era of explosive growth in Internet information, people's demands for acquiring and processing information have grown rapidly. Long texts such as news articles are available from many channels, and their volume increases every day. This growth causes problems such as information overload and difficulty in retrieval. Traditional keyword retrieval and short text retrieval match the input keywords against an inverted index built over the text data in the database and return the top N results. Applied to long news texts at large data volumes, this approach gives unsatisfactory results.

To address these problems, this thesis proposes and implements a solution that combines text vectorization with vector retrieval and applies it to long news text retrieval. The early stage of the research analyzes the shortcomings of existing text retrieval technology and finds that traditional keyword retrieval and semantic retrieval simply divide the text into paragraphs and retrieve segment by segment, so the results remain inaccurate and of low relevance. Based on this analysis, the thesis redesigns the algorithm architecture and verifies its validity in the corresponding application scenarios.

First, a long news text dataset is constructed. A collection system uses crawlers to gather Internet news, and the data stored in the database is preprocessed to remove useless and dirty records so that it suits long news text retrieval. Second, text vectorization technology is introduced. A pre-trained model converts long news texts into numerical vectors, and these vectors are searched with vector retrieval technology, which makes long news texts retrievable. Third, multiple models are mixed. Following the characteristics of text vectorization and building on the BERT model, the RoBERTa, DistilBERT, and XLNet models are mixed so that text vectorization can be applied to long news text retrieval tasks. Combining multi-model hybrid technology in this way improves the efficiency of long news text retrieval.

The retrieval method was then integrated into a composite system that combines long news text collection with long news text retrieval. Finally, several groups of experiments and system tests were designed to verify the effectiveness of the scheme. In the experiments, the hybrid model is compared with multiple single models, which demonstrates its effectiveness, and the relevant components are verified. Analysis of the experimental results shows that the proposed model performs well on long news text retrieval. The system tests cover user management, long news text retrieval, data management, crawler configuration, and front-end and back-end interaction. The result is an easy-to-use long news text retrieval system with relatively complete functionality.
Keywords/Search Tags:BERT, Multi-model Hybrid, Text Vectorization, Long Text Retrieval
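
The pipeline described in the abstract (vectorize each document with several models, mix the vectors, then retrieve by vector similarity) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the thesis's implementation: `make_toy_encoder` is a hypothetical stand-in that hashes tokens into a fixed-size vector so the example runs without downloading any model, and mixing by concatenation in `hybrid_embed` is only one plausible reading of "multi-model hybrid", since the abstract does not say how the models are combined. A real system would replace the toy encoders with pre-trained models and the linear scan in `search` with a vector index.

```python
import hashlib
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Encoder = Callable[[str], Vector]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def make_toy_encoder(name: str, dim: int = 64) -> Encoder:
    """Hypothetical stand-in for a pre-trained encoder (NOT a real model).

    Each token is hashed, salted with the encoder name, into one of `dim`
    buckets, so different names give different (toy) vector spaces.
    """
    def encode(text: str) -> Vector:
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.sha256((name + ":" + token).encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        return l2_normalize(vec)
    return encode


def hybrid_embed(text: str, encoders: List[Encoder]) -> Vector:
    """Assumed mixing strategy: concatenate each encoder's vector, then renormalize."""
    combined: Vector = []
    for encode in encoders:
        combined.extend(encode(text))
    return l2_normalize(combined)


def build_index(docs: Dict[str, str], encoders: List[Encoder]) -> Dict[str, Vector]:
    """Vectorize every document once, ahead of query time."""
    return {doc_id: hybrid_embed(text, encoders) for doc_id, text in docs.items()}


def search(query: str, index: Dict[str, Vector], encoders: List[Encoder],
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the top_n (doc_id, cosine similarity) pairs by brute-force scan."""
    q = hybrid_embed(query, encoders)
    scored = [(doc_id, sum(a * b for a, b in zip(q, vec)))
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # Three toy encoders standing in for the mixed pre-trained models.
    encoders = [make_toy_encoder(n) for n in ("stand_in_1", "stand_in_2", "stand_in_3")]
    news = {
        "n1": "central bank raises interest rates to curb inflation",
        "n2": "local football team wins the championship final",
    }
    index = build_index(news, encoders)
    for doc_id, score in search("interest rates and inflation", index, encoders, top_n=2):
        print(doc_id, round(score, 3))
```

Because all vectors are unit length, the dot product in `search` equals cosine similarity, so a query identical to a stored document scores 1.0. Concatenation keeps each model's signal separate at the cost of a longer vector; averaging or a learned weighting would be alternative mixing choices.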