
Research On Network News Corpus Construction And Its Distributed Retrieval System

Posted on: 2020-03-07
Degree: Master
Type: Thesis
Country: China
Candidate: S Lu
Full Text: PDF
GTID: 2428330578952714
Subject: Software engineering
Abstract/Summary:
An online news corpus draws on corpus linguistics and related theories, and uses web crawlers and other technical means to randomly collect real news texts from the Internet and assemble them into a corpus of a certain scale. Because news is one of the most common forms of text online, it can be used to discover habits of Internet language use as well as valuable information such as news trends and how they change. In short, an online news corpus makes it possible to explore many linguistic rules and patterns that purely theoretical approaches have overlooked, and it supports a wide range of research in natural language processing. Research on online news corpora is therefore of great value. In addition, with the continuing development of computer application technology and the steady improvement of personal computer performance, researchers can now make full use of Internet resources to build corpora suited to their own needs.

On this basis, this thesis uses web crawler technology to collect about 2 million online news articles in eight categories from the last five years and builds a network news corpus from them. A distributed retrieval system based on Elasticsearch is also designed and implemented. The system uses a B/S (browser/server) architecture, follows the MVC software design pattern, and delivers excellent retrieval results.

The main work of this thesis is as follows. First, it studies the technical principles of web crawlers, the most important technology in building an online news corpus, together with the other techniques that web crawlers rely on. Second, it studies the theoretical basis of full-text search: the principles of full-text search, word segmentation algorithms, the inverted index and its role in full-text search, tf-idf weighting as a measure of how important a term is to a news document, and the vector space model, which applies vector theory from linear algebra to compute text similarity. Third, it designs and completes the construction of the network news corpus, and designs and implements a distributed retrieval system based on Elasticsearch.

Through this work, the thesis completes the construction of a sufficiently representative, high-quality, large-scale network news corpus, and realizes a distributed retrieval system with fast retrieval response and high availability.
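The full-text search concepts named above (inverted index, tf-idf weighting, and vector space similarity) can be illustrated with a minimal Python sketch. It is not the thesis's implementation and not how Elasticsearch scores documents internally. The function names, the whitespace tokenizer, and the sample documents are assumptions made for illustration; real Chinese news text would need a proper word segmenter.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: whitespace tokenization stands in for a real word segmenter.
    return text.lower().split()


def build_inverted_index(docs):
    # Map each term to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def tfidf_vector(text, index, n_docs):
    # Weight = (term frequency in this text) * log(N / document frequency).
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    vector = {}
    for term, count in counts.items():
        df = len(index.get(term, ()))
        if df == 0:
            continue  # term never seen in the corpus, so it cannot match
        vector[term] = (count / total) * math.log(n_docs / df)
    return vector


def cosine(a, b):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    # Use the inverted index to find candidates, then rank them by cosine similarity.
    index = build_inverted_index(docs)
    query_vec = tfidf_vector(query, index, len(docs))
    candidates = set()
    for term in query_vec:
        candidates |= index[term]
    scored = [
        (doc_id, cosine(query_vec, tfidf_vector(docs[doc_id], index, len(docs))))
        for doc_id in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [
    "stock market news today",
    "football match news",
    "stock prices fall",
]
results = search("stock market", docs)
```

Here the inverted index narrows the search to documents that share at least one term with the query, so only those candidates are scored. The first document shares both query terms and ranks highest, while the second shares none and is never scored.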
Keywords/Search Tags: corpus construction, full-text search system, web crawler, Elasticsearch