Parallelization Research And Design Of The Topic Web Crawler

Posted on: 2018-01-19 | Degree: Master | Type: Thesis
Country: China | Candidate: J Y Wang | Full Text: PDF
GTID: 2358330518478291 | Subject: Software engineering
Abstract/Summary:
With the popularity of the mobile Internet, data is generated ever faster and in ever larger volumes. Although search engines return enough query results to satisfy ordinary users, the results are not sufficient to support researchers who need subject-area data for mining and analysis. This paper takes the acquisition of subject information as its research problem and, according to actual needs, uses a web crawler to collect relevant data from the Internet efficiently. We combine the idea of cluster parallelization with an improved web-page similarity judgment method to collect pages and determine their relevance to the subject. The research work is divided into three parts: the working principle of the crawler and related background, the parallelization of the crawler, and the judgment of text-subject relevance during data acquisition.

First, since the crawler is an important component of a search engine, this paper takes search engines and the HTTP protocol as its starting point and then studies the crawler's acquisition process.

Second, on the basis of the common crawler workflow, a multi-strategy fusion search algorithm is proposed that combines common search strategies, improving on the original search and roughly doubling its efficiency.

Next, the scale of Internet data motivates a parallel crawler to improve efficiency. Based on the general crawler workflow, and according to the needs and characteristics of each part of the crawler, a suitable parallel framework is adopted for each stage: RabbitMQ serves as the URL queue, the in-memory database Redis removes duplicate records, the parallel computing framework Storm processes the web-page data, and the distributed database MongoDB stores the results.

Finally, a combination of the vector space model and a semantic discrimination algorithm is used to identify the topic of each web page.

On this basis, the paper completes the design and implementation of the system architecture, and the system is tested with "big data" as the topic. Inspection of the collected web data shows that the system has a certain capability for text-subject recognition. Parallel acquisition improves the system's efficiency and stability and addresses the difficulty small and medium-sized crawlers have in obtaining subject-related web pages independently. The collected data also plays an important role in subsequent analysis and processing.
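To make the parallel pipeline concrete, the following is a minimal sketch of the URL-frontier step described above, with Redis acting as the duplicate-URL filter in front of the RabbitMQ queue. The host addresses, the queue name url_frontier, and the key seen_urls are illustrative assumptions, not values taken from the thesis.

```python
import hashlib

import pika
import redis

# Redis holds the set of already-seen URL digests; RabbitMQ is the
# shared frontier queue that parallel crawler workers consume from.
r = redis.Redis(host="localhost", port=6379)
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="url_frontier", durable=True)

def enqueue_if_new(url: str) -> bool:
    """Push url onto the frontier only if it has not been seen before."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    # SADD returns 1 only when the member is newly added, so the
    # membership test and the insert happen in one atomic operation.
    if r.sadd("seen_urls", digest) == 0:
        return False  # duplicate: already crawled or already queued
    channel.basic_publish(
        exchange="",
        routing_key="url_frontier",
        body=url,
        properties=pika.BasicProperties(delivery_mode=2),  # persist message
    )
    return True
```

Because SADD tests and inserts in a single atomic step, concurrent workers cannot race each other into enqueueing the same URL twice.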
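The topic-relevance judgment combines the vector space model with a semantic discrimination algorithm; the semantic half is not detailed in the abstract, so the sketch below covers only the vector-space half: the page and the topic are represented as term-frequency vectors and compared by cosine similarity. The threshold value is an assumed placeholder, not the thesis's setting.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def is_on_topic(page_terms: list[str], topic_terms: list[str],
                threshold: float = 0.2) -> bool:
    """Keep the page if its similarity to the topic vector clears the cutoff."""
    return cosine_similarity(Counter(page_terms), Counter(topic_terms)) >= threshold
```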
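Last, a hedged sketch of the MongoDB persistence step at the end of the pipeline; the database, collection, and field names here are assumptions for illustration only.

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
pages = client["crawler"]["pages"]

def save_page(url: str, title: str, text: str, relevance: float) -> None:
    # Upsert keyed on the URL, so a re-crawled page replaces its stale copy
    # instead of accumulating duplicates in the collection.
    pages.update_one(
        {"url": url},
        {"$set": {
            "title": title,
            "text": text,
            "relevance": relevance,
            "fetched_at": datetime.now(timezone.utc),
        }},
        upsert=True,
    )
```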
Keywords/Search Tags:Parallelization, Crawler, Text Processing, Subject Similarity