Design And Implementation Of Search System Based On Scrapy-redis And GMM

Posted on:2020-12-06

Degree:Master

Type:Thesis

Country:China

Candidate:X Q Bu

GTID:2518306473485364

Subject:Software engineering

Abstract/Summary:

The rapid development of the Internet brings challenges as well as opportunities. Network data keeps expanding and massive amounts of information accumulate, yet finding the valuable information that matches a specific need remains difficult, so information retrieval is still an active topic. A search system is, in essence, a large information retrieval system, and the web crawler is one of its key components: its main job is to crawl webpage data from the Internet and store it. A stand-alone crawler, however, is limited and inefficient and can no longer meet the growing demand for data, so distributed crawlers have gradually become a research focus. At the same time, combining data mining analysis with a search system makes it possible to retrieve valuable information more effectively, which is of research value to both individuals and information providers in the era of big data.

First, by studying distributed crawler framework technology, this thesis selects the master-slave crawler architecture, which is simple to operate and easy to coordinate, together with the breadth-first traversal crawling strategy, which facilitates collaborative crawling by multiple crawlers. It then analyzes the performance of the Bloom Filter algorithm and compares several improved Bloom Filters. Finally, an improved Bloom Filter based on i-DBF with multi-dimensional K-segment mapping, suited to the de-duplication module of a distributed crawler, is introduced and integrated into the de-duplication function of the scheduler module of the Scrapy-redis crawler framework to reduce the memory occupied during data crawling.

Second, taking crawled data from an online shopping platform as an example, this thesis analyzes the information provided by each store from the platform's perspective. The data source is the information on women's shops on Mogujie ("Mushroom Street"). From the perspective of information providers, it sets the grouping of stores as the mining target and establishes a store value evaluation model. Based on an analysis of the Gaussian mixture model (GMM) clustering algorithm, the EM (Expectation Maximization) algorithm is used to estimate the model parameters. The slow convergence and initial-value sensitivity of the EM algorithm are studied and comparative experiments are carried out; as a result, the IAFSA-Kmeans-EM algorithm, which is optimized for convergence speed and global search, is chosen for parameter estimation. The final clustering results are then applied to the ranking of search results, which provides a reference for users' further searches. Finally, the search function is implemented on the collected data using Elasticsearch. The data crawling experiments show that distributed crawlers are significantly more efficient than stand-alone crawlers, and that data mining analysis combined with a search system can effectively retrieve more valuable information.
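
To make the de-duplication idea concrete, the following is a minimal sketch of a plain Bloom filter used to decide whether a URL has already been seen. It is not the i-DBF-based multi-dimensional K-segment mapping variant described in the abstract, whose details are not given here. The class name, the `seen_before` helper, the bit-array size, and the hash count are all hypothetical choices made for illustration. In a distributed crawler the bit array would have to live in shared storage such as Redis rather than in process memory; this sketch keeps it local for brevity.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over a fixed-size bit array (illustrative only)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # num_bits and num_hashes are assumed defaults, not values from the thesis.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing: h1 + i * h2).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        # Set every bit that the item maps to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def seen_before(bloom, url):
    """Return True if url was probably seen; otherwise record it and return False."""
    if url in bloom:
        return True
    bloom.add(url)
    return False
```

A scheduler would call `seen_before` on each candidate URL and enqueue only those for which it returns `False`. The memory cost is fixed by `num_bits` regardless of how many URLs are recorded, at the price of occasional false positives (a new URL wrongly reported as seen).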

Keywords/Search Tags:

Search System, Distributed Crawler, BloomFilter, GMM, IAFSA-Kmeans-EM

Related items

1	Design And Implementation Of Distributed Crawler System Based On Docker Cluster
2	Research Of Distributed Web Crawler Based On Hadoop
3	Design And Implementation Of A Distributed Crawler System Based On Scrapy Framework
4	Distributed Web Crawler System
5	Design And Implementation Of Distributed Network Crawler System
6	Design And Implementation Of Distributed Online Travel Search Crawler System
7	Distributed Web Crawler System Design And Implementation
8	Design And Implementation Of A Movie Search System Based On Distributed Crawler
9	Research Of A Distributed Web Crawler Search Engine Based On Web Information Collection
10	The Research On Web Crawler Technology Based On Distributed Calculation