
Design And Implementation Of Vertical Search Engine Based On Web Crawler

Posted on: 2020-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y Ding
Full Text: PDF
GTID: 2428330596473320
Subject: Software engineering
Abstract/Summary:
With the rapid development of Internet technology, users not only need to search large volumes of data but also demand greater accuracy and efficiency in search results. Vertical search engines have emerged to meet these needs. With the advent of the era of artificial intelligence, more and more users want to find information about artificial intelligence on the Internet accurately. This thesis therefore designs and implements a specialized, comprehensive vertical search engine for the field of artificial intelligence by strategically crawling and accurately filtering the massive volume of information on the Internet. The system consists of five modules: information collection and processing, index construction, user search, user registration and login, and back-end management. It provides users with accurate topic-specific search services. The main work is as follows:

(1) The shortcomings of the traditional Naive Bayesian classification algorithm are studied, and a Naive Bayesian classification algorithm based on Jensen-Shannon (JS) divergence feature weighting is proposed. Further analysis shows that the JS divergence value alone does not adequately represent the information carried by a feature word. The JS divergence is therefore further modified by combining word frequency, text frequency and class frequency. Feature words are assigned different weights according to their role in the classification result, which completes the improvement of the Naive Bayesian algorithm. Experiments show that the Naive Bayesian classification algorithm based on JS divergence feature weighting achieves better classification performance.

(2) Acquisition and processing of AI-related information. The Webmagic crawler framework is studied in depth. On top of it, a sub-module that judges the topic relevance of web page content and a sub-module that ranks links by topic relevance are added, yielding a topic crawler for the field of artificial intelligence. First, an artificial intelligence thesaurus and an initial seed link set are established as the basis for the subsequent web page classification. Second, the Webmagic framework is extended according to the specific requirements of the system, and the main functions of web page downloading, parsing, extraction and persistence are implemented. The feature words in the thesaurus serve as the feature attributes for web page classification. The Naive Bayesian algorithm based on JS divergence feature weighting is used to judge the topic relevance of web page content. At the same time, the PageRank algorithm is used to quantify the importance of the links in web pages, so that links can be ranked by topic relevance and high-quality links are crawled first.

(3) Index construction and user search. The relevant web page information collected by the crawler is imported into a Solr server, and the IK Analyzer tokenizer is configured in Solr. The index is built around the Solr server, which completes the user search function of the vertical search engine for the field of artificial intelligence.

(4) Based on the SSH framework, a web-crawler-based vertical search engine system for the field of artificial intelligence is implemented, providing functions such as user registration and login and back-end management. An attractive, interactive system is designed and implemented, and the system is tested effectively.
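The feature-weighting idea in item (1) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact weighting formula, so everything here is an assumption for illustration: each word is weighted by 1 plus the JS divergence between the class distribution of documents containing the word and the overall class prior, and the weight scales the word's log-likelihood in a multinomial Naive Bayes score. The further correction using word, text and class frequency is not reproduced. The function names (js_divergence, feature_weights, train, classify) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def feature_weights(docs, labels):
    """Assumed scheme: weight(w) = 1 + JS(P(class | w) || P(class))."""
    classes = sorted(set(labels))
    prior = [labels.count(c) / len(labels) for c in classes]
    doc_counts = defaultdict(Counter)  # word -> class -> documents containing it
    for words, label in zip(docs, labels):
        for w in set(words):
            doc_counts[w][label] += 1
    weights = {}
    for w, per_class in doc_counts.items():
        total = sum(per_class.values())
        posterior = [per_class[c] / total for c in classes]
        weights[w] = 1.0 + js_divergence(posterior, prior)
    return weights


def train(docs, labels):
    """Fit a multinomial Naive Bayes model with Laplace smoothing plus word weights."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d}
    word_counts = {c: Counter() for c in classes}
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like, feature_weights(docs, labels)


def classify(model, words):
    """Pick the class with the highest weighted log score; unseen words are ignored."""
    log_prior, log_like, weights = model
    scores = {
        c: log_prior[c] + sum(weights[w] * log_like[c][w] for w in words if w in weights)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
docs = [
    ["neural", "network", "training"],
    ["machine", "learning", "model"],
    ["football", "match", "score"],
    ["recipe", "cooking", "dinner"],
]
labels = ["ai", "ai", "other", "other"]
model = train(docs, labels)
print(classify(model, ["neural", "model"]))
```

In this sketch, words that occur evenly across classes keep a weight near 1 and behave as in plain Naive Bayes, while words concentrated in one class count for more.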
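The link-ranking step in item (2) relies on PageRank. As a rough sketch of how a crawler could order candidate links by PageRank score, the following Python code runs the standard power iteration over a small link graph. How the thesis combines PageRank with topic relevance is not described in the abstract, so this shows only the generic algorithm; the function names pagerank and order_frontier, the damping factor of 0.85 and the iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for t in targets:
                new_rank[t] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


def order_frontier(links):
    """Return pages sorted from highest to lowest PageRank score."""
    rank = pagerank(links)
    return sorted(rank, key=rank.get, reverse=True)


# Toy link graph, invented for illustration only.
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(order_frontier(graph))
```

A crawler built this way would pop links from the front of the ordered list, so pages that many other pages point to are fetched before rarely linked ones.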
Keywords/Search Tags: Vertical search engine, Artificial intelligence, Theme crawler, Text classification, Naive Bayesian algorithm