| At present, the most commonly used technique for detecting Web security vulnerabilities is the Web vulnerability scanner. The Web crawler is an important part of a Web vulnerability scanner: it is responsible for fetching the pages of a website to provide the data source and the scanning entry points for the scanner. A Web crawler is an intelligent program that crawls pages, and this paper mainly studies Web crawler technology. The major work includes the following aspects. Firstly, three typical Web crawlers are studied and their crawling strategies are examined. Several important algorithms are discussed, the existing Web vulnerability scanners based on Web crawler technology are analyzed, and four features of the scanning object are summarized. Secondly, through analysis of the features of the scanning object, a method is proposed to extract Web data based on the attribute tags of Web pages. It uses the tags of a Web page to construct a DOM tree with attribute tags; subtrees are compared by their attribute tags to find repetitive patterns in the tag sequence; three rules are defined to remove interfering patterns and identify data regions; a vector is used to record each repetitive pattern; and the data are extracted through the vector. Experiments are carried out to verify the effectiveness of the method, using Amazon product pages as the experimental object. According to the experimental data, this method can extract about 90% of the data in Amazon webpages. Both accuracy and coverage are very high. Thirdly, the attribute-tag method can extract the data from most webpages, but it does not work when the repetitive patterns are merely similar rather than identical. A Web data mining algorithm based on edit distance is proposed to solve this problem. It computes tree edit distance through string edit distance, uses string edit distance to assess the similarity between one tree and another, and then finds the repetitive patterns in webpages and mines the data. Experiments on webpages with different repetitive-pattern features demonstrate that this algorithm mines data not only from webpages of Feature One but also from webpages of Feature Two. It extracts all 1000 data records from 20 Baidu Tieba webpages. Finally, an intelligent crawler is designed and implemented. Its modules are described and a flow chart of each module is drawn. The crawler is programmed in Java, and experiments show that every module achieves its intended function. The crawler, which applies the new algorithms proposed in this paper to the formulation of its crawling strategy, can effectively fetch webpages from highly interactive websites such as e-commerce websites, Tieba, BBS and so on.... |
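
The idea of approximating tree similarity through string edit distance can be illustrated with a minimal sketch. It is not the paper's implementation (the paper's crawler is written in Java, and its exact serialization, rules, and thresholds are not given in the abstract). The sketch assumes that each subtree is flattened into its preorder sequence of tag names, that similarity is one minus the Levenshtein distance divided by the longer sequence length, and that adjacent siblings above a threshold are treated as one repetitive pattern. The names `Node`, `tag_sequence`, `similarity`, `similar_siblings` and the threshold value `0.7` are hypothetical choices made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a DOM element (hypothetical, not the paper's structure)."""
    tag: str
    children: list[Node] = field(default_factory=list)


def tag_sequence(node: Node) -> list[str]:
    """Flatten a subtree into its preorder sequence of tag names."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two sequences, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(t1: Node, t2: Node) -> float:
    """Assumed measure: 1 - distance / length of the longer tag sequence."""
    s1, s2 = tag_sequence(t1), tag_sequence(t2)
    return 1.0 - edit_distance(s1, s2) / max(len(s1), len(s2))


def similar_siblings(parent: Node, threshold: float = 0.7) -> list[Node]:
    """Return children that resemble at least one adjacent sibling."""
    kids = parent.children
    hits = []
    for i, kid in enumerate(kids):
        neighbours = kids[max(0, i - 1):i] + kids[i + 1:i + 2]
        if any(similarity(kid, n) >= threshold for n in neighbours):
            hits.append(kid)
    return hits
```

Under these assumptions, two list items whose tag sequences differ by a single extra tag (for example one post with an image and one without) still score above the threshold and are grouped together, whereas an exact tag-sequence comparison would treat them as different patterns.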