Research On Intelligent Crawler Technology Of Web Vulnerability Scanner

Posted on:2013-02-15

Degree:Master

Type:Thesis

Country:China

Candidate:L Huang

Full Text:PDF

GTID:2218330371461819

Subject:Signal and Information Processing

Abstract/Summary:

At present, the most common way to detect Web security problems is to use a Web vulnerability scanner. The Web crawler is an important part of such a scanner: it grabs information from a website's pages to provide the data source and the scanning entry points for the scanner. A Web crawler is a smart program for crawling pages, and this paper mainly studies Web crawler technology. The major work includes the following aspects:

Firstly, three typical Web crawlers are studied and their crawling strategies are examined. Several important algorithms are discussed, existing Web vulnerability scanners based on Web crawler technology are analyzed, and four features of the scanning object are summarized.

Secondly, a method for extracting Web data based on the attribute tags of Web pages is proposed, derived from an analysis of the features of the scanning object. It uses the tags of a Web page to construct a DOM tree with attribute tags; child trees are compared by their attribute tags to find repetitive patterns in the tag sequence; three rules are defined to remove interfering patterns and identify data regions; a vector records each repetitive pattern; and the data are extracted through that vector. Experiments on Amazon product pages verify the effectiveness of the method. According to the experimental data, the method extracts about 90% of the data in Amazon webpages, and both accuracy and coverage are very high.

Thirdly, the attribute-tag method can extract the data from most webpages, but it does not work when the repetitive patterns are only similar rather than identical. A Web data mining algorithm based on edit distance is proposed to solve this problem. It computes tree edit distance through string edit distance, uses string edit distance to assess the similarity between one tree and another, then finds the repetitive patterns in webpages and mines the data. Experiments on webpages with different repetitive-pattern features demonstrate that this algorithm mines data not only from webpages of Feature One but also from webpages of Feature Two. It extracts all 1,000 data items from 20 Baidu Tieba webpages.

Finally, an intelligent crawler is designed and implemented. Its modules are described and a flow chart of each module is drawn. The crawler is programmed in Java, and experiments show that every module achieves its intended function. The crawler, which applies the new algorithms proposed in this paper to the formulation of its crawling strategy, can grab webpages well from highly interactive websites such as e-commerce sites, Tieba, BBS and so on. ...
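
The idea of judging subtree similarity through string edit distance can be illustrated with a minimal Python sketch. This is not the thesis's implementation (which is in Java and is not reproduced in the abstract). It assumes that each child subtree is flattened into a pre-order sequence of tag names, that similarity is the Levenshtein distance normalized by the longer sequence, and that a threshold of 0.7 decides whether adjacent siblings belong to the same pattern. The names `TreeBuilder`, `tag_sequence`, `similarity` and `repeated_children`, and the sample HTML, are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a bare tag tree (no text, no attributes) from HTML."""
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:      # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent        # climb to the matching open tag
        if node is not self.root:
            self.current = node.parent


def tag_sequence(node):
    """Pre-order list of tag names: the 'string' form of a subtree (assumption)."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def edit_distance(a, b):
    """Classic Levenshtein distance over two sequences, row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def repeated_children(parent, threshold=0.7):
    """Runs of two or more adjacent children with similar tag sequences."""
    runs, run = [], []
    for child in parent.children:
        if run and similarity(tag_sequence(run[-1]), tag_sequence(child)) >= threshold:
            run.append(child)
        else:
            if len(run) >= 2:
                runs.append(run)
            run = [child]
    if len(run) >= 2:
        runs.append(run)
    return runs


# Hypothetical page fragment: three list items, one with an extra <img>.
SAMPLE = """
<ul>
  <li><a>Post 1</a><span>alice</span></li>
  <li><a>Post 2</a><span>bob</span><img></li>
  <li><a>Post 3</a><span>carol</span></li>
</ul>
<p>footer</p>
"""

builder = TreeBuilder()
builder.feed(SAMPLE)
ul = builder.root.children[0]
runs = repeated_children(ul)
print([[tag_sequence(n) for n in run] for run in runs])
```

In this sample the second item differs from its neighbours by one extra tag, so an exact comparison of tag sequences would split the list, while the normalized edit distance still groups all three items into one run. Flattening a tree into a tag sequence loses some structural information, so this is only an approximation of a true tree edit distance.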

Keywords/Search Tags:

Web security, Web crawler, Data mining, Repetitive pattern, Edit distance

Related items

1	The String Pattern Matching Algorithm Based On Edit Distance
2	Approximate Pattern Matching With Gap Constraint Under The Edit Distance
3	Web Information Extracting Based On Tree Edit Distance
4	Improved Edit Distance Algorithm And Its Application In E-government
5	Research And Realization Of Web Crawler And Results Clustering In Search Engine
6	Research Of Methods Of Data Cleaning For Hotel Entity Based On Edit Distance And Conditional Functional Dependencies
7	Based On Pattern Fusion Mining Frequent Co-location Long Mode
8	Study On Fast Algorithms For Edit Distance
9	Fuzzy Matching Based On Edit Distance Algorithm Of Chinese Technology In The Environment Of A Large Amount Of Data
10	Research On The Solving Technology Of Graph Edit Distance