
Design And Implementation Of A Highly Adaptive Domain Oriented Crawler System

Posted on: 2023-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: D B Li
Full Text: PDF
GTID: 2568306914483644
Subject: Software engineering
Abstract/Summary:
As the Internet continues to grow, the data on web pages holds great value, and extracting structured information from web pages has become particularly important. It is key to building large-scale knowledge bases and also supports many downstream applications, such as knowledge-aware question answering, personalized recommendation, and e-commerce product search. However, existing crawler programs face two main problems when obtaining data from web pages. First, a change in web page structure causes extraction to fail or to return incorrect data. Second, when crawling different websites in the same domain, extraction rules must be written in full for each site, which is highly repetitive and labor-intensive. This thesis therefore proposes a new web data extraction framework that identifies data from a semantic perspective, so that it no longer depends entirely on page structure and can adapt to page changes.

In this thesis, web pages are divided into list pages and detail pages according to their page structure and the way they present information. Within the framework, a web page type classification model based on a support vector machine first classifies the pages, and then a different extraction algorithm is applied to each page type: a list information extraction algorithm based on tree similarity for list pages, and a detail page extraction algorithm based on DOM tree structure and field name positioning for detail pages. Experimental results show that the classification algorithm classifies web pages with high accuracy, and that the two extraction algorithms obtain complete structured data with high extraction accuracy and adapt to structural changes in web pages. This indicates that the proposed framework meets the requirements of high adaptability and domain universality while ensuring data quality.

Based on the web data extraction framework, the design and implementation of the crawler system are also completed, including system requirements analysis, overall design, detailed design, and the implementation of each functional module. Finally, functional and non-functional tests are carried out on the system, and the system meets users' data collection needs.
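
The abstract does not give the details of the list extraction algorithm, so the following is only a minimal sketch of the general idea of tree similarity on a list page, not the thesis's actual method. It assumes that list records appear as sibling subtrees with near-identical tag structure, and it compares each sibling's flattened tag sequence with Python's `difflib.SequenceMatcher`. The names `Node`, `TreeBuilder`, `find_list_items`, `extract_records`, the `threshold` and `min_items` parameters, and the sample HTML are all hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM-like node: tag name, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def tag_sequence(node):
    """Flatten a subtree into its pre-order list of tag names."""
    seq = [node.tag]
    for child in node.children:
        seq.extend(tag_sequence(child))
    return seq


def similarity(a, b):
    """Structural similarity of two subtrees in [0, 1], based on tag sequences."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_list_items(root, threshold=0.8, min_items=3):
    """Return the largest group of sibling subtrees similar to their first sibling."""
    best = []
    for node in iter_nodes(root):
        kids = node.children
        if len(kids) < min_items:
            continue
        group = [k for k in kids if similarity(kids[0], k) >= threshold]
        if len(group) >= min_items and len(group) > len(best):
            best = group
    return best


def subtree_text(node):
    """Collect the non-empty text fragments of a subtree in document order."""
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(subtree_text(child))
    return parts


def extract_records(html):
    builder = TreeBuilder()
    builder.feed(html)
    return [subtree_text(item) for item in find_list_items(builder.root)]


SAMPLE_HTML = """
<html><body>
<div id="nav"><a>Home</a><a>About</a></div>
<ul>
<li><a>Item A</a><span>10</span></li>
<li><a>Item B</a><span>20</span></li>
<li><a>Item C</a><span>30</span></li>
</ul>
</body></html>
"""

records = extract_records(SAMPLE_HTML)
print(records)
```

Because the sketch matches on tag structure rather than on fixed paths or class names, renaming a CSS class or moving the list elsewhere in the page does not affect it. Comparing every sibling only against the first one is a simplification; a fuller implementation would cluster siblings and use a proper tree edit distance.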
Keywords/Search Tags: Web page collection, Structured data extraction, Domain oriented, Support vector machine