Design And Implementation Of A Highly Adaptive Domain Oriented Crawler System

Posted on:2023-11-17

Degree:Master

Type:Thesis

Country:China

Candidate:D B Li

Full Text:PDF

GTID:2568306914483644

Subject:Software engineering

Abstract/Summary:

As the Internet grows, the data on web pages carries increasing value, and extracting structured information from those pages has become particularly important. Such extraction is key to building large-scale knowledge bases and supports many downstream applications, such as knowledge-aware question answering, personalized recommendation, and e-commerce product search. However, existing crawlers face two main problems when obtaining data from web pages. First, a change in page structure causes extraction to fail or to return incorrect data. Second, crawling different websites in the same domain requires writing a complete set of extraction rules for each site, which is highly repetitive and labor-intensive.

This thesis therefore proposes a new web data extraction framework that identifies data from a semantic perspective, so that extraction no longer depends entirely on page structure and can adapt to page changes. Web pages are divided into list pages and detail pages according to their structure and the way they present information. Within the framework, a page type classification model based on a support vector machine first classifies each page, and a different extraction algorithm is then applied to each type: a list information extraction algorithm based on tree similarity for list pages, and an extraction algorithm based on DOM tree structure and field name positioning for detail pages. Experimental results show that the classification algorithm classifies web pages with high accuracy, and that the two extraction algorithms obtain complete structured data with high extraction accuracy while adapting to structural changes in web pages. This indicates that the proposed framework meets the requirements of high adaptability and domain universality while preserving data quality.

Based on this framework, the crawler system is designed and implemented, covering requirements analysis, overall design, detailed design, and the implementation of each functional module. Functional and non-functional tests show that the system meets users' data collection needs.
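The abstract does not give the details of the tree-similarity algorithm for list pages, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes that list items appear as sibling subtrees with near-identical tag structure, and it swaps in a simple Jaccard similarity over tag paths as the similarity measure. All names (`TreeBuilder`, `find_list_container`, `extract_items`), the `min_items` and `threshold` parameters, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.texts = []


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree (tags and text only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never receive children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.texts.append(text)


def tag_paths(node, prefix=""):
    """Set of root-to-node tag paths inside the subtree rooted at node."""
    path = f"{prefix}/{node.tag}"
    paths = {path}
    for child in node.children:
        paths |= tag_paths(child, path)
    return paths


def similarity(a, b):
    """Jaccard similarity of two subtrees' tag-path sets (assumed measure)."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_list_container(root, min_items=3, threshold=0.8):
    """Pick the node with the most children whose adjacent children all look alike."""
    best = None
    for node in iter_nodes(root):
        kids = node.children
        if len(kids) < min_items:
            continue
        scores = [similarity(x, y) for x, y in zip(kids, kids[1:])]
        if min(scores) >= threshold and (best is None or len(kids) > len(best.children)):
            best = node
    return best


def text_of(node):
    parts = node.texts + [text_of(child) for child in node.children]
    return " ".join(p for p in parts if p)


def extract_items(html):
    builder = TreeBuilder()
    builder.feed(html)
    container = find_list_container(builder.root)
    return [] if container is None else [text_of(c) for c in container.children]


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a><span>About</span></div>
<ul>
  <li><a href="/p/1">Paper A</a><span>2021</span></li>
  <li><a href="/p/2">Paper B</a><span>2022</span></li>
  <li><a href="/p/3">Paper C</a><span>2023</span></li>
</ul>
</body></html>
"""

items = extract_items(SAMPLE)
```

Because the sketch compares tag structure only, it does not depend on class names or ids, which is one way an extractor could tolerate cosmetic changes to a page. A real system would likely need a more robust tree-matching measure and handling for nested or interleaved lists.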

Keywords/Search Tags:

Web page collection, Structured data extraction, Domain oriented, Support vector machine

Related items

1	Research On Several Problems In Support Vector Machine And Support Vector Domain Description
2	Research And Implementation Of Content Oriented Web Page Classification
3	Research On Relation Extraction Of Person Entity In News Webpage
4	Research On Data Acquisition And Information Extraction Technology For Dynamic Web Applications
5	Researches On Some Problems In Nonparallel Hyperplanes Support Vector Machine And Feature Extraction
6	Research On Web Page Classification And Information Collection
7	Study On Object Tracking Algorithms Based On Structured Support Vector Machine
8	The Research Of Automatic Chinese Web Page Categorization Based On Support Vector Machine
9	Research On Some Problems Of Support Vector Machine Learning Algorithm
10	Research And Implementation Of Communication Administration Oriented E-Government System And Web Page Classification