Research On Data Extraction For Agency Website

Posted on:2019-07-04

Degree:Master

Type:Thesis

Country:China

Candidate:J N Liu

Full Text:PDF

GTID:2428330566998109

Subject:Computer Science and Technology

Abstract/Summary:

With the growing popularity of the Internet, the amount of information it carries increases day by day, and data mining for agency websites has become a hot research topic in the field of Web information mining. Many studies address web page classification, but they lack a clear and explicit representation of the structural features of web pages and mostly use structural features only to adjust the weights of textual features. Few of them deal with agency websites, so extraction methods for agency website information are lacking.

Because no open data set is available, agency websites are collected on a large scale with a distributed collection method and then preprocessed; useless HTML tags are removed to build the initial data set. Based on observation, agency web pages are summarized into 3 categories and 13 subcategories, and experts label each page with its category. An effective-character-based main information block positioning method is then used to locate the main information displayed on each page, and 9 types of web page structure and content features are proposed, such as the maximum proportion of effective characters among the sub-tags of the main information block, the largest difference between the effective characters of the information sub-tags, and the proportion of character information characters. These features are extracted through feature engineering, and the support vector machine (SVM) algorithm is used to build the classification model and classify the web pages. Finally, data are extracted according to the web page classification results. After observing and analyzing the characteristics of the information to be extracted from agency websites, two methods are proposed, information extraction based on trigger rules and information extraction based on LSTM, which produce structured extraction results.

In the experiments, classification models built with three algorithms (decision tree, neural network, and support vector machine) are compared using evaluation metrics such as accuracy and recall. The experiments show that the support vector machine performs well, and the parameters of the SVM model are then optimized. Following the single-variable principle, feature comparison experiments with the optimal SVM classification model show that each feature has a positive effect on agency web page classification, and that combining web page structure features with web content features classifies agency web pages more accurately. Finally, experiments are conducted on trigger-rule-based information extraction and LSTM-based information extraction; the extraction results are statistically analyzed and compared with existing algorithms. The experiments show that both methods achieve good results in extracting official website information.
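The trigger-rule extraction mentioned above can be illustrated with a minimal sketch. The abstract does not list the actual rules, so the trigger words, field names, and the helpers `build_patterns` and `extract_fields` below are hypothetical; the sketch only assumes that a field value follows a trigger word and a colon (ASCII or full-width) on the same line of the cleaned page text.

```python
import re

# Hypothetical trigger words per field; the thesis's real rule set is not given.
TRIGGERS = {
    "address": ["Address", "Addr"],
    "phone": ["Tel", "Phone", "Telephone"],
    "email": ["E-mail", "Email"],
}


def build_patterns(triggers):
    """Compile one regex per field: trigger word, colon, then the value."""
    patterns = {}
    for field, words in triggers.items():
        # Try longer trigger words first so "Telephone" wins over "Tel".
        ordered = sorted(words, key=len, reverse=True)
        alternatives = "|".join(re.escape(word) for word in ordered)
        # \b keeps "Tel" from firing inside words such as "Hotel".
        patterns[field] = re.compile(
            rf"\b(?:{alternatives})\s*[::]\s*(.+)", re.IGNORECASE
        )
    return patterns


def extract_fields(text, patterns):
    """Return a dict with the first value found for each triggered field."""
    record = {}
    for line in text.splitlines():
        for field, pattern in patterns.items():
            match = pattern.search(line)
            if match and field not in record:
                record[field] = match.group(1).strip()
    return record


if __name__ == "__main__":
    sample = (
        "Office of Records\n"
        "Address: 12 Example Road\n"
        "Tel: 010-5550100\n"
        "Email: info@example.org"
    )
    print(extract_fields(sample, build_patterns(TRIGGERS)))
```

Rules of this kind are simple to audit and need no training data, but they only cover layouts that the trigger list anticipates, which is one plausible reason to pair them with a learned sequence model such as the LSTM-based method.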

Keywords/Search Tags:

agency website, web category, information extraction, structure, content characteristics, support vector machine

Related items

1	Research On Improved Support Vector Machine Based On Category Imbalanced Dataset
2	The Adaptive Web Information Extraction Based On Single DOM Tree Characteristics And Classification
3	Support Vector Machine Integration And Application In The Music Category
4	The Design And Implementation Of University Information Website Platform
5	Research On Structure Support Vector Machine Classification Models
6	The Application Of Multi-category Support Vector Machine In Credit Rating And Study Of Kernel Parameter Selection
7	Research And System Implementation Of Website Content Security Monitoring Based On SVW
8	The Study Of Melody Extraction In Query By Humming
9	Research On Key Technologies About Labeling The Content Of Internet Websites By Using Multi-tag
10	Application And Research Of Information Filtering Technology In Website Information Supervision