
Research On Data Extraction For Agency Website

Posted on: 2019-07-04
Degree: Master
Type: Thesis
Country: China
Candidate: J N Liu
Full Text: PDF
GTID: 2428330566998109
Subject: Computer Science and Technology
Abstract/Summary:
With the growing popularity of the Internet, the amount of information it carries increases day by day, and data mining for agency websites has become a hot research topic in Web information mining. Much research exists on web page classification, but it lacks a clear and explicit representation of the structural features of web pages and mostly uses structural features only to adjust the weights of textual features. It also rarely addresses agency websites, so extraction methods for agency website information are lacking.

Because no open data set is available, agency websites are collected and preprocessed at large scale using a distributed collection method, and useless HTML tags are removed to construct the initial data set. Based on observation, agency web pages are summarized into 3 categories and 13 subcategories, and experts label each page with its category. An effective-character-based method for locating the main information block is then used to find the position of the main information displayed on a page, and 9 types of web page structure and content features are proposed, such as the maximum proportion of effective characters among the sub-tags of the main information block, the largest difference in effective characters between information sub-tags, and the proportion of character information characters. These features are extracted through feature engineering, and the support vector machine (SVM) algorithm is used to build the classification model and classify the web pages. Finally, data is extracted from the classified web pages: the characteristics of the information to be extracted from agency websites are observed and analyzed, and two methods are proposed, information extraction based on trigger rules and information extraction based on LSTM, which yield structured extraction results.

In this paper, classification models built with three algorithms (decision tree, neural network, and support vector machine) are compared using evaluation metrics such as accuracy and recall. The experiments show that the support vector machine performs well, and the parameters of the SVM model are then optimized. Following the single-variable principle, feature comparison experiments with the optimized SVM classification model show that each feature has a positive effect on agency web page classification, and that combining web page structure features with web content features allows agency web pages to be classified more accurately. Finally, experiments on trigger-rule-based and LSTM-based information extraction are carried out, and the extraction results are statistically analyzed and compared with existing algorithms. The experiments show that both methods achieve good results in extracting official website information.
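
The abstract does not define "effective characters" or spell out how the main information block is located, so the following Python sketch is only one possible reading, not the thesis's method. It assumes that effective characters are non-whitespace text characters outside `script` and `style` tags, and that the main block can be found by repeatedly descending into the child holding more than half of the parent's effective characters. The names `Node`, `TreeBuilder`, `locate_main_block`, `block_features`, and the `threshold` parameter are all hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, and tags whose text is not page content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """One HTML element with the effective characters found directly in it."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.own_chars = 0

    def total_chars(self):
        # Effective characters in this element and all of its descendants.
        return self.own_chars + sum(c.total_chars() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a simple element tree and counts effective characters."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag not in VOID_TAGS:
            node = Node(tag, attrs, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.own_chars += sum(1 for ch in data if not ch.isspace())


def locate_main_block(root, threshold=0.5):
    """Assumed heuristic: follow the dominant child while it holds more than
    `threshold` of its parent's effective characters."""
    node = root
    while node.children:
        total = node.total_chars()
        best = max(node.children, key=lambda c: c.total_chars())
        if total == 0 or best.total_chars() / total <= threshold:
            break
        node = best
    return node


def block_features(block):
    """Two illustrative features over the direct sub-tags of the main block."""
    total = block.total_chars()
    counts = [c.total_chars() for c in block.children]
    if not counts or total == 0:
        return {"max_child_ratio": 0.0, "max_child_diff": 0}
    return {"max_child_ratio": max(counts) / total,
            "max_child_diff": max(counts) - min(counts)}


def page_features(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    block = locate_main_block(builder.root)
    return block, block_features(block)
```

Feature values computed this way could be collected into one vector per page and passed to an SVM classifier. The 9 feature types used in the thesis are not all listed in the abstract, so only two are sketched here.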
Keywords/Search Tags: agency website, web category, information extraction, structure, content characteristics, support vector machine