Deep Web Data Resource Intelligent Mining System

Posted on: 2020-06-02

Degree: Master

Type: Thesis

Country: China

Candidate: X H Wang

Full Text: PDF

GTID: 2428330578968732

Subject: Engineering

Abstract/Summary:

With the development of the Internet and the rapid advancement of information technology, the Internet has become an important way for people to obtain information. Compared with the surface web, the Deep Web contains more information of higher quality, and that information is generally structured. However, the information in the Deep Web cannot be accessed directly, so it is necessary to study how to obtain and utilize it. To utilize the information in the Deep Web efficiently, this thesis proposes a data source discovery method based on random forest and a data source classification method based on text feature expansion and extraction. Using this information efficiently requires first finding the location of the data source query interface and then classifying the data source according to its content. This thesis therefore focuses on how to automatically discover and classify Deep Web data sources. The main research contents and achievements can be summarized as follows:

(1) A method for discovering Deep Web data sources based on a random forest model is proposed. First, by analyzing the code and structure of web pages, a series of web form features is summarized. Then, based on these features, a random forest model is built to distinguish Deep Web data sources from other web page forms, which achieves the goal of discovering Deep Web data sources. Finally, experiments were performed on the UIUC TEL-8 dataset. The experimental results show that the proposed method can accurately discover data sources.

(2) To address the sparse features of Deep Web data sources, which result from their short text, a feature expansion method based on the N-gram model is proposed. Because feature expansion may introduce new noise, this thesis uses Word2Vec for noise cancellation. The experimental results show that the feature expansion method can effectively solve the classification problem for data sources with little text, and that adding the noise control mechanism further improves classification accuracy.

(3) A data source feature extraction and classification method based on an attention-based Bi-LSTM model is proposed. Bi-LSTM can capture the contextual semantic information of text, which makes it well suited to processing text data. The attention mechanism assigns larger weights to words that are more relevant to the subject of the text, making the vector representation of the text more accurate.

(4) Based on the models and algorithms proposed in this thesis, the Deep Web data resource intelligent mining system is implemented. The system can automatically discover and classify Deep Web data sources from web pages and finally establish a Deep Web data source directory.
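The attention step described in (3) can be illustrated with a small example. The abstract does not give the thesis's exact formulation, so the following is only a minimal sketch of one common form of attention pooling, under stated assumptions: the per-word vectors in `hidden_states` stand in for Bi-LSTM outputs, the scoring vector `w` stands in for a learned parameter, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, w):
    """Collapse per-word vectors into a single text vector.

    hidden_states: list of T vectors (each a list of d floats); here they
                   stand in for Bi-LSTM outputs (an assumption).
    w:             hypothetical learned scoring vector of length d.
    Returns (weights, pooled): one weight per word and the weighted sum.
    """
    # Score each word: dot product of w with tanh of its hidden vector.
    scores = [
        sum(wi * math.tanh(hi) for wi, hi in zip(w, h))
        for h in hidden_states
    ]
    # Normalize scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    # The text vector is the attention-weighted sum of the word vectors.
    dim = len(w)
    pooled = [
        sum(a * h[j] for a, h in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return weights, pooled


# Toy input: three words with 2-dimensional hidden vectors (made-up values).
hidden_states = [[0.1, 0.0], [2.0, 1.5], [0.2, -0.1]]
w = [1.0, 1.0]
weights, pooled = attention_pool(hidden_states, w)
```

In this toy input the second word receives the largest weight because its vector aligns most strongly with `w`, so it dominates the pooled text vector. In a trained model both the hidden vectors and `w` would be learned from data rather than set by hand.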

Keywords/Search Tags:

Deep Web, Data source discovery, Data source classification, Feature extension, Attention mechanism, Deep learning

Related items

1	The Relevant Technologies Research On Deep Web Source Discovery
2	Research On Discovery And Classification Based On Topic-related Data Sources
3	Research On Deep Web Source Discovery And Classification
4	Research On Data Fusion Method Based On Deep Learning
5	Research On Deep Web Sources Classification
6	Research On Deep Web Data Source Discovery And Sampling
7	Research On Deep Web Data Source Selection Method Based On Sampling
8	The Research On Technology Of Deep Web Source Discovery And Semantic Annotation
9	Research On Deep Web’s Data Source Automatically Identify And Classification
10	Research On Short Text Classification Method Based On Feature Extension