Deep Web Data Resource Intelligent Mining System

Posted on: 2020-06-02
Degree: Master
Type: Thesis
Country: China
Candidate: X H Wang
Full Text: PDF
GTID: 2428330578968732
Subject: Engineering
Abstract/Summary:
With the development of the Internet and the rapid advancement of information technology, the Internet has become an important way for people to obtain information. Compared with the surface web, the Deep Web contains more information of higher quality, and that information is generally structured. However, the information in the Deep Web cannot be accessed directly, so it is necessary to study how to obtain and utilize it. To use this information efficiently, it is first necessary to locate the query interfaces of the data sources and then to classify the sources according to their content. This thesis focuses on how to automatically discover and classify Deep Web data sources. It proposes a data source discovery method based on random forest and a data source classification method based on text feature expansion and extraction. The main research contents and achievements of the thesis can be summarized as follows:

(1) A method for discovering Deep Web data sources based on a random forest model is proposed. First, a series of web form features is derived by analyzing the code and structure of web pages. Then, based on these features, a random forest model is built to distinguish Deep Web data source forms from other web page forms, and thereby to discover Deep Web data sources. Finally, experiments were performed on the UIUC TEL-8 dataset. The experimental results show that the proposed method can accurately complete the discovery of data sources.

(2) To address the feature sparsity caused by the small amount of text in Deep Web data sources, a feature expansion method based on the N-gram model is proposed. Because feature expansion may introduce new noise, the thesis uses Word2Vec for noise reduction. The experimental results show that the feature expansion method can effectively solve the classification problem for data sources with little text, and that adding the noise control mechanism further improves classification accuracy.

(3) A data source feature extraction and classification method based on an attention-based Bi-LSTM model is proposed. Bi-LSTM can capture the contextual semantic information of the text, which makes it well suited to processing text data. The attention mechanism assigns larger weights to words that are more relevant to the subject of the text, making the vector representation of the text more accurate.

(4) Based on the models and algorithms proposed in this thesis, the Deep Web data resource intelligent mining system is implemented. The system can automatically discover and classify Deep Web data sources from web pages, and finally establish a Deep Web data source directory.
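The abstract does not list the web form features used in item (1). As a rough illustration only, the sketch below derives a few plausible per-form counts with Python's standard `html.parser`. The feature names (`method_get`, `text_inputs`, `password_inputs`, and so on) and the helper `extract_form_features` are assumptions made for this example, not the feature set of the thesis.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collects simple counts for every <form> element in a page.

    The feature set is hypothetical; the thesis abstract does not name its features.
    """

    def __init__(self):
        super().__init__()
        self.forms = []       # one feature dict per completed form
        self._current = None  # feature dict of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # HTML forms default to GET when no method is given.
            method = (attrs.get("method") or "get").lower()
            self._current = {
                "method_get": int(method == "get"),
                "text_inputs": 0,
                "password_inputs": 0,
                "select_boxes": 0,
                "checkboxes_radios": 0,
                "submit_buttons": 0,
            }
            return
        if self._current is None:
            return  # ignore controls that sit outside any form
        if tag == "input":
            # An <input> without a type attribute behaves as a text field.
            kind = (attrs.get("type") or "text").lower()
            if kind in ("text", "search"):
                self._current["text_inputs"] += 1
            elif kind == "password":
                self._current["password_inputs"] += 1
            elif kind in ("checkbox", "radio"):
                self._current["checkboxes_radios"] += 1
            elif kind in ("submit", "image"):
                self._current["submit_buttons"] += 1
        elif tag == "select":
            self._current["select_boxes"] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def extract_form_features(html):
    """Return a list of feature dicts, one per form found in the HTML string."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    if parser._current is not None:  # keep a form whose closing tag is missing
        parser.forms.append(parser._current)
    return parser.forms
```

Feature dictionaries like these could be turned into numeric vectors and passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`. A form with a password field and a POST method would typically look like a login form and not a searchable data source, which is the kind of distinction such features are meant to expose.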
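The attention step described in item (3) can be illustrated with a small pure-Python sketch. It assumes a common formulation in which each hidden state is scored against a context vector after a `tanh`, the scores are normalized with a softmax, and the text vector is the weighted sum of the hidden states. The abstract does not give the exact scoring function, so this form, the function names, and the hand-picked numbers are assumptions. In a real model the hidden states would come from the Bi-LSTM and the context vector would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Weight each hidden state by its relevance to a context vector.

    hidden_states: one vector per word (stand-ins for Bi-LSTM outputs).
    context: a vector of the same length (learned in a real model, fixed here).
    Returns (weights, pooled), where pooled is the weighted sum of the states.
    """
    # Assumed scoring function: score_i = context . tanh(h_i)
    scores = [
        sum(c * math.tanh(h) for c, h in zip(context, state))
        for state in hidden_states
    ]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, pooled


# Toy input: three "words", each with a 2-dimensional hidden state.
states = [[0.1, 0.0], [2.0, 1.5], [0.0, 0.2]]
weights, text_vector = attention_pool(states, context=[1.0, 1.0])
```

With these toy values the second word receives the largest weight, so it dominates the pooled text vector. This mirrors the idea in the abstract that words more relevant to the subject of the text contribute more to its representation.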
Keywords/Search Tags: Deep Web, Data source discovery, Data source classification, Feature extension, Attention mechanism, Deep learning