
Research And Implementation Of Topic Crawler In The Field Of Inspection And Quarantine

Posted on: 2018-12-21
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhou
Full Text: PDF
GTID: 2348330518473516
Subject: Computer technology
Abstract/Summary:
In recent years, global information data has grown rapidly. According to International Data Corporation (IDC), the annual growth rate of data will remain at around 50% in the coming years, and the global total will reach 40 ZB by 2020. Under these conditions, users' demands on the accuracy and depth of information retrieval keep increasing in this ocean of data, especially for the special needs of professional fields, since the information collected by general-purpose search engines is heterogeneous and imprecise. In view of this, this paper explores data acquisition and search technology based on the vertical search engine, and implements data acquisition and search subsystems for specific medical and disease subject areas.

The main contributions of this paper are as follows:

1. It outlines the key technologies involved in implementing a web crawler, such as web page denoising, text extraction, filtering of massive URL and document sets, and NoSQL databases. In addition, to handle the dynamic parsing and downloading of web pages, it proposes a JavaScript parsing strategy based on protocol control.

2. It discusses web page crawling strategies based on network topology, web page text, and user access behavior. After comparing their advantages and disadvantages, it proposes an optimized crawling strategy based on URL density clustering, which is adopted in the concrete implementation of the search subsystem.

3. Based on Word2vec and deep learning methods, it proposes a hierarchical long short-term memory network with an attention mechanism for the text classification task; the model extracts features of the whole text at both the word and sentence levels. After training the neural network, its classification performance is also tested and analyzed on open data sets.

4. It realizes the data acquisition subsystem and data search subsystem in the field of inspection and quarantine. The bottom layer of the whole system adopts distributed deployment to improve computing performance and system stability; the data collection, cleaning, storage, classification, and indexing services are deployed in a distributed environment consisting of multiple servers.
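The abstract does not spell out how the URL-density-clustering crawling strategy of contribution 2 works. As a hedged illustration only, the following sketch groups URLs with a DBSCAN-style density clustering over their host and path token sets, using Jaccard distance; the `eps` and `min_pts` parameters and the token-set distance measure are assumptions for illustration, not the thesis's actual algorithm.

```python
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL's host and path into a set of tokens."""
    p = urlparse(url)
    return set(p.netloc.split(".")) | {t for t in p.path.split("/") if t}


def jaccard_distance(a, b):
    """1 - Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def density_cluster(urls, eps=0.5, min_pts=2):
    """DBSCAN-style density clustering over URL token sets.

    Returns one integer label per URL; -1 marks noise points
    (URLs not dense enough to join any cluster).
    """
    toks = [url_tokens(u) for u in urls]
    n = len(urls)
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(n) if jaccard_distance(toks[i], toks[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1  # not a core point: mark as noise for now
            continue
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neigh_j = [k for k in range(n) if jaccard_distance(toks[j], toks[k]) <= eps]
            if len(neigh_j) >= min_pts:  # j is also a core point: expand
                queue.extend(k for k in neigh_j if labels[k] is None)
        cluster += 1
    return labels
```

A crawler could prioritize URLs whose cluster already contains many on-topic pages and deprioritize noise points.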
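The attention mechanism of contribution 3 can be illustrated in isolation from the LSTM: each hidden state is scored against a context vector, the scores are normalized with a softmax, and the states are summed with those weights. In the sketch below the `context` vector is fixed, standing in for what would be a learned parameter; this illustrates the general attention-pooling idea, not the thesis's actual hierarchical model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Attention-weighted pooling of a sequence of hidden-state vectors.

    Scores each state by its dot product with a context vector,
    softmaxes the scores, and returns the weighted sum of the states
    together with the attention weights.
    """
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights
```

In a hierarchical setup this pooling would run twice: once over word-level states to form sentence vectors, then over sentence-level states to form the document vector fed to the classifier.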
Keywords/Search Tags:web crawler, data retrieval, deep learning, text classification, word embeddings