
Research And Implementation Of Topic Crawler In The Field Of Inspection And Quarantine

Posted on: 2018-12-21
Degree: Master
Type: Thesis
Country: China
Candidate: H Zhou
Full Text: PDF
GTID: 2348330518473516
Subject: Computer technology
Abstract/Summary:
In recent years, global information data has grown rapidly. According to International Data Corporation (IDC), the annual growth rate of data will remain at around 50% in the coming years, and the global total will reach 40 ZB by 2020. Under these conditions, users' demands on the accuracy and depth of information retrieval keep increasing in this ocean of data, especially for the special needs of professional fields, since the information collected by general-purpose search engines is heterogeneous and imprecise. In view of this, this paper explores data acquisition and search technology based on the vertical search engine, and implements data acquisition and search subsystems for specific medical and disease subject areas.

The main contributions of this paper are as follows:

1. It outlines the key technologies involved in implementing a web crawler, such as web page denoising, text extraction, filtering of massive URL and document sets, and NoSQL databases. In addition, to handle the dynamic parsing and downloading of web pages, it proposes a JavaScript parsing strategy based on protocol control.

2. It discusses web page crawling strategies based on network topology, web page text, and user access behavior. After comparing their advantages and disadvantages, it proposes an optimized crawling strategy based on URL density clustering, which is adopted in the concrete implementation of the search subsystem.

3. Based on Word2vec and deep learning methods, it proposes a hierarchical long short-term memory network with an attention mechanism for the text classification task; the model extracts features of the whole text at both the word and sentence levels. After training the neural network, its classification performance is also tested and analyzed on open data sets.

4. It realizes the data acquisition subsystem and data search subsystem in the field of inspection and quarantine. The bottom layer of the whole system adopts distributed deployment to improve computing performance and system stability; the data collection, cleaning, storage, classification, and indexing services are deployed in a distributed environment consisting of multiple servers.
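The abstract does not spell out how the URL-density-clustering crawling strategy of contribution 2 works. As a hedged illustration only, the following sketch groups URLs with a DBSCAN-style density clustering over their host and path token sets, using Jaccard distance; the `eps` and `min_pts` parameters and the token-set distance measure are assumptions for illustration, not the thesis's actual algorithm.

```python
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL's host and path into a set of tokens."""
    p = urlparse(url)
    return set(p.netloc.split(".")) | {t for t in p.path.split("/") if t}


def jaccard_distance(a, b):
    """1 - Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def density_cluster(urls, eps=0.5, min_pts=2):
    """DBSCAN-style density clustering over URL token sets.

    Returns one integer label per URL; -1 marks noise points
    (URLs not dense enough to join any cluster).
    """
    toks = [url_tokens(u) for u in urls]
    n = len(urls)
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(n) if jaccard_distance(toks[i], toks[j]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1  # not a core point: mark as noise for now
            continue
        labels[i] = cluster
        queue = [j for j in neigh if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neigh_j = [k for k in range(n) if jaccard_distance(toks[j], toks[k]) <= eps]
            if len(neigh_j) >= min_pts:  # j is also a core point: expand
                queue.extend(k for k in neigh_j if labels[k] is None)
        cluster += 1
    return labels
```

A crawler could prioritize URLs whose cluster already contains many on-topic pages and deprioritize noise points.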
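The attention mechanism of contribution 3 can be illustrated in isolation from the LSTM: each hidden state is scored against a context vector, the scores are normalized with a softmax, and the states are summed with those weights. In the sketch below the `context` vector is fixed, standing in for what would be a learned parameter; this illustrates the general attention-pooling idea, not the thesis's actual hierarchical model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, context):
    """Attention-weighted pooling of a sequence of hidden-state vectors.

    Scores each state by its dot product with a context vector,
    softmaxes the scores, and returns the weighted sum of the states
    together with the attention weights.
    """
    scores = [sum(h_i * c_i for h_i, c_i in zip(h, context)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]
    return pooled, weights
```

In a hierarchical setup this pooling would run twice: once over word-level states to form sentence vectors, then over sentence-level states to form the document vector fed to the classifier.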
Keywords/Search Tags:web crawler, data retrieval, deep learning, text classification, word embeddings