Identification And Information Extraction Of Scholars' Homepages

Posted on:2021-03-23

Degree:Master

Type:Thesis

Country:China

Candidate:Q Y Zhang

Full Text:PDF

GTID:2518306503472624

Subject:Electronics and Communications Engineering

Abstract/Summary:


With the development of science and technology, the number of scholars has grown rapidly. Because scholars communicate frequently and their personal information is used in many fields, obtaining that information accurately and quickly has become very important. Acemap, an academic big-data platform and visual map-style academic search system, currently has no scholar information data. Based on Acemap's actual situation, this thesis studies how to obtain scholars' information from the Internet automatically by computational methods. The process is divided into three parts: network data collection, scholar homepage identification, and scholar homepage information extraction. Each part is studied and implemented separately. The main contributions and innovations are as follows.

First, network data collection. This thesis builds a high-performance web crawler that can easily collect data from different websites. In addition, a variety of anti-anti-crawling measures are adopted to ensure the robustness of the crawler.

Second, scholar homepage identification. The identification of scholars' homepages is treated as a binary classification problem. Features are extracted from the titles, links, and abstracts of each scholar's Google search results obtained by the crawler, and XGBoost is used for learning and prediction. The web page with the highest predicted probability is taken as the scholar's personal homepage. Experimental results show that the method identifies 95.83% of the scholars' homepages on a self-labeled dataset.

Third, scholar homepage information extraction. This thesis casts the task as a sequence labeling problem. Starting from the widely used BiLSTM-CRF model, it analyzes a shortcoming of several existing vector representation methods, namely that they cannot handle "polysemy", and proposes a BERT-BiLSTM-CRF model. The model uses BERT's deep bidirectional Transformer structure to obtain the vector representation of the input text, so that the generated vectors carry context information. Experimental results show that the proposed BERT-BiLSTM-CRF model achieves better labeling results. On this basis, for labels with obvious characteristics such as email, phone, and fax, the labeling results are corrected using regular expressions and some simple rules, which further improves the labeling quality for these three labels.

Fourth, based on these three parts, this thesis summarizes the overall process of automatically obtaining scholars' information from the Internet. The personal information of a group of scholars was obtained and added to the Acemap database.
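
The rule-based correction of email, phone, and fax labels mentioned in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's actual rules: the flat label names (`EMAIL`, `PHONE`, `FAX`, `O`), the regular expressions, the cue words, and the function name `correct_labels` are all hypothetical, since the abstract does not specify them.

```python
import re

# Hypothetical patterns; the abstract does not give the actual expressions.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
NUMBER_RE = re.compile(r"^\+?[\d()\-.\s]{7,20}$")

# Hypothetical cue words that distinguish a fax number from a phone number.
FAX_CUES = {"fax"}
PHONE_CUES = {"tel", "phone", "telephone", "mobile"}


def correct_labels(tokens, labels, window=3):
    """Return a copy of `labels` with EMAIL/PHONE/FAX fixed by simple rules.

    `labels` stands in for the per-token output of a sequence labeler.
    Tokens that match no rule keep the label the model assigned.
    """
    fixed = list(labels)
    for i, tok in enumerate(tokens):
        if EMAIL_RE.match(tok):
            fixed[i] = "EMAIL"
        elif NUMBER_RE.match(tok) and sum(c.isdigit() for c in tok) >= 7:
            # Look back up to `window` tokens for the nearest cue word.
            for j in range(i - 1, max(-1, i - 1 - window), -1):
                cue = tokens[j].lower().strip(":.")
                if cue in FAX_CUES:
                    fixed[i] = "FAX"
                    break
                if cue in PHONE_CUES:
                    fixed[i] = "PHONE"
                    break
    return fixed


tokens = ["Email", ":", "someone@example.edu",
          "Tel", ":", "+86-21-5555-0101",
          "Fax", ":", "+86-21-5555-0102"]
# Pretend the model missed the email and confused fax with phone.
model_labels = ["O", "O", "O", "O", "O", "PHONE", "O", "O", "PHONE"]
print(correct_labels(tokens, model_labels))
```

In this sketch the rules only overwrite a label when a pattern matches, so every other field is left to the sequence labeling model.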

Keywords/Search Tags:

Web Crawler, Homepage Identification, Information Extraction, Sequence Labeling


Related items

1	Research On Text Causality Extraction Based On Deep Learning And Sequence Labeling
2	Research On Object Extraction Of Automobile Product Based On Sequence Labeling
3	Research On Event Extraction Algorithm Based On Sequence Labeling Model
4	Research On The Ontology-based Information Extraction For Personal Homepage
5	Research On Emotion-cause Pair Extraction Based On Sequence Labeling And Transition Methods
6	Research And Implementation Of Information Extraction System For Merger And Acquisition Announcement
7	Literature Information Extraction System From Academic Homepage
8	The Method Of Extracting Complex Indicators From Long Text
9	Research And Implementation Of Entity Relation Extraction Algorithm In News Field Based On Distant Supervision And Sequence Labeling
10	Research On The Identification Approach Of Opinion Element Orientation For Chinese Comparative Sentences