
Identification And Information Extraction Of Scholars' Homepages

Posted on: 2021-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Zhang
Full Text: PDF
GTID: 2518306503472624
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the development of science and technology, the number of scholars has increased rapidly. Because scholars communicate frequently and their personal information is used in many fields, obtaining scholars' information accurately and quickly has become very important. Acemap, an academic big data platform and visual map-type academic search system, currently has no scholar information data. Based on Acemap's actual situation, this thesis studies computational methods for automatically obtaining scholars' information from the Internet. The process is divided into three parts: network data collection, scholar homepage identification, and scholar homepage information extraction. Each part is studied and implemented separately. The main contributions and innovations of this thesis are as follows.

First, network data collection. This thesis builds a high-performance web crawler that can easily collect data from different websites. In addition, a variety of measures against anti-crawling mechanisms are adopted to ensure the robustness of the crawler.

Second, scholar homepage identification. The identification of scholars' homepages is treated as a binary classification problem. Features are extracted from the titles, links, and abstracts of the Google search results that the crawler obtains for each scholar, and XGBoost is used for learning and prediction. The web page with the highest predicted probability is taken as the scholar's personal homepage. Experimental results show that the method identifies 95.83% of the scholars' homepages on a self-labeled dataset.

Third, scholar homepage information extraction. This thesis formulates the task as a sequence labeling problem. Starting from the widely used BiLSTM-CRF model, it analyzes a shortcoming of several existing vector representation methods, namely that they cannot handle polysemy, and proposes a BERT-BiLSTM-CRF model. This model uses BERT's deep bidirectional Transformer structure to obtain the vector representation of the input text, so that the generated vectors contain context information. Experimental results show that the proposed BERT-BiLSTM-CRF model achieves better labeling results. On this basis, for labels with obvious characteristics such as email, phone, and fax, the labeling results are corrected using regular expressions and some simple rules, which further improves the labeling performance for these three labels.

Fourth, based on these three parts, this thesis summarizes the overall process of automatically obtaining scholars' information from the Internet. The personal information of a group of scholars was obtained and added to the Acemap database.
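The rule-based correction of email, phone, and fax labels mentioned in the abstract can be pictured with a minimal Python sketch. The abstract does not give the actual expressions or rules, so everything here is an assumption made for illustration: the patterns `EMAIL_RE` and `PHONE_RE`, the flat label names `EMAIL`, `PHONE`, `FAX`, and `O`, the one-label-per-token format, and the rule that a number following the cue word "Fax" is a fax number.

```python
import re

# Hypothetical patterns: the thesis does not publish its regular expressions.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")

# Hypothetical label set: only these labels are touched by the rules.
RULE_LABELS = {"EMAIL", "PHONE", "FAX"}


def correct_labels(tokens, labels):
    """Return a copy of `labels` with EMAIL/PHONE/FAX overridden by simple rules.

    `tokens` and `labels` are parallel lists, one model-predicted label per token.
    """
    fixed = list(labels)
    for i, tok in enumerate(tokens):
        if EMAIL_RE.match(tok):
            # Anything shaped like an address is labeled EMAIL.
            fixed[i] = "EMAIL"
        elif PHONE_RE.match(tok) and sum(c.isdigit() for c in tok) >= 7:
            # A number-like token is FAX if preceded by the cue "Fax", else PHONE.
            prev = tokens[i - 1].lower().rstrip(":") if i > 0 else ""
            fixed[i] = "FAX" if prev == "fax" else "PHONE"
        elif fixed[i] in RULE_LABELS:
            # The model predicted a contact label, but the token does not fit
            # any pattern, so the prediction is dropped.
            fixed[i] = "O"
    return fixed


tokens = ["Email:", "jane@example.edu", "Fax:", "+1-555-010-9999",
          "Tel:", "555-010-1234", "Office", "2021"]
predicted = ["O", "O", "O", "PHONE", "O", "O", "O", "PHONE"]
print(correct_labels(tokens, predicted))
```

In this sketch the rules only override the three contact labels and leave every other predicted label unchanged, which matches the abstract's statement that the correction targets email, phone, and fax.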
Keywords/Search Tags:Web Crawler, Homepage Identification, Information Extraction, Sequence Labeling