
Research On Key Techniques Of Distributed Acquisition And Processing Of Scientific Research Personnel Information

Posted on: 2019-01-07, Degree: Master, Type: Thesis
Country: China, Candidate: L Liu, Full Text: PDF
GTID: 2428330548976365, Subject: Computer Science and Technology
Abstract/Summary:
In recent years, the transformation of scientific and technological achievements has become an industry that the country focuses on developing and supporting, and enterprises have a strong demand for it. Building a search engine for scientific and technological talent that meets the actual needs of enterprises is therefore of great practical significance. Ensuring that the underlying information is comprehensive, complete, and accurate is an important prerequisite for an efficient talent search engine, and it is the key issue of this study. For massive volumes of scientific and technological personnel information, the traditional single-machine or multi-threaded crawler architecture collects data inefficiently and struggles to meet the requirements of large-scale, web-wide data collection. In addition, because the data are heterogeneous, the collected personnel data often contain noise, such as ambiguity among scientific and technological personnel who share the same name, so the accuracy of the data cannot be guaranteed. To solve these problems, this paper builds on related research into improving web crawler collection efficiency and same-name disambiguation, and proposes a Hadoop-based distributed data collection platform to improve the collection efficiency of massive scientific and technological talent information, together with a same-name disambiguation method based on a multi-strategy combination model to resolve name ambiguity among scientific and technological personnel. The main research work of this paper is as follows:

(1) A Hadoop-based distributed acquisition platform for scientific and technological personnel information is designed and implemented. The platform is designed from four aspects: physical architecture, logical architecture, workflow, and functional modules, and it is implemented on the Hadoop platform. Through this platform, a large amount of information related to scientific and technological personnel was collected, including academic papers, patents, scientific research projects, and personal information.

(2) The raw scientific and technological talent data are preprocessed. The collected academic paper, patent, and scientific research project data are standardized, and information is extracted from the unstructured personal profiles. The processed data are then integrated and stored in the database, which makes the personnel information more complete.

(3) A same-name disambiguation method based on a multi-strategy combination model is proposed. To address same-name ambiguity among scientific and technological personnel, the method analyzes four strategies (entity linking, achievement time window, achievement coauthors, and achievement similarity), determines a judgment rule for each, and proposes a model that combines them. When calculating achievement similarity, the Word2vec word vector model is used to represent the text, and the accuracy of the representation is improved by constructing a corpus for the science and technology domain.

Finally, the distributed acquisition and processing technology and the multi-strategy same-name disambiguation method are applied to a search and recommendation platform for scientific and technological talent to verify the validity and feasibility of the above research, providing effective data support for achievement transformation.
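
The multi-strategy combination described in item (3) can be illustrated with a minimal Python sketch. The abstract does not state the actual judgment rules or how they are combined, so everything here is an assumption made for illustration: the `Record` fields, the use of affiliation as a stand-in for entity linking, the five-year window, the 0.8 similarity threshold, and the simple majority-style vote. The `vector` field stands for any text embedding, for example an average of Word2vec word vectors.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Record:
    """One achievement (paper, patent, project) attributed to an author name.

    All fields are hypothetical; the thesis abstract does not define a schema.
    """
    name: str
    affiliation: str       # assumed proxy for the linked entity
    year: int
    coauthors: frozenset
    vector: tuple          # text embedding, e.g. averaged word vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def same_person(a, b, window=5, sim_threshold=0.8, min_votes=2):
    """Vote over four strategies; thresholds and voting rule are assumptions."""
    if a.name != b.name:
        return False
    votes = [
        # Strategy 1: entity link (here: identical, non-empty affiliation)
        bool(a.affiliation) and a.affiliation == b.affiliation,
        # Strategy 2: achievements fall within the same time window
        abs(a.year - b.year) <= window,
        # Strategy 3: the two achievements share at least one coauthor
        bool(a.coauthors & b.coauthors),
        # Strategy 4: the achievement texts are similar enough
        cosine(a.vector, b.vector) >= sim_threshold,
    ]
    return sum(votes) >= min_votes
```

Calling `same_person(r1, r2)` on two records with the same name returns `True` when at least `min_votes` of the four strategies agree. A real system would need tuned thresholds and possibly weighted or ordered rules rather than an unweighted vote.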
Keywords/Search Tags: achievement transformation, distributed acquisition, multi-strategy combination, disambiguation with the same name