
Research And Application Of Event Knowledge Graph Parallelization

Posted on: 2020-12-08    Degree: Master    Type: Thesis
Country: China    Candidate: Y M Luo    Full Text: PDF
GTID: 2428330596976793    Subject: Engineering
Abstract/Summary:
As the foundation of the human cognitive world, events have attracted the attention of more and more researchers. The development of the knowledge graph provides a carrier for computers to formally describe real-world things, and the construction of event-based knowledge graphs has become one of the focuses of researchers' attention. While the development of the Internet has changed the way humans live, it has also brought huge data scales, which poses great performance challenges to the construction of event knowledge graphs that take the Internet as their data source. To improve the efficiency of event knowledge graph construction, this paper analyzes key technologies that affect the performance of each stage of graph construction and proposes a parallelization solution based on Spark. The specific research includes the following aspects:

1. In the text feature extraction phase, research on feature extraction from large-scale text data is carried out from two aspects. (1) To improve the speed of word segmentation on large-scale Chinese text data, a parallel Chinese word segmentation method based on Spark is proposed. (2) Taking the parallel Chinese word segmentation results as input, and building on the Word2Vec training implementation in the Spark MLlib library, an optimization scheme for improving Word2Vec training performance, LB-Word2Vec, is studied and proposed (minimal sketches of these two steps are given after this abstract). A series of comparative experiments shows that both studies achieve good results. In a cluster with 6 computing nodes, parallel Chinese word segmentation is about 3 times faster than single-machine word segmentation. With accuracy essentially unchanged, LB-Word2Vec trains the word vector model nearly 3 times faster than the unoptimized parallel Word2Vec and nearly 5 times faster than single-machine Word2Vec.

2. In the text filtering phase, the text filtering algorithm with time complexity O(n²) becomes a performance bottleneck as the data scale grows. To improve its speed, a parallel text filtering algorithm is implemented on Spark and the performance of the implementation is optimized (see the filtering sketch below). A series of comparative experiments shows that, although the parallel text filtering algorithm does not reduce the time complexity of the algorithm, its performance is superior: in a cluster with 2 to 6 computing nodes, it is 2 to 5 times faster than the single-machine method.

3. In the event extraction phase, to address the challenge that large-scale data poses to the efficiency of event extraction, the existing event extraction algorithm is improved on the basis of TensorFlowOnSpark and a customized parallel event extraction platform is implemented. The experimental results show that, although the speed of parallel event extraction in the model training stage is not obviously improved due to the limitation of the data input mode, the accuracy of the model is basically the same as that of the single-machine version. In the event extraction stage based on the trained model, with the proposed data distribution mechanism, extraction in a cluster with two computing nodes is about 2 times faster than on a single machine.

4. Based on the Play2 framework, a parallel computing platform is designed and implemented. The platform provides functions for visually submitting and managing Spark jobs and for offering parallel computing services to the outside, avoiding the complexity of submitting jobs from the command line and facilitating both the management of jobs on the Web side and access by the external environment to the parallel computing service.
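To make the parallel segmentation idea concrete, the following is a minimal sketch, not the thesis's implementation, of Spark-based parallel Chinese word segmentation. It assumes PySpark, the jieba segmenter installed on every worker, a corpus file with one document per line, and placeholder input/output paths.

```python
# Minimal sketch: parallel Chinese word segmentation on Spark.
# Assumptions: PySpark is available, jieba is installed on every worker,
# and "corpus.txt" / "segmented_output" are placeholder paths.
from pyspark.sql import SparkSession
import jieba

spark = SparkSession.builder.appName("ParallelSegmentation").getOrCreate()

def segment_partition(lines):
    # jieba initializes its dictionary lazily in each worker process,
    # so the cost is amortized over all documents handled by that worker.
    for line in lines:
        yield list(jieba.cut(line.strip()))

corpus = spark.sparkContext.textFile("corpus.txt")    # one document per line
segmented = corpus.mapPartitions(segment_partition)   # segmentation runs in parallel
segmented.saveAsTextFile("segmented_output")          # token lists written per partition
```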
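The segmented documents can then be fed to distributed Word2Vec training. The sketch below shows only the baseline Spark MLlib training path that LB-Word2Vec builds on; the load-balancing optimization itself is not reproduced here, and the column names, parameters, and toy data are illustrative assumptions.

```python
# Minimal sketch: baseline Word2Vec training with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("Word2VecTraining").getOrCreate()

# Each row holds one segmented document as a list of tokens (toy data).
docs = spark.createDataFrame(
    [(["知识", "图谱", "构建"],), (["事件", "抽取", "并行", "化"],)],
    ["tokens"],
)

w2v = Word2Vec(vectorSize=100, minCount=1, numPartitions=4,
               inputCol="tokens", outputCol="vector")
model = w2v.fit(docs)                  # distributed skip-gram training
model.findSynonyms("事件", 2).show()   # nearest neighbours of a query word
```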
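For the text filtering phase, the abstract does not specify the similarity measure, so the sketch below assumes Jaccard similarity over token sets and a threshold of 0.7; it only illustrates how the O(n²) pairwise comparison can be distributed across a Spark cluster rather than the thesis's exact filtering algorithm.

```python
# Minimal sketch: O(n^2) pairwise text filtering distributed with Spark.
# The Jaccard measure, the 0.7 threshold, and the toy documents are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParallelTextFiltering").getOrCreate()
sc = spark.sparkContext

def jaccard(a, b):
    # Similarity of two documents represented as token sets.
    return len(a & b) / len(a | b) if a | b else 0.0

docs = sc.parallelize([
    (0, {"事件", "知识", "图谱"}),
    (1, {"事件", "知识", "图谱", "构建"}),
    (2, {"并行", "计算", "平台"}),
])

# Compare every ordered pair (i < j); keep pairs judged near-duplicates.
pairs = docs.cartesian(docs).filter(lambda p: p[0][0] < p[1][0])
duplicates = pairs.filter(lambda p: jaccard(p[0][1], p[1][1]) > 0.7)
print(duplicates.collect())   # e.g. the (0, 1) pair with similarity 0.75
```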
Keywords/Search Tags:Event Knowledge Graph, Spark Framework, Parallelization, Data Parallelism, Performance Optimization