
Research On The Parallelization Technology Of Knowledge Graph Construction

Posted on: 2021-02-24  Degree: Master  Type: Thesis
Country: China  Candidate: F L Wang  Full Text: PDF
GTID: 2428330620464059  Subject: Engineering
Abstract/Summary:
With the continuous development of the Internet and of various industries, data volumes have grown explosively. In a big data environment, quickly finding valuable information in massive data and efficiently extracting knowledge from it to form a graph is an urgent problem. This thesis analyzes the entire construction process of event-based knowledge graphs and of knowledge graphs based on chapter understanding. A variety of parallel data processing methods are designed in conjunction with existing technologies and applied to each stage of the knowledge graph construction process. The main work of this thesis is as follows:
(1) In the data collection stage, a distributed data collection architecture based on the master-slave model is designed and implemented so that data can be collected quickly and given preliminary processing. The nodes in the architecture use a message queue as message middleware for communication and data transfer. With this architecture, nodes can be configured flexibly and data can be collected efficiently.
(2) In the knowledge extraction stage, entities and relationships are extracted from the collected chapter/event data. To cope with the multiple algorithms and the data sets of different sizes involved in extraction, three data-parallel processing methods are designed based on Spark and a message queue. Experiments show that, in the experimental environment of this thesis, choosing the parallelization method to suit the algorithm scenario improves the efficiency of knowledge extraction by about 13 times compared with single-node processing.
(3) In the knowledge representation stage, a representation learning method that maps the knowledge graph to a vector space is adopted to address the shortcomings of traditional knowledge representation methods. Existing distributed deep learning frameworks are analyzed and applied to representation learning and to deep learning model training. Experiments show that, with the experimental environment and parallelization method of this thesis, representation learning efficiency improves by about 5 times compared with single-node processing.
(4) In the knowledge processing stage, the parallelization of algorithms related to co-occurrence relation discovery is analyzed. To efficiently find the relationships between entities and the associations between chapters/events, a parallel association network construction method and a parallel cluster fusion method are designed. In addition, a more efficient method is designed for the association network construction algorithm based on text entity similarity calculation, and the cluster fusion algorithm is optimized to reduce the amount of computation. Experiments show that, compared with single-node processing, the cluster environment of this thesis improves the efficiency of association network construction by about 9 times and that of the hierarchical clustering fusion algorithm by about 4 times.
(5) To make cluster management and the parallel algorithms easier to use, a knowledge graph parallel algorithm management platform is designed on top of a Web framework. The platform provides a user interface for launching parallel algorithms and visually monitors the cluster status in real time.
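The abstract does not describe how the parallel association network in item (4) is built, so the following is only a minimal sketch of one common way to parallelize co-occurrence counting: partition the documents, count entity pairs within each partition, then merge the partial counts. The function names, the toy entity lists, and the use of a local thread pool in place of a Spark cluster are all assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_pairs(docs):
    """Map step: count entity pairs that co-occur in each document of one partition."""
    counts = Counter()
    for entities in docs:
        # Deduplicate and sort so (a, b) and (b, a) map to the same edge.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def build_cooccurrence_network(docs, n_partitions=2):
    """Partition the documents, count pairs per partition, then merge the counts."""
    # Round-robin partitioning; a cluster framework would distribute these instead.
    partitions = [docs[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(count_pairs, partitions))
    total = Counter()
    for partial in partials:  # Reduce step: sum the per-partition edge weights.
        total.update(partial)
    return total


# Hypothetical input: each inner list holds the entities extracted from one document.
docs = [
    ["A", "B", "C"],
    ["A", "B"],
    ["C", "A"],
]
network = build_cooccurrence_network(docs)
```

Each key of `network` is an undirected edge between two entities and each value is the number of documents in which both appear. Because the partial counts are merged by simple addition, the result does not depend on how the documents are partitioned, which is what makes this step easy to distribute.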
Keywords/Search Tags: parallelization, knowledge graph, data collection