
The Intelligent Storage And Mining Of Big Scholarly Data Based On Distributed Architecture

Posted on: 2019-01-01
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Luo
Full Text: PDF
GTID: 2417330590967384
Subject: Computer Science and Technology
Abstract/Summary:
Scientific research strategically supports a society's productive forces and a country's comprehensive national strength. Worldwide, millions of scholarly publications are produced every year in computer science, basic science, medicine, economics and sociology. At the same time, the rapid development and popularization of the Internet has made disseminating and sharing this literature very easy, ushering in the era of big scholarly data. Faced with such vast academic information resources, storing and mining them intelligently is an important task. It mainly involves three areas of computer science: database systems, distributed computing and machine learning. This work takes the academic search system Ace Map (also called Paper Book) as its research object and stores academic entities and their logical relationships in purpose-designed relational tables. It then proposes two optimization approaches to tackle the bottleneck in SQL query performance, one for a traditional relational database environment and one for a distributed architecture. Finally, it explores applications of a distributed machine learning framework in Ace Map. The main contributions of this dissertation are:
· Used the window functions mechanism (partitioning, ordering, framing) to optimize the many analytical SQL queries in the Ace Map system. The experimental results show that this optimization improves system performance to a certain extent, reducing query execution time by up to 18.6 percent.
· Completed the synchronous migration of part of the large academic data set to the Hadoop Distributed File System and applied the SQL-on-Hadoop framework Spark SQL to run complex queries. The parameters of the Spark cluster (those concerning Spark executors) were tuned to the data set volume and the architecture of the distributed cluster. The experimental results show that this optimization greatly improves system performance, reducing query execution time by up to 93.9 percent.
· Applied the distributed machine learning framework Spark MLlib to mine academic topics, which expands and improves the knowledge discovery capability of the Ace Map system.
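To make the window functions contribution concrete, the following is a minimal sketch rather than the thesis's actual queries. The `paper` table, its columns and its rows are hypothetical, since the abstract does not give the Ace Map schema, and SQLite (version 3.25 or later, as bundled with recent Python releases) stands in for the production database. It computes a per-author running citation total twice: once with a window function that uses partitioning, ordering and framing, and once with the correlated subquery that such a rewrite would typically replace.

```python
import sqlite3

# Hypothetical rows: (paper_id, author_id, year, citations).
# The real Ace Map schema is not described in the abstract.
ROWS = [
    (1, "a1", 2015, 10),
    (2, "a1", 2016, 30),
    (3, "a1", 2017, 5),
    (4, "a2", 2016, 7),
    (5, "a2", 2018, 12),
]

# Window-function form: one pass over each partition.
WINDOW_QUERY = """
SELECT author_id, year, citations,
       SUM(citations) OVER (
           PARTITION BY author_id   -- partitioning: one window per author
           ORDER BY year            -- ordering: chronological inside the window
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- framing
       ) AS cumulative_citations
FROM paper
ORDER BY author_id, year
"""

# Baseline form: a correlated subquery re-scans the table for every row.
SUBQUERY_BASELINE = """
SELECT p.author_id, p.year, p.citations,
       (SELECT SUM(q.citations)
        FROM paper q
        WHERE q.author_id = p.author_id AND q.year <= p.year
       ) AS cumulative_citations
FROM paper p
ORDER BY p.author_id, p.year
"""


def run(query):
    """Load the sample rows into an in-memory database and run one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE paper ("
            "paper_id INTEGER PRIMARY KEY, author_id TEXT, "
            "year INTEGER, citations INTEGER)"
        )
        conn.executemany("INSERT INTO paper VALUES (?, ?, ?, ?)", ROWS)
        return conn.execute(query).fetchall()
    finally:
        conn.close()


window_result = run(WINDOW_QUERY)
baseline_result = run(SUBQUERY_BASELINE)

for row in window_result:
    print(row)
```

Both queries return the same rows here. Any speedup from the window form depends on the database engine, indexes and data volume, so the 18.6 percent figure reported above should not be expected from this toy data.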
Keywords/Search Tags: Relational Database, SQL, Window Functions, SQL-on-Hadoop, SparkSQL, Machine Learning