A High-Performance Chinese Distributed Computing System (CH-Spark)

Posted on: 2018-06-08 | Degree: Master | Type: Thesis
Country: China | Candidate: J J Xu | Full Text: PDF
GTID: 2428330590477755 | Subject: Information and Communication Engineering
Abstract/Summary:
As data accumulates, processing it becomes increasingly complex, and distributed computing systems are needed to handle the load. Traditional systems provide a solution, but their performance is unsatisfactory and their range of applications is limited. This thesis proposes an optimized system based on Spark. It improves performance in heterogeneous cluster environments and extends Spark with distributed Chinese NLP capabilities.

On the performance side, Spark's scheduling strategy assumes that the cluster is homogeneous. As hardware in a cluster is upgraded over time, however, the cluster becomes more and more heterogeneous, and the system can no longer ignore this. The experiments show that the original scheduling strategy cannot meet the performance requirements. The thesis proposes a new strategy that draws on the idea of hierarchical scheduling. It combines task complexity, worker performance, and CPU usage as scheduling factors to improve performance.

On the Chinese NLP side, Spark is not designed to handle these tasks. Implementing them on top of it requires a large amount of repetitive, inefficient coding work, and the resulting code may be incompatible with the RDD structure. The new system proposes a highly flexible architecture with user-friendly interfaces for distributed Chinese NLP tasks. It supports the major machine learning models and algorithms, which are optimized from the bottom layer up.

The new system, named CH-Spark, applies the hierarchical cluster idea to the scheduling problem caused by the homogeneous-cluster assumption, and provides an effective and flexible RDD-based architecture for Chinese NLP tasks. The experiments show that the new system outperforms Spark.
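The abstract names three scheduling factors (task complexity, worker performance, and CPU usage) but does not give the formula that combines them. The following is only a minimal sketch of how such factors might be combined, not the CH-Spark algorithm itself. The `Worker` class, the `capacity` formula, and the greedy `assign` function are all hypothetical, and the hierarchical grouping of workers mentioned in the abstract is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    perf: float       # hypothetical relative speed score; higher means faster hardware
    cpu_usage: float  # current CPU utilization, between 0.0 and 1.0

    def capacity(self) -> float:
        # Assumed model: the usable speed is the idle share of the worker's speed.
        return self.perf * (1.0 - self.cpu_usage)


def assign(tasks: dict, workers: list) -> dict:
    """Greedily map each task to the worker with the earliest estimated finish time.

    `tasks` maps a task name to a hypothetical complexity estimate
    (for example, input size). The heaviest tasks are placed first.
    """
    load = {w.name: 0.0 for w in workers}  # estimated busy time per worker
    plan = {}
    for task, cost in sorted(tasks.items(), key=lambda kv: kv[1], reverse=True):
        def finish_time(w):
            # Guard against a fully loaded worker (capacity of zero).
            return load[w.name] + cost / max(w.capacity(), 1e-9)

        best = min(workers, key=finish_time)
        load[best.name] = finish_time(best)
        plan[task] = best.name
    return plan


if __name__ == "__main__":
    workers = [Worker("fast", perf=2.0, cpu_usage=0.0),
               Worker("slow", perf=1.0, cpu_usage=0.5)]
    print(assign({"t1": 8, "t2": 4, "t3": 2}, workers))
```

In this toy setup a scheduler that treated both workers as identical would split the work evenly, while the sketch sends the two heavier tasks to the faster, idle worker and only the lightest task to the slower, busier one.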
Keywords/Search Tags: Spark, Homogenized cluster, Scheduling strategy, NLP, Machine Learning