
The Design And Implementation Of Distributed Feature Processing Tools Library Based On Spark

Posted on: 2020-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Pu
Full Text: PDF
GTID: 2518305735985479
Subject: Master of Engineering
Abstract/Summary:
With the development of the internet economy, online services generate huge amounts of business data. Storing, transmitting, and using these data are difficult in the traditional stand-alone mode, chiefly because the processing speed of a single machine cannot be scaled up much further. Data mining tasks in particular, both feature processing and model training, are constrained by single-machine hardware. To cope with these problems, distributed storage and distributed computation frameworks are widely used; distributed computation solves the problem of expanding computing power. However, classic machine learning and feature processing algorithms were essentially designed for a single-machine environment and do not consider distributed resources, so they cannot be applied directly to take advantage of them. This thesis therefore expounds the redesign, implementation details, and applications of four types of feature processing algorithms in a distributed environment.

The thesis chooses Spark, currently the most popular distributed computation framework and one widely used in enterprises, as the basis for implementing the parallel algorithms. The distributed implementations are integrated into a distributed feature processing tool library that helps users process data in a distributed environment.

To address the difficulties encountered in data mining tasks, several classic processing algorithms are redesigned and implemented for the distributed environment. The problems to be solved fall into four categories: imbalanced data distribution, discretization of continuous features, correlation extraction of discrete features, and encoding of high-cardinality categorical features. Accordingly, the project consists of four modules (illustrative sketches of each follow this list):

(1) Oversampling module: designs and completes distributed implementations of the SMOTE algorithm and its improved Borderline-SMOTE variant, helping users to solve the problem of data imbalance.

(2) Feature correlation extraction module: implements three different distributed training algorithms for the factorization machine, helping users to extract cross-combination feature information and to learn effectively from sparse features.

(3) Feature discretization module: completes a distributed implementation of the ChiMerge algorithm and improves the MDLP (minimum description length principle) algorithm, helping users to obtain an optimal feature discretization.

(4) High-cardinality feature encoding module: maps high-cardinality discrete features into a continuous feature space, avoiding the curse of dimensionality caused by one-hot encoding.
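As a concrete illustration of the oversampling module's idea, the sketch below generates synthetic minority samples with a partition-local SMOTE in PySpark. The partition-local neighbour search and every name in the snippet are illustrative assumptions; the thesis does not specify how its distributed SMOTE organises the k-nearest-neighbour step.

```python
import random
import numpy as np
from pyspark.sql import SparkSession

def smote_partition(rows, k=5, n_new=1):
    """Generate synthetic minority samples within one partition."""
    pts = [np.array(r, dtype=float) for r in rows]
    for x in pts:
        # k nearest neighbours among the points of this partition only
        neigh = sorted(pts, key=lambda p: float(np.linalg.norm(p - x)))[1:k + 1]
        if not neigh:
            continue
        for _ in range(n_new):
            nb = random.choice(neigh)
            gap = random.random()
            # interpolate between the sample and a random neighbour
            yield (x + gap * (nb - x)).tolist()

spark = SparkSession.builder.appName("smote-sketch").getOrCreate()
minority = spark.sparkContext.parallelize(
    [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]], numSlices=2)
synthetic = minority.mapPartitions(smote_partition)
print(synthetic.collect())
```

Keeping the neighbour search inside each partition avoids an all-pairs shuffle at the cost of a slightly coarser neighbour set, a common trade-off in distributed SMOTE variants.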
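The factorization machine behind the feature correlation extraction module scores a sample as y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, so every feature pair receives a weight through the latent vectors v_i even when the pair rarely co-occurs. Spark itself ships a factorization machine (FMClassifier/FMRegressor, Spark >= 3.0); the toy run below only illustrates the model being parallelised, not the thesis's three custom distributed training algorithms.

```python
from pyspark.ml.classification import FMClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fm-sketch").getOrCreate()
df = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.0, 0.5])),
    (0.0, Vectors.dense([1.0, 0.0, 0.2])),
], ["label", "features"])

# factorSize is the dimension of the latent vectors v_i
fm = FMClassifier(factorSize=4, stepSize=0.01, maxIter=20)
model = fm.fit(df)
model.transform(df).select("label", "prediction").show()
```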
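To make the ChiMerge step concrete, here is a minimal sketch under one plausible distributed design (an assumption, not necessarily the thesis's): per-value class counts are first aggregated on the cluster, e.g. with df.groupBy("feature", "label").count(), and the resulting small table is then merged on the driver by repeatedly fusing the adjacent interval pair with the smallest chi-square statistic until every pair exceeds the significance threshold.

```python
import numpy as np

def chi2(a, b):
    """Chi-square statistic for two adjacent intervals' class-count rows."""
    obs = np.array([a, b], dtype=float)
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    mask = exp > 0
    return float((((obs - exp) ** 2)[mask] / exp[mask]).sum())

def chimerge(counts, threshold=3.841):  # 95% level, 1 dof (2 classes)
    """counts: [(value, [count_class0, count_class1]), ...] sorted by value.
    In the distributed setting this small table would come from a
    cluster-side groupBy aggregation."""
    intervals = [([v], np.array(c, dtype=float)) for v, c in counts]
    while len(intervals) > 1:
        chis = [chi2(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] > threshold:
            break  # all adjacent pairs are significantly different
        vals, cnts = intervals.pop(i + 1)
        intervals[i] = (intervals[i][0] + vals, intervals[i][1] + cnts)
    # lower bound of each interval after the first = the cut points
    return [iv[0][0] for iv in intervals[1:]]

print(chimerge([(1, [10, 0]), (2, [8, 1]), (3, [1, 9]), (4, [0, 10])]))
```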
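For the high-cardinality encoding module, one standard technique that performs exactly such a mapping is smoothed mean (target) encoding, sketched below in PySpark. The formula, the column names, and the smoothing constant m are illustrative assumptions, since the thesis does not state which encoding it implements.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("target-encoding-sketch").getOrCreate()
df = spark.createDataFrame(
    [("u1", 1.0), ("u1", 0.0), ("u2", 1.0), ("u3", 0.0)],
    ["user_id", "label"])

prior = df.agg(F.avg("label")).first()[0]  # global target mean
m = 10.0                                   # smoothing strength (assumed)

# per-category count and target mean, computed on the cluster
stats = df.groupBy("user_id").agg(
    F.count("label").alias("n"), F.avg("label").alias("mean"))
encoded = stats.withColumn(
    "user_id_enc",
    (F.col("n") * F.col("mean") + F.lit(m * prior)) / (F.col("n") + F.lit(m)))
df.join(encoded.select("user_id", "user_id_enc"), on="user_id").show()
```

Each category is replaced by a single continuous value, a count-weighted blend of its own target mean and the global prior, so a column with millions of distinct values costs one numeric feature instead of a million one-hot columns.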
Keywords/Search Tags: feature processing, oversampling, distributed computing, discretization, factorization machine, high-cardinality attributes