
The Design And Implementation Of Distributed Feature Processing Tools Library Based On Spark

Posted on: 2020-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Q Pu
Full Text: PDF
GTID: 2518305735985479
Subject: Master of Engineering
Abstract/Summary:
With the development of the internet economy, online services generate huge amounts of business data. Storing, transmitting, and using these data are difficult in the traditional stand-alone mode, chiefly because the processing speed of a single machine cannot be scaled up much further. Data mining tasks in particular, both feature processing and model training, are constrained by single-machine hardware. To cope with these problems, distributed storage and distributed computation frameworks are widely used; distributed computation solves the problem of expanding computing power. However, classic machine learning and feature processing algorithms were essentially designed for a single-machine environment and do not consider distributed resources, so they cannot be applied directly to take advantage of them. This thesis therefore expounds the redesign, implementation details, and applications of four types of feature processing algorithms in a distributed environment.

The thesis chooses Spark, currently the most popular distributed computation framework and one widely used in enterprises, as the basis for implementing the parallel algorithms. The distributed implementations are integrated into a distributed feature processing tool library that helps users process data in a distributed environment.

To address the difficulties encountered in data mining tasks, several classic processing algorithms are redesigned and implemented for the distributed environment. The problems to be solved fall into four categories: imbalanced data distribution, discretization of continuous features, correlation extraction of discrete features, and encoding of high-cardinality categorical features. Accordingly, the project consists of four modules (illustrative sketches of each follow this list):

(1) Oversampling module: designs and completes distributed implementations of the SMOTE algorithm and its improved Borderline-SMOTE variant, helping users to solve the problem of data imbalance.

(2) Feature correlation extraction module: implements three different distributed training algorithms for the factorization machine, helping users to extract cross-combination feature information and to learn effectively from sparse features.

(3) Feature discretization module: completes a distributed implementation of the ChiMerge algorithm and improves the MDLP (minimum description length principle) algorithm, helping users to obtain an optimal feature discretization.

(4) High-cardinality feature encoding module: maps high-cardinality discrete features into a continuous feature space, avoiding the curse of dimensionality caused by one-hot encoding.
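As a concrete illustration of the oversampling module's idea, the sketch below generates synthetic minority samples with a partition-local SMOTE in PySpark. The partition-local neighbour search and every name in the snippet are illustrative assumptions; the thesis does not specify how its distributed SMOTE organises the k-nearest-neighbour step.

```python
import random
import numpy as np
from pyspark.sql import SparkSession

def smote_partition(rows, k=5, n_new=1):
    """Generate synthetic minority samples within one partition."""
    pts = [np.array(r, dtype=float) for r in rows]
    for x in pts:
        # k nearest neighbours among the points of this partition only
        neigh = sorted(pts, key=lambda p: float(np.linalg.norm(p - x)))[1:k + 1]
        if not neigh:
            continue
        for _ in range(n_new):
            nb = random.choice(neigh)
            gap = random.random()
            # interpolate between the sample and a random neighbour
            yield (x + gap * (nb - x)).tolist()

spark = SparkSession.builder.appName("smote-sketch").getOrCreate()
minority = spark.sparkContext.parallelize(
    [[1.0, 2.0], [1.2, 1.9], [0.9, 2.2], [1.1, 2.1]], numSlices=2)
synthetic = minority.mapPartitions(smote_partition)
print(synthetic.collect())
```

Keeping the neighbour search inside each partition avoids an all-pairs shuffle at the cost of a slightly coarser neighbour set, a common trade-off in distributed SMOTE variants.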
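The factorization machine behind the feature correlation extraction module scores a sample as y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, so every feature pair receives a weight through the latent vectors v_i even when the pair rarely co-occurs. Spark itself ships a factorization machine (FMClassifier/FMRegressor, Spark >= 3.0); the toy run below only illustrates the model being parallelised, not the thesis's three custom distributed training algorithms.

```python
from pyspark.ml.classification import FMClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fm-sketch").getOrCreate()
df = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.0, 0.5])),
    (0.0, Vectors.dense([1.0, 0.0, 0.2])),
], ["label", "features"])

# factorSize is the dimension of the latent vectors v_i
fm = FMClassifier(factorSize=4, stepSize=0.01, maxIter=20)
model = fm.fit(df)
model.transform(df).select("label", "prediction").show()
```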
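To make the ChiMerge step concrete, here is a minimal sketch under one plausible distributed design (an assumption, not necessarily the thesis's): per-value class counts are first aggregated on the cluster, e.g. with df.groupBy("feature", "label").count(), and the resulting small table is then merged on the driver by repeatedly fusing the adjacent interval pair with the smallest chi-square statistic until every pair exceeds the significance threshold.

```python
import numpy as np

def chi2(a, b):
    """Chi-square statistic for two adjacent intervals' class-count rows."""
    obs = np.array([a, b], dtype=float)
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    mask = exp > 0
    return float((((obs - exp) ** 2)[mask] / exp[mask]).sum())

def chimerge(counts, threshold=3.841):  # 95% level, 1 dof (2 classes)
    """counts: [(value, [count_class0, count_class1]), ...] sorted by value.
    In the distributed setting this small table would come from a
    cluster-side groupBy aggregation."""
    intervals = [([v], np.array(c, dtype=float)) for v, c in counts]
    while len(intervals) > 1:
        chis = [chi2(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        i = int(np.argmin(chis))
        if chis[i] > threshold:
            break  # all adjacent pairs are significantly different
        vals, cnts = intervals.pop(i + 1)
        intervals[i] = (intervals[i][0] + vals, intervals[i][1] + cnts)
    # lower bound of each interval after the first = the cut points
    return [iv[0][0] for iv in intervals[1:]]

print(chimerge([(1, [10, 0]), (2, [8, 1]), (3, [1, 9]), (4, [0, 10])]))
```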
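For the high-cardinality encoding module, one standard technique that performs exactly such a mapping is smoothed mean (target) encoding, sketched below in PySpark. The formula, the column names, and the smoothing constant m are illustrative assumptions, since the thesis does not state which encoding it implements.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("target-encoding-sketch").getOrCreate()
df = spark.createDataFrame(
    [("u1", 1.0), ("u1", 0.0), ("u2", 1.0), ("u3", 0.0)],
    ["user_id", "label"])

prior = df.agg(F.avg("label")).first()[0]  # global target mean
m = 10.0                                   # smoothing strength (assumed)

# per-category count and target mean, computed on the cluster
stats = df.groupBy("user_id").agg(
    F.count("label").alias("n"), F.avg("label").alias("mean"))
encoded = stats.withColumn(
    "user_id_enc",
    (F.col("n") * F.col("mean") + F.lit(m * prior)) / (F.col("n") + F.lit(m)))
df.join(encoded.select("user_id", "user_id_enc"), on="user_id").show()
```

Each category is replaced by a single continuous value, a count-weighted blend of its own target mean and the global prior, so a column with millions of distinct values costs one numeric feature instead of a million one-hot columns.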
Keywords/Search Tags: feature processing, oversampling, distributed computing, discretization, factorization machine, high-cardinality attributes