Automatic Feature Engineering System For Tabular Data

Posted on: 2022-12-05
Degree: Master
Type: Thesis
Country: China
Candidate: F Y Zhang
GTID: 2518306752954379
Subject: Master of Engineering
Abstract/Summary:
In most scenarios, the data required for machine learning is stored in tabular form in a database or other storage system. In the machine learning pipeline, feature engineering is often one of the key factors that determine model performance, and it is also one of the most time-consuming steps in machine learning development. Even senior experts need repeated iteration and trial and error to find a feature engineering solution that performs well. Because of limited domain knowledge and other factors, manual feature engineering can easily overlook meaningful features. Therefore, this thesis designs and implements an automatic feature engineering system for tabular data to help data scientists improve model performance and accelerate the development of machine learning applications.

In this thesis, the problem of automatic feature engineering is defined as searching for the set of feature transformation operations that maximizes the performance of the model. To address the deficiencies of existing research on automatic feature engineering, the system aims to meet the following four requirements: 1) the search method supports both regression and classification datasets; 2) the system supports automatic feature engineering on high-dimensional data; 3) the search is performance-oriented, and the search results are as close to the global optimum as possible; 4) the search completes quickly in a huge search space.

This thesis abstracts the process of automatic feature engineering as a Markov decision process and introduces reinforcement learning to perform it. The parts of the proposed system are decoupled from one another, so developers can easily incorporate prior knowledge into individual components to perform automatic feature engineering more efficiently. The main contributions are as follows:

(1) This thesis introduces reinforcement learning to realize automatic feature engineering. The design of the action space, state space, and reward function allows the agent to perceive environmental information more accurately, and the reward function in particular makes the search performance-oriented and yields better search results. Together, the introduction of reinforcement learning and the design of the reward function make the search method applicable to both classification and regression problems.

(2) This thesis proposes a delayed reward mechanism and a cache pruning strategy to reduce the number of reward function evaluations, and a lazy loading mechanism to reduce the number of feature transformations. These optimizations increase the search speed by more than 10 times on average compared with the unoptimized system, and make the search more than 30 times faster than other methods on high-dimensional data.

(3) This thesis proposes a fine-grained action representation method. It avoids a feature selection operation each time the agent performs an action, which reduces the time per action, and it represents actions more precisely, which helps the agent iterate faster and learn better strategies.

(4) Feature selection and the selection of feature transformation functions reduce the size of the action space and state space, which supports automatic feature engineering on high-dimensional data and accelerates the search process.

This thesis also implements other related research methods and compares them with the proposed method. The experimental results demonstrate the effectiveness and efficiency of the proposed method.
Keywords/Search Tags:Automatic Feature Engineering, Reinforcement Learning, Delayed Reward, Cache Pruning, Lazy Loading, Fine-Grained
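
As a rough illustration of the ideas named in the abstract, the following minimal Python sketch frames feature-transformation search as an episodic environment. It is an assumption-laden toy, not the thesis's implementation: the class `FeatureEnv`, the operator set `TRANSFORMS`, the correlation-based score, and the random policy standing in for the reinforcement learning agent are all hypothetical. An action is a (column index, transformation name) pair, the reward is paid only at the end of an episode (a delayed reward), scores are memoized per unordered set of actions (a simple stand-in for the cache strategy), and transformed columns are computed only when a score evaluation needs them (lazy loading).

```python
import math
import random

# Hypothetical operator set; the abstract does not list the thesis's operators.
TRANSFORMS = {
    "square": lambda v: v * v,
    "sqrt": lambda v: math.sqrt(abs(v)),
    "log1p": lambda v: math.log1p(abs(v)),
}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


class FeatureEnv:
    """Toy episodic environment: each step adds one transformed column."""

    def __init__(self, columns, target, episode_len=3):
        self.columns, self.target = columns, target
        self.episode_len = episode_len
        # Fine-grained actions: one per (column index, transformation) pair.
        self.actions = [(i, t) for i in range(len(columns)) for t in TRANSFORMS]
        self.score_cache = {}    # frozenset of actions -> score
        self.feature_cache = {}  # action -> transformed column
        self.score_evals = 0     # number of real (uncached) evaluations
        self.state = ()

    def reset(self):
        self.state = ()
        return self.state

    def _feature(self, action):
        # Lazy loading: transform a column only when a score needs it.
        if action not in self.feature_cache:
            col, name = action
            fn = TRANSFORMS[name]
            self.feature_cache[action] = [fn(v) for v in self.columns[col]]
        return self.feature_cache[action]

    def _score(self, state):
        # Stand-in for model performance: best |correlation| with the target.
        key = frozenset(state)
        if key not in self.score_cache:
            self.score_evals += 1
            feats = list(self.columns) + [self._feature(a) for a in key]
            self.score_cache[key] = max(abs(pearson(f, self.target)) for f in feats)
        return self.score_cache[key]

    def step(self, action):
        self.state = self.state + (action,)
        done = len(self.state) >= self.episode_len
        # Delayed reward: evaluate only once, when the episode ends.
        reward = self._score(self.state) - self._score(()) if done else 0.0
        return self.state, reward, done


def random_search(env, episodes=50, seed=0):
    """Random policy used here in place of a trained RL agent."""
    rng = random.Random(seed)
    best_state, best_reward = (), 0.0
    for _ in range(episodes):
        env.reset()
        done = False
        while not done:
            state, reward, done = env.step(rng.choice(env.actions))
        if reward > best_reward:
            best_state, best_reward = state, reward
    return best_state, best_reward


if __name__ == "__main__":
    x = [float(i) for i in range(1, 11)]
    noise = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
    env = FeatureEnv([x, noise], [v * v for v in x])
    print(random_search(env), "score evaluations:", env.score_evals)
```

In this sketch, replaying an episode that applies the same actions in a different order hits the score cache and triggers no new evaluation, which is the kind of saving the delayed reward and caching mechanisms are meant to provide. Treating the state as an unordered set is valid here only because every transformation is applied to an original column rather than composed with earlier ones.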