
Outlier Detection And Application Of Categorical Data In Spark Cluster

Posted on: 2021-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J L Li
Full Text: PDF
GTID: 1368330611457368
Subject: Industrial Engineering
Abstract/Summary:
The arrival of industrial big data has promoted the development of modern manufacturing, which has accumulated a large amount of data in the process. Data mining is an effective approach to big data analysis. Its results can be applied to the production, management, and operation of machinery manufacturing enterprises to optimize production, improve production technology, and diagnose equipment failures, thereby reducing production costs and improving operating efficiency. In current mechanical product processing, hidden problems caused by declining equipment performance, loss of precision, wear of consumable parts, human factors, and other causes are generally hard to find but affect product quality. Outlier detection, as a data mining method, can effectively find such hidden problems in machining data. In this dissertation, working in a Spark Cluster environment based on in-memory computing, the author studies the theory and methods of outlier detection for categorical data, as well as an outlier detection method for cold roll processing data. The work provides a new method and implementation approach of parallel clustering outlier detection for big data analysis. It also provides a means of finding hidden problems with abnormal characteristics that may exist in the processing of mechanical products, such as declining equipment accuracy, tester qualification, and the processing environment. The main research results are as follows:

(1) An outlier detection algorithm for categorical data based on feature grouping, called WATCH, is proposed. By measuring the correlation between data features, the algorithm divides the features into multiple feature groups. It can find outliers hidden in feature subspaces, which improves detection accuracy, and it can reveal differences in feature patterns from different aspects. Experiments verify the precision, efficiency, and interpretability of the WATCH algorithm.

(2) Because the WATCH algorithm cannot adequately handle large-scale data, a parallel outlier mining method based on feature grouping, called POS, is proposed for the Spark Cluster environment. Through parallel feature grouping and parallel outlier detection, POS distributes large-scale data sets across the compute nodes of the cluster. A parallel optimization strategy of RDD caching and parameter tuning improves the performance of the POS algorithm. Experiments on a Spark Cluster verify the scalability and extensibility of the POS algorithm.

(3) A mixed-attribute outlier detection method based on mutual information is proposed. The method uses mutual information to weight the mixed attributes. Outlier scores are defined separately for numerical data and categorical data and then normalized, so that the similarity between data objects is measured more objectively and accurately. This improves outlier detection performance. Experiments verify the effectiveness and feasibility of the algorithm.

(4) A parallel method for computing mutual information on Spark, called MiCS, is proposed. First, the algorithm uses a column transformation to convert the data set into multiple data subsets, and then uses two variable-length arrays to cache intermediate results. This addresses the heavy and highly repetitive computation of mutual information on categorical data. Second, to deal with data skew in the parallel computation of mutual information on Spark, the MiCS algorithm redefines the data skew model to quantify the skew between the partitions created by Spark, and alleviates data skew in the shuffle process to optimize network performance.

(5) Taking actual cold roll production data as the application background, a prototype outlier detection system for the cold roll manufacturing process is designed and implemented in the Spark Cluster environment. The design is based on a detailed analysis of the complexity of the cold roll manufacturing process, the typical failure modes of cold rolls, and the factors affecting the quality of the cold roll production process. The data preprocessing, parameter settings, system architecture, and system function modules are introduced in detail. Through outlier detection, the system can dig out of the big data the hidden problems with abnormal characteristics that arise during cold roll processing, and thereby find possible quality defects in the products.
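
Several of the results above depend on measuring how strongly categorical features are related, with mutual information named explicitly in results (3) and (4). The following is a minimal single-machine sketch of the standard empirical mutual information between categorical columns. It is not the WATCH, POS, or MiCS algorithm, whose details the abstract does not give. The function names `mutual_information` and `pairwise_mi` and the toy columns are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two categorical columns."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("columns must be non-empty and of equal length")
    count_x = Counter(xs)            # marginal counts of the first column
    count_y = Counter(ys)            # marginal counts of the second column
    count_xy = Counter(zip(xs, ys))  # joint counts of value pairs
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), with each p = count / n
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def pairwise_mi(columns):
    """Mutual information for every unordered pair of named columns."""
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Hypothetical toy records, not data from the dissertation.
columns = {
    "machine": ["A", "A", "B", "B"],
    "operator": ["x", "x", "y", "y"],
    "shift": ["day", "night", "day", "night"],
}
scores = pairwise_mi(columns)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

In this toy input, `machine` and `operator` always change together and so share the most information, while `shift` is independent of both. A feature-grouping step could, as one possible design, place strongly related features in the same group, and every pair must be scored to do so. That quadratic number of pairs is one reason a parallel computation of the kind described in result (4) becomes attractive on large data.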
Keywords/Search Tags:Intelligent Manufacturing, Big Data, Cluster System, Outlier Detection, Cold Roll