Clustering analysis is a common technique in data mining. With the rapid development of the information age, the scale of data is expanding quickly and the proportion of low-quality data is increasing. Such data is characterized by high redundancy, weak features, strong noise, and high dimensionality, all of which seriously degrade the performance of clustering analysis. Clustering methods designed around the characteristics of low-quality data are therefore of great significance for extracting valuable knowledge or patterns from it. This paper carries out in-depth research on feature selection for low-quality data, on clustering methods, and on their application to the measured spectral data of the Guo Shoujing Telescope (LAMOST). The main contents are as follows.

(1) A clustering analysis method based on feature selection is given. First, an initial feature subset is screened: the data is segmented, each segment is regarded as a candidate feature, and the candidate features are filtered by a ranking algorithm based on variance analysis, which removes some irrelevant features and yields an initial feature subset. Second, the data is discretized and an optimal feature subset is generated: K-Means clustering is performed on the feature subsets, and the cluster number assigned under each feature subset is used as a reconstructed feature, which discretizes the data; the optimal feature subset is then obtained by an exhaustive search over the reconstructed features. Finally, clusters are partitioned using set theory: taking the optimized feature subset as input, the intersection and union of each feature are computed, and samples that are equal under these operations are assigned to the same cluster, which gives the cluster partition under the optimal feature subset. Clustering experiments on UCI data sets and real spectral data sets show that, compared with four traditional clustering methods, the proposed method effectively improves both the accuracy and the time efficiency of low-quality data clustering.

(2) A clustering analysis of LAMOST Unknown spectral data is carried out. Targeting the characteristics of LAMOST Unknown spectral data, namely weak features, low signal-to-noise ratio, and high dimensionality, a clustering analysis framework for low-quality spectral data sets is proposed, and the factors leading to low quality are explored. First, 13 clusters are obtained with the NAPC-Spec algorithm. Second, based on the clustering results, the characteristic lines, continuum, and other features of each class are preliminarily analyzed. Finally, the spectral variations are analyzed and their origins are explored through statistical analysis of observed target features, environmental features, and instrument state features. This study contributes to a better understanding of the causes of low-quality spectral data, which is valuable for improving the overall data quality of large-scale spectroscopic sky survey projects and for promoting research on spectral processing techniques.

(3) A prototype system for low-quality data cluster analysis is designed and implemented, based on the research above and the characteristics of low-quality data. Its functions include data preprocessing, feature selection, data discretization, general data clustering, celestial spectral clustering, and result visualization. Tests on Unknown spectral data verify the correctness and the data analysis performance of the system, which can provide technical support for users who analyze and process spectral data.
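The three steps of the feature-selection clustering method in (1) can be illustrated with a minimal sketch. It is not the implementation used in this work, and several details are assumptions made only for illustration: the ranking score is taken to be the mean per-column variance of a segment, K-Means is a plain Lloyd iteration with a deterministic initialization, the exhaustive search for the optimal subset is omitted, and "equality" in the set-theoretic partition is interpreted as two samples sharing identical discretized labels on every retained segment. All function names and parameters (`seg_len`, `keep`, `k`) are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance


def split_segments(n_features, seg_len):
    """Cut the feature axis into contiguous segments (candidate features)."""
    return [range(s, min(s + seg_len, n_features))
            for s in range(0, n_features, seg_len)]


def segment_score(data, seg):
    """Assumed ranking score: mean per-column variance inside the segment."""
    return sum(pvariance([row[j] for row in data]) for j in seg) / len(seg)


def kmeans_labels(points, k, iters=20):
    """Plain Lloyd's K-Means; initial centers are evenly spaced samples."""
    step = max(len(points) // k, 1)
    centers = [points[min(i * step, len(points) - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_by_label_equality(data, seg_len, keep, k):
    """Screen segments, discretize each with K-Means, group equal label tuples."""
    segs = split_segments(len(data[0]), seg_len)
    # Step 1: keep the `keep` highest-scoring segments as the feature subset.
    ranked = sorted(segs, key=lambda s: segment_score(data, s), reverse=True)
    kept = ranked[:keep]
    # Step 2: each kept segment becomes one discrete (reconstructed) feature.
    discrete = [kmeans_labels([[row[j] for j in seg] for row in data], k)
                for seg in kept]
    # Step 3: samples with identical label tuples fall into the same cluster.
    groups = defaultdict(list)
    for i, key in enumerate(zip(*discrete)):
        groups[key].append(i)
    return list(groups.values())


# Toy data: columns 0-1 separate two groups, columns 2-3 are near-constant noise.
toy = [
    [0.0, 0.1, 5.0, 5.1],
    [0.2, 0.0, 5.1, 5.0],
    [0.1, 0.2, 5.0, 5.0],
    [9.9, 10.0, 5.1, 5.1],
    [10.1, 9.8, 5.0, 5.1],
    [10.0, 10.2, 5.1, 5.0],
]
clusters = cluster_by_label_equality(toy, seg_len=2, keep=1, k=2)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

On the toy data the low-variance segment is discarded in the screening step, so the partition is driven only by the informative segment. On real spectra the segment length, the number of retained segments, and the value of `k` would need to be chosen by the search procedure described above rather than fixed by hand.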