
Research On Feature Selection Methods For Symbolic Interval Data And Their Application

Posted on: 2015-09-01
Degree: Master
Type: Thesis
Country: China
Candidate: Y C Liu
Full Text: PDF
GTID: 2298330467486155
Subject: Management Science and Engineering
Abstract/Summary:
With the development of data collection and storage technologies, more and more data is emerging in many fields, and this growth is increasing the demand for methods and techniques for massive data analysis. Processing huge amounts of data with traditional data analysis methods is computationally expensive, and it also makes the overall nature of the sample difficult to grasp. Symbolic Data Analysis techniques overcome these disadvantages to some extent by compressing the data. Interval data is the most common form of symbolic data and is therefore an important subject of research. Feature selection for symbolic interval data can reduce the dimension of the data and extract the key features. Analyzing symbolic interval data first requires a similarity measure for interval data. This thesis therefore reviewed and compared several common interval similarity measures, found that the interval Hausdorff distance and the interval Euclidean distance were more suitable for measuring interval data, and used them as the basic similarity measures. In addition, this thesis presented a new interval distance measure whose parameters can be adjusted according to the distribution of the data, so that it better represents the meaning of an interval number. Because existing feature selection methods for symbolic interval data cannot identify features in which the class centers are close to each other, this thesis proposed a new feature selection method (FSMSID) to address this shortcoming. In this method, an optimization model that aims to maximize the similarity between each sample and its class center was established to estimate the feature weights for symbolic interval data, and the feature weights were obtained by the Lagrange multiplier method. A nearest neighbor classifier was then constructed from the feature weights, and its classification accuracy was used to evaluate the corresponding features.
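The two basic interval distances named above can be sketched as follows. This is a minimal illustration that assumes the usual textbook definitions for closed intervals given as (lower, upper) pairs; the abstract does not state the exact formulas used in the thesis, and the function names and the weighted sum over features are hypothetical choices made for this example. The new adjustable distance proposed in the thesis is not reproduced here, because the abstract does not define it.

```python
import math

def hausdorff_interval(x, y):
    """Hausdorff distance between two closed intervals x=(lo, hi) and y=(lo, hi).

    For intervals on the real line this reduces to the larger of the
    lower-bound gap and the upper-bound gap.
    """
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

def euclidean_interval(x, y):
    """Euclidean distance between two intervals, treating each as the point (lo, hi)."""
    return math.hypot(x[0] - y[0], x[1] - y[1])

def weighted_distance(u, v, weights, dist=hausdorff_interval):
    """Weighted sum of per-feature interval distances (an assumed aggregation).

    u, v    : sequences of (lo, hi) intervals, one per feature
    weights : one non-negative weight per feature
    """
    return sum(w * dist(a, b) for a, b, w in zip(u, v, weights))
```

With feature weights in place, a feature whose weight is close to zero contributes almost nothing to the distance, which is how a weighted nearest neighbor classifier can ignore irrelevant features.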
Ten-fold cross-validation was used to evaluate the accuracy of the classifier. To verify the effectiveness of the method, numerical experiments were carried out on artificially generated data sets and on real data sets. The results showed that the method can effectively remove irrelevant features and recognize the features associated with the class label. To verify the advantages of analysis methods for symbolic interval data, FSMSID was then applied to a fetal heart monitoring data set (Cardiotocography). The Cardiotocography data was first preprocessed and converted to symbolic interval data, and the FSMSID algorithm was then applied to it. To verify the advantages of symbolic interval data analysis for large-scale data processing, FSMSID was compared with the nearest neighbor classifier in terms of accuracy and time complexity. In addition, to verify that symbolic interval data has advantages over the sample mean, the classifiers built from the symbolic interval data and from the sample mean were compared, with classifier precision used to judge their quality.
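The conversion to interval data and the weighted nearest neighbor step might look roughly like the sketch below. It rests on assumptions not stated in the abstract: each group of raw records is collapsed to one (min, max) interval per feature, the per-feature distance is the interval Hausdorff distance, and the feature weights are simply passed in rather than solved for. The function names are hypothetical, and the thesis's actual preprocessing and optimization model may differ.

```python
def to_intervals(rows):
    """Collapse a group of numeric rows into one (min, max) interval per feature.

    rows : list of equal-length lists of numbers (one list per raw record)
    """
    columns = zip(*rows)
    return [(min(c), max(c)) for c in columns]

def nearest_neighbor(query, train, weights):
    """Return the label of the training item closest to `query`.

    query   : list of (lo, hi) intervals, one per feature
    train   : list of (interval_vector, label) pairs
    weights : one non-negative weight per feature (assumed given)
    """
    def dist(u, v):
        # Weighted sum of per-feature interval Hausdorff distances.
        return sum(
            w * max(abs(a[0] - b[0]), abs(a[1] - b[1]))
            for a, b, w in zip(u, v, weights)
        )
    return min(train, key=lambda item: dist(query, item[0]))[1]
```

Because many raw records collapse into a single interval vector, the classifier compares a query against far fewer items than a nearest neighbor search over the raw records would, which is the source of the time savings the comparison in the abstract is concerned with.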
Keywords/Search Tags: symbolic data analysis, similarity measure, feature selection, nearest neighbor classifier, interval data