| Society has entered, and will remain in, the era of big data. Massive data has four "V" characteristics: Volume, Variety, Velocity and Veracity. Although the amount of data is large, it often carries redundant information, and what people care about is the effective characteristics the data carries. If the data is treated as a large matrix, the matrix is sparse in most cases and can be mapped to a lower-dimensional space, which we call the data feature space. Projecting the original data onto that space yields the characteristic kernel data, which often carries the main information of the original data. After defining the characteristic kernel data and the feature space with an information loss rate below a given threshold, our aim is to find the optimal feature data and the optimal feature space. For these reasons, this paper proposes methods for mining the principal components of data and reducing the information between variables using the Hadoop distributed computing framework. At the same time, some techniques and objectives are proposed to address the weak points of Hadoop. The main contents of this paper are as follows. First, we present the preparatory knowledge, which provides theoretical support and measurement criteria for the implementation of the specific algorithms. We then provide a new vector data structure for Hadoop, suited to the distributed application environment, and define the workflow and data format of the data sender and receiver between different nodes. Second, the data preprocessing module converts the input information into a form that the system can recognize; we then obtain a tridiagonal matrix and apply the QR algorithm to it to obtain the characteristic information. Finally, the feature vectors are transformed to obtain a new projection space, and the original data is projected into the new space to obtain the kernel data set. In this paper, vectors are often processed in the
process of implementation. Although the dimension of each vector is large, a vector occupies only kilobytes of storage, yet each part of the matrix stored in the Hadoop Distributed File System (HDFS) takes up a block, which is why NameNode memory usage is high and file access efficiency is low. To address Hadoop's weakness in handling massive numbers of small files, we propose a new technique to optimize HDFS, the basic idea of which is to merge small files into a large file and then build an index. Furthermore, name-based indexes can effectively improve file access efficiency. The experimental results show that the proposed strategy can effectively extract the kernel data set from the original data set. |
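
The merge-and-index idea for small files can be illustrated with a minimal sketch. It is not the implementation used in this paper: it assumes an in-memory byte string stands in for the large merged HDFS file, and the function names `merge_small_files` and `read_file`, as well as the sample file names, are hypothetical.

```python
import io


def merge_small_files(files):
    """Pack many small payloads into one blob plus a name-based index.

    `files` maps a file name to its bytes (an assumption for this sketch;
    a real system would read the small files from HDFS). The returned
    index maps each name to its (offset, length) inside the merged blob.
    """
    blob = io.BytesIO()
    index = {}
    for name, payload in files.items():
        offset = blob.tell()          # where this small file starts
        blob.write(payload)
        index[name] = (offset, len(payload))
    return blob.getvalue(), index


def read_file(blob, index, name):
    """Look up one small file by name and slice it out of the merged blob."""
    offset, length = index[name]
    return blob[offset:offset + length]


# Hypothetical example: three small vector files of a few bytes each.
vectors = {
    "row_0.vec": b"0.1,0.0,0.7",
    "row_1.vec": b"0.0,0.4,0.0",
    "row_2.vec": b"0.9,0.0,0.2",
}
merged, name_index = merge_small_files(vectors)
row = read_file(merged, name_index, "row_1.vec")
```

In this sketch the storage layer tracks one large object instead of one entry per vector, and a lookup by name costs a single dictionary access followed by a slice, which is the effect the name-based index is meant to achieve.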