Nowadays,we have entered the era of big data of the Internet of everything.In the process of data collection and transmission,there will be some data missing and data errors and omissions,and we need to rely on partial data to recover all data.The data set can usually be expressed in the form of matrix,and matrix filling is to study how to fill the missing data quickly,reasonably and with the smallest error through the algorithm when there are missing items in the matrix.The closer the value is to the actual lost value,the better the filling effect is.In order to overcome the disadvantages of low recovery accuracy and long recovery time of traditional matrix filling algorithms,this paper proposes two parallel matrix filling algorithms to improve the recovery accuracy and reduce the recovery time.The specific research work is as follows:A matrix partitioning method based on optimized K-means clustering is proposed,which can block high-dimensional matrices before matrix filling,so as to carry out more accurate matrix recovery and parallel calculation.First,the matrix is divided into rows or columns.According to the three evaluation parameters designed in this paper,the cluster center dominance method is used to determine the initial center vectors successively.Then,the initial center vector set was taken as the input parameter of K-means,and the large matrix was divided into K submatrix blocks according to the similarity degree by k-means clustering.Finally,iteration is stopped when the sum of squares of error is less than the set value.At this point,the pre-processing of the large matrix is over,and the large matrix is divided into several sub-matrix blocks with better correlation than the original matrix.External parallel padding can be performed between each submatrix block.An internal parallel strategy for stochastic gradient descent filling algorithm is proposed.In the process of the algorithm,it is found that the sample elements in the diagonal position of the original matrix are not dependent on each other and have no influence on each other when the feature matrix is iteratively updated.Based on this characteristic,this paper takes the sample elements of diagonal position of matrix for synchronous update,thus reducing the time cost of matrix filling.Finally,two open source network datasets,Harvard226 and GEANT,were used to verify the above two methods.The experimental results show that the partitioning method of clustering submatrix is independent of the underlying algorithm,which can well divide the similar submatrix and further improve the precision of matrix filling.The parallel algorithm can also improve the calculation speed,so that the overall algorithm can better calculate in large scale data. |