| In the era of big data, massive amounts of data are generated every day. These data arrive as time series in which new features appear and existing features go missing, so the feature space changes dynamically; such a stream is a typical incomplete data stream. Obtaining useful information from dynamically generated data streams that arrive in real time, and constructing prediction models from them, is currently a difficult and actively studied problem in data mining. As an effective way to process streaming data, online learning can process data instances in real time, dynamically optimize the objective function, and update the prediction model so that it adapts to changes in the data. However, existing methods can only handle data streams that change within a fixed feature space or in a specific pattern, or they neglect dynamic changes in the data distribution, so they cannot effectively handle incomplete feature spaces and class distribution imbalance. Moreover, completing missing features or training multiple classifiers may incur huge computational overhead. To address these problems, the main contributions of this study are as follows. First, to overcome the limitations of existing approaches, namely that they cannot deal effectively with incomplete data streams and have a high training cost, this paper proposes an algorithm called Online Learning for Incomplete and Imbalanced Data Streams (OLIDS). OLIDS identifies different features through feature space projection and extracts the information in those features. To adapt to the dynamics of the feature space and avoid feature reconstruction, the classifier is re-weighted using the confidence of the feature space. The classifier is then updated in real time following the passive-aggressive update criterion. Finally, the classifier is truncated using the relative uncertainty vector of the universal feature space to further improve the generalization performance of the model. Second, existing algorithms for incomplete data neglect class distribution
imbalance, which leads to poor generalization performance. We optimize the F-measure online by minimizing a weighted surrogate loss, and we establish a dynamic cost mechanism to improve the performance of the model on imbalanced data. Moreover, we analyze the upper bound on the cumulative loss of the algorithm in both the linearly separable and the linearly inseparable cases. We then derive a bound on the number of misclassifications made by OLIDS for any class. Third, to evaluate the proposed algorithm, we simulate three scenarios (trapezoidal data streams, feature-evolvable streams, and incomplete and imbalanced data streams) on 14 representative data sets from different fields and compare OLIDS with state-of-the-art algorithms. We use F-measure, G-mean, and runtime to analyze the performance of all algorithms in the three scenarios. Finally, OLIDS and the state-of-the-art algorithms are applied to a real-world sentiment classification task on movie review text, and the experimental results are analyzed to verify the practicability and effectiveness of the algorithm in practical applications. |
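
To make the passive-aggressive update and the evaluation metrics named in the abstract more concrete, here is a minimal Python sketch. It is not the OLIDS algorithm: feature space projection, confidence-based re-weighting, and truncation are omitted, and the class cost is a fixed argument rather than the dynamic cost mechanism described above. It assumes a standard PA-I step whose cap is scaled by a per-class cost, binary labels in {+1, -1}, and instances given as sparse dictionaries so that features may appear or vanish between rounds. The names `pa_update`, `predict`, `f_measure_and_g_mean`, `cost`, and `aggressiveness` are hypothetical.

```python
import math


def predict(weights, features):
    """Return +1 or -1 using only the features present in this instance."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1 if score >= 0 else -1


def pa_update(weights, features, label, cost=1.0, aggressiveness=1.0):
    """Apply one cost-scaled PA-I step in place and return the hinge loss.

    weights and features are dicts mapping feature name -> float.
    cost is an assumed per-class weight (e.g. larger for the minority class).
    """
    # Unseen features start at weight zero; missing features are skipped.
    margin = label * sum(weights.get(name, 0.0) * value
                         for name, value in features.items())
    loss = max(0.0, 1.0 - margin)
    squared_norm = sum(value * value for value in features.values())
    if loss == 0.0 or squared_norm == 0.0:
        return loss  # passive: the margin is already large enough
    # Aggressive: move just far enough to fix the margin, capped by the cost.
    step = min(aggressiveness * cost, loss / squared_norm)
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + step * label * value
    return loss


def f_measure_and_g_mean(true_labels, predicted_labels):
    """Compute F-measure and G-mean with +1 as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return f_measure, g_mean
```

In an online loop, each arriving instance would first be passed to `predict` and then to `pa_update`, with the predictions accumulated for `f_measure_and_g_mean`. Storing weights in a dictionary keyed by feature name is one simple way to let the feature space grow or shrink without reconstructing missing values.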