Open source attracts developers from all over the world because of its openness, collaboration, and sharing. This greatly improves software development efficiency and supports both the digitalization of enterprises and the development of data science. As open source software development and its ecosystem have grown rapidly and code hosting platforms have matured, platforms such as GitHub now record a large volume of developer behavior in real time, including issues, pull requests (PRs), forks, and project comments. Open source researchers can collect massive streams of this behavioral data through GitHub's API. If these behavioral data streams can be mined and analyzed efficiently, we can reveal the actual factors that drive fluctuations in multimodal behavior and eliminate interference from invalid factors such as malicious bot commits. On that basis, an open source measurement and ecosystem assessment system can be built to sustain the efficiency of open source software development and the healthy operation of the open source community.

Outlier detection and real-time prediction are two typical research topics for time-series behavioral data. The outlier detection research in this paper aims to filter out invalid data, identify "minority" abnormal behaviors and abnormal patterns that differ markedly from the rest, and report anomalies to project managers promptly to enable intelligent early warning. The real-time prediction research aims to fill in missing behavioral data by mining inherent information such as the periodicity, probability distribution, and trend fluctuations of massive behavioral data, to predict the trend of critical behavioral data, and to measure the future development of open source projects and developers. Outlier detection and
real-time prediction of developer behavioral data can not only filter out outliers in behavioral data but also quantitatively analyze the development prospects of open source projects.

There are few academic studies on outlier detection and real-time prediction for dynamic behavioral data. The fundamental reason is that mining such massive, dynamic time-series data streams requires careful consideration of storage consumption, computing scale, time and space complexity, "concept drift", and the "curse of dimensionality". In addition, the inherent non-stationarity, skewed distribution, and temporal nature of behavioral data streams in real-world open source projects further increase the difficulty of real-time prediction and anomaly detection. Therefore, for the results of this paper to have practical value, the proposed scheme must run efficiently on limited infrastructure and must deliver low-latency real-time analysis with minimal computing overhead, so that the research outcomes are more competitive. To this end, we design an efficient, low-latency trend prediction and anomaly detection scheme based on the real-world needs of open source research, using developer behavior data from GitHub as the research subject. The main findings and contributions of this paper are as follows:

1. Real-time outlier detection for multi-behavior data streams: Conventional outlier detection algorithms for multi-dimensional data streams face two difficulties: (1) massive data is difficult to store; (2) detection performance is poor on high-dimensional data. In this paper, we propose the CELOF algorithm to overcome these two limitations and achieve accurate outlier detection. CELOF first uses information entropy to construct a new index weight computation method to distinguish the influence weights of
different indicators, and then uses a sliding window mechanism and a clustering method to cluster the data. Finally, it uses a novel proximal distance factor discriminant method to extract and compress data from different clusters, thereby reducing data storage and enabling real-time outlier detection. Experimental results show that CELOF not only improves detection accuracy but also significantly reduces runtime compared with commonly used anomaly detection models.

2. Trend prediction for a single-behavior data stream: To construct a highly robust predictive model that requires no prior information and no assumed distribution, this paper proposes a hybrid model based on Recursive Empirical Mode Decomposition (REMD) and a Memory Wavelet Neural Network (MWNN) to predict the trend of behavioral data. In REMD-MWNN, we first use REMD to decompose the behavioral data into multiple intrinsic mode functions (IMFs) of equal length and to mine the hidden factors that affect the behavioral data. We then design a novel memristive recurrent neural network, MWNN, to predict each IMF individually. Finally, we integrate the predicted values of all sub-sequences to obtain the prediction for the input data. Experimental results show that the proposed model is highly competitive with other algorithms in predicting future changes and in capturing the evolutionary patterns of hidden factors.

3. Prediction for multi-row data with missing values: Multi-behavior data in open source is a typical high-dimensional time-series data stream, but it contains considerable noise and some missing data. To achieve accurate prediction in such scenarios, this paper proposes a Temporal Autoregressive Matrix Factorization (TAMF) framework that supports data-driven temporal learning and prediction. The model builds a novel autoregressive model on top of a scalable factorization model, so that it can make accurate predictions both on complete multidimensional data and on data with
missing values. In addition, we design a trend autoregressive sub-model and a period autoregressive sub-model in TAMF to extract the trend and period features of OSS behavioral data, which mines the intrinsic information of the data more deeply and improves the predictive accuracy and generality of the model. Finally, this paper selects 10 sets of daily OSS monitoring data from GitHub for case analysis based on behavioral indicator sets. Experimental results show that TAMF outperforms existing methods in terms of scalability and prediction quality.

In summary, this paper focuses on trend prediction and outlier detection for behavioral data streams in open source projects and poses three fundamental questions. First, when facing massive multi-behavior data streams, how can high-precision real-time anomaly detection be achieved with limited computing and storage resources? Second, behavioral data streams in real-world open source projects exhibit obvious features such as skewed distributions and concept drift; how can the trend of a single behavior be predicted without any prior conditions? Third, data collected from open source projects will have missing values; how can the correlations between behaviors be fully exploited to complete the data and to design more accurate multi-behavior trend prediction? To address these three questions, we propose corresponding solutions in this paper, building on existing research results. Experiments on standard datasets and real-world open source behavioral data demonstrate that all proposed solutions are efficient and accurate.
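
As an illustration of the first step described for CELOF (entropy-based indicator weighting), the following is a minimal sketch of the classical entropy weight method applied to one window of behavioral data. It is not the new weight computation proposed in this paper, whose details are not given here. The function name entropy_weights, the min-max normalization, and the layout of the window (rows are time steps, columns are behavioral indicators such as issues, PRs, and forks) are all assumptions made for the example.

```python
import numpy as np


def entropy_weights(window: np.ndarray) -> np.ndarray:
    """Classical entropy weight method (assumed stand-in, not the paper's method).

    window: 2-D array, rows = time steps in one window, columns = indicators.
    Returns one non-negative weight per indicator; the weights sum to 1.
    """
    x = np.asarray(window, dtype=float)
    n, m = x.shape
    if n < 2:
        raise ValueError("need at least two time steps to compute entropy")

    # Min-max normalise each indicator; a constant column becomes all zeros.
    col_min = x.min(axis=0)
    span = x.max(axis=0) - col_min
    z = (x - col_min) / np.where(span > 0, span, 1.0)

    # Turn each column into a distribution over time steps.
    # A column with no variation is treated as uniform (maximum entropy).
    col_sum = z.sum(axis=0)
    p = np.where(col_sum > 0, z / np.where(col_sum > 0, col_sum, 1.0), 1.0 / n)

    # Shannon entropy per indicator scaled to [0, 1]; 0 * log(0) counts as 0.
    log_p = np.log(p, where=p > 0, out=np.zeros_like(p))
    entropy = -(p * log_p).sum(axis=0) / np.log(n)

    # Lower entropy means the indicator discriminates more, so it weighs more.
    divergence = np.clip(1.0 - entropy, 0.0, None)
    total = divergence.sum()
    if total <= 0:
        return np.full(m, 1.0 / m)  # no indicator is informative: equal weights
    return divergence / total


if __name__ == "__main__":
    # Hypothetical window: 4 days x 3 indicators (illustrative values only).
    demo = np.array([[1, 0, 0], [1, 1, 0], [1, 2, 0], [1, 3, 10]])
    print(entropy_weights(demo))
```

In this sketch an indicator that never changes inside the window receives a weight of zero, while an indicator whose activity is concentrated in a few time steps receives a larger weight. In a streaming setting the function would be called once per sliding window, so only the current window needs to be held in memory.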