Online Clustering Method For Streaming Data Based On Statistical Inference

Posted on: 2024-02-12 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Liu | Full Text: PDF
GTID: 2568306923473414 | Subject: Statistics
Abstract/Summary:
We are in an era of big data, where the rapid development of computer technology makes it easy to collect massive amounts of data but poses a great challenge for storing and computing on it. Data that arrives sequentially, rapidly, continuously, and in large volume is called streaming data. Compared with static big data, real-time processing of streaming data is more prevalent in practice, so effective mining of streaming data promotes efficient and sustainable development of the country, enterprises, and society as a whole. Among data mining tasks, clustering analysis of streaming data is an important problem with great research and practical value. However, traditional clustering methods are not applicable to streaming data because of its fast update speed, large volume, and storage difficulties. Although some solutions have been proposed to extend traditional clustering methods to streaming data, most of them come from computer science rather than from a statistical perspective.

Online learning, a popular approach in statistics for handling streaming data, eliminates the need to store all the raw data; only combinations or summaries of the raw data are kept, which greatly reduces storage pressure and allows real-time processing. However, current online learning methods focus mainly on supervised learning and algorithm optimization and have not been extended to unsupervised learning. Therefore, this thesis applies online learning ideas to unsupervised learning from the perspective of statistical inference and proposes an online clustering method for streaming data.

The method is inspired by convex clustering. The concept of the Individual Center (IC) is proposed by studying cluster centers based on the individual model. The IC estimate is obtained from a ridge fusion quadratic loss function, and an online update form of the IC estimate is then given by combining it with the idea of online learning. Finally, hypothesis testing is used to judge whether the IC equals a given cluster center, and the category of each individual is determined by the magnitude of the P-value, which realizes online clustering of streaming data. In addition, we give the online update algorithm for IC estimation and describe the algorithmic procedure for online clustering of streaming data. We also prove the asymptotic normality of the IC estimator and examine by simulation how well the IC estimates agree with this asymptotic normality. The theoretical results and experimental analysis verify the feasibility and effectiveness of the proposed method.
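
The general pattern the abstract describes (keep only summaries of the stream, then test each individual against candidate cluster centers) can be illustrated with a minimal Python sketch. This is not the thesis's IC estimator: the ridge fusion loss is not given in the abstract, so a plain running mean of scalar observations stands in for the IC estimate, and a normal-approximation z-test stands in for the hypothesis test. The names `RunningSummary`, `p_value`, and `assign` are hypothetical.

```python
import math


class RunningSummary:
    """Running mean and variance of one individual's stream (Welford's update).

    Assumption: a plain sample mean replaces the thesis's IC estimate.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Only n, mean and m2 are stored; the raw observation is discarded.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Unbiased sample variance; requires at least two observations.
        return self.m2 / (self.n - 1)


def p_value(summary, center):
    """Two-sided p-value for H0: individual mean == center (normal approximation)."""
    se = math.sqrt(summary.variance() / summary.n)
    z = (summary.mean - center) / se
    return math.erfc(abs(z) / math.sqrt(2))


def assign(summary, centers):
    """Index of the candidate center with the largest p-value."""
    pvals = [p_value(summary, c) for c in centers]
    return max(range(len(centers)), key=lambda k: pvals[k])
```

For example, feeding the observations 4.8, 5.1, 5.3, 4.9, 5.2, 5.0 into a `RunningSummary` one at a time and calling `assign` with candidate centers `[0.0, 5.0]` selects the second center, since the hypothesis that the mean equals 5.0 is not rejected while the hypothesis that it equals 0.0 is.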
Keywords/Search Tags: Streaming data, Online learning, Convex clustering, Hypothesis testing