
The Research Of High Efficient Data Mining Algorithms For Massive Data Sets

Posted on: 2014-01-07 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Y Shen | Full Text: PDF
GTID: 1228330395992317 | Subject: Computer application technology
Abstract/Summary:
With the rapid development and wide adoption of information technologies, enterprises, government departments, and many other organizations have accumulated large volumes of data. Traditional analysis techniques such as querying and statistics support only basic processing; they cannot perform higher-level analysis that transforms data into knowledge automatically and intelligently. Against this background, data mining has attracted wide attention and intensive research. Data mining is now an important research field that has made significant progress. It is regarded as the process of deriving hidden, previously unknown, and potentially valuable knowledge from large volumes of raw data, and is considered one of the key technologies of future information processing. At present, data mining is not only an important research topic in pattern recognition, machine learning, and related areas, but also a field from which industry expects large returns: although the amount of raw data is enormous, the patterns and knowledge extracted from it are highly meaningful and can bring substantial economic benefits.

With the further development of information technologies and the continuously expanding scale and scope of database applications, massive data sets have become common. Improved data acquisition technologies and the computerized management of enterprises and government have produced more and more such data sets. For these massive data sets, some previously effective data mining algorithms encounter new problems that require further research. For example, many traditional data mining algorithms obtain good results on relatively small data sets but, owing to their high computational complexity, cannot produce final results within an acceptable time on massive data sets.
In particular, some previously effective algorithms cannot run to completion at all, because a massive data set cannot be loaded into memory in its entirety, or because the memory consumed during execution exceeds the memory available to the system. To improve execution efficiency, techniques such as sampling and feature summarization have been applied, but they degrade the quality of the results to some extent. Building on a survey of the relevant data mining algorithms, this thesis conducts in-depth research and analysis focused on the memory bottleneck of association rule mining and on the low efficiency and quality of clustering for massive data sets. The main contributions of the thesis are as follows:

(1) The important research achievements in clustering and association rule mining are surveyed, and the latest progress, key open issues, and development directions for massive data sets are tracked. On this basis, the features, advantages, and disadvantages of current algorithms are compared, and the new challenges of data mining are summarized.

(2) To address the memory bottleneck of association rule mining for massive data sets, a novel algorithm called disk-table-resident FP-tree growth (DTRFP_GROWTH for short) is presented. It improves on FP_GROWTH by storing the FP-tree built during mining on disk through a lightweight DBMS, thereby reducing memory usage. It can successfully mine association rules from massive data sets even at low user-specified support thresholds.

(3) To further improve execution efficiency, a new algorithm called disk-resident B+-tree FP-tree mining (DRBFP_MINE) is presented, which stores the FP-tree on disk directly in a B+-tree index. When memory is insufficient, the algorithm keeps only part of the FP-tree in memory, and the direct use of a B+-tree index gives faster access to FP-tree nodes.
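Contributions (2) and (3) both rest on the idea of keeping the FP-tree on disk rather than in memory. As a rough illustration of that idea only (the schema, helper function, and use of SQLite below are hypothetical stand-ins, not the thesis's actual storage layout), FP-tree nodes can be persisted as rows in an embedded database so the tree never has to fit in RAM:

```python
import sqlite3

# Illustrative sketch: FP-tree nodes as rows (id, item, count, parent) in an
# embedded SQLite database, standing in for the "lightweight DBMS" idea.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE fptree (
    id INTEGER PRIMARY KEY, item TEXT, count INTEGER, parent INTEGER)""")

def insert_transaction(items):
    """Insert one frequency-ordered transaction along a root-to-leaf path."""
    parent = 0  # 0 denotes the virtual root
    for item in items:
        row = conn.execute(
            "SELECT id FROM fptree WHERE item=? AND parent=?",
            (item, parent)).fetchone()
        if row:  # shared prefix: bump the count on the existing node
            conn.execute("UPDATE fptree SET count=count+1 WHERE id=?", (row[0],))
            parent = row[0]
        else:    # new branch: append a node under the current parent
            cur = conn.execute(
                "INSERT INTO fptree (item, count, parent) VALUES (?, 1, ?)",
                (item, parent))
            parent = cur.lastrowid

for t in [["f", "c", "a", "m"], ["f", "c", "a", "b"], ["f", "b"]]:
    insert_transaction(t)

# The shared prefix "f" is a single node whose count is the number of
# transactions passing through it.
print(conn.execute("SELECT count FROM fptree WHERE item='f'").fetchone()[0])
```

The prefix-sharing step is exactly what keeps an FP-tree compact; here every node lookup and update goes through the database, so memory usage stays bounded regardless of tree size, at the cost of disk I/O per node access.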
In addition, the algorithm improves the storage mechanism and strategy for the FP-tree: it stores the tree partially rather than entirely, using a LIFO, bottom-up method, which improves execution efficiency further.

(4) To address the problems of clustering massive data sets, namely low-quality results, unstable results, and slow convergence, a new algorithm called semi-supervised labels one-scan k-means (SSLOKmeans) is presented. To analyze massive data sets, conventional clustering algorithms typically rely on techniques such as sampling and feature summarization; because of this, and because of limitations of the core algorithms, they suffer from low-quality and unstable results and slow convergence. Drawing on the main ideas of semi-supervised learning, this work integrates label sets into a clustering framework for massive data sets. The algorithm uses labels resident in memory to guide the whole clustering process, improving both efficiency and result quality.

(5) Building on the preceding work, probabilistic clustering for massive data sets is studied. A scalable EM probabilistic clustering algorithm based on partial constraint information, called PC_SEM for short, is presented. The preceding work focused on hard clustering, in which each data point belongs to exactly one category; in real-world clustering, however, one object may belong to several categories with different probabilities, and the data sets representing such objects generally overlap and are not well separated.
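The label-guided clustering of contribution (4) can be illustrated with a minimal "seeded" k-means sketch. The function name, the seeding scheme, and the iteration details below are assumptions for illustration only, not the SSLOKmeans algorithm itself; the sketch shows only the core idea of labels steering the assignment step:

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, iters=20):
    """k-means in which a few labeled "seed" points initialize the
    centroids and stay pinned to their clusters in every iteration."""
    X = np.asarray(X, dtype=float)
    labels = np.full(len(X), -1)
    labels[seed_idx] = seed_labels
    # Centroids start as the mean of each label's seed points.
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        labels[seed_idx] = seed_labels  # labeled points never move
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    return labels, centroids

X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.1, 0.2], [4.8, 5.2]]
labels, _ = seeded_kmeans(X, seed_idx=[0, 2], seed_labels=[0, 1], k=2)
print(labels.tolist())  # -> [0, 0, 1, 1, 0, 1]
```

Because the seeds fix both the initial centroids and their cluster identities, the result does not depend on a random initialization, which is one plausible route to the stability improvement the thesis claims.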
Conventional probabilistic clustering algorithms are designed for small data sets and, when applied to massive data sets, suffer from unstable results, low-quality results, and slow convergence, so their execution performance needs further improvement. Drawing on the main ideas of semi-supervised learning, PC_SEM uses partial constraint information, which can be collected from the data sets automatically, to guide the clustering process. Through this constraint information, the efficiency and quality of probabilistic clustering for massive data sets are further improved.

The research on data mining for massive data sets presented in this thesis helps to overcome the memory bottleneck of association rule mining and to improve the efficiency and quality of clustering. It can also serve as a reference for related future research.
Keywords/Search Tags:massive data sets, association rules, disk resident FPTREE, massive datasets mining, semi-supervised learning, scalable EM, probability clustering