Outlier Mining Research, Based On Cloud Theory And Data Space

Posted on: 2006-11-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Q Yu
Full Text: PDF
GTID: 1118360152495002
Subject: Use of agricultural resources
Abstract/Summary:
The world today is in an era of rapid change in which information is used intensively, and the impetus for this change is data. With the rapid development of data acquisition methods, data are constantly collected by all kinds of communication and acquisition equipment. To turn this mass of data into a genuine resource rather than redundancy or rubbish, data mining technology, which discovers knowledge in large databases automatically, rapidly, and effectively and extracts the patterns hidden in them, has emerged and has developed at remarkable speed.

Geographic Information Systems (GIS) are widely regarded as an important means of global spatial data management, and the advent of GIS has aroused interest in developing spatial DBMSs. Based on the characteristics of spatial data, spatial data mining applies conventional data mining technology to GIS on the platform of a spatial database and its basic spatial data analysis functions. An effective spatial data mining process includes data preprocessing, data mining, pattern evaluation, and knowledge presentation.

Spatial data mining is still at the development stage. Some techniques and methods have been proposed both at home and abroad; however, they are not valid for all spatial data mining tasks. For example, current spatial data mining techniques cannot be applied directly to mining the distribution rules of soil attributes. Mining techniques should therefore vary with the data being mined. Many problems in spatial data mining remain to be solved, for example the limited range of preprocessing techniques available before mining, the weakness of clustering algorithms in handling nondeterministic and random data, and the insufficient understanding of outliers and of the techniques for mining them.
The purpose of this research is to propose data mining algorithms suited to the characteristics of spatial data mining, based on an analysis of the uncertainty of spatial data and drawing on conventional data mining methods. The research follows the data mining process:
(1) Eliminate noisy or inconsistent data to improve data quality, and study efficient data sampling methods in order to obtain data that better represent the features of the source data sets for further study.
(2) Analyze the relationship between outliers and clustering, and study conventional data mining algorithms, especially clustering and outlier detection algorithms.
(3) Analyze the many factors that affect the uncertainty of spatial data and the deficiencies of conventional methods for analyzing uncertain data, and present the idea that cloud model theory and data field theory can be applied to spatial data analysis and mining.
(4) Examine the cloud model and data field theory in depth, integrate them with spatial data, and propose spatial data clustering and outlier detection algorithms based on them.
(5) Use the cloud model and data field theory to change the current method of evaluating data mining patterns and improve evaluation precision.
(6) Use the cloud model and data field theory to change the current visualization method and visualize data mining results.

The results of the study are as follows.

1. Spatial data preprocessing

Data preprocessing mainly includes data cleaning, data integration, data conversion, and data reduction. Data cleaning handles missing information and cleans dirty data, finds outliers, and corrects inconsistencies. Data integration merges data from various data sets, resolves semantic ambiguity, and stores the result as consistent data. Data conversion converts data to a form suitable for mining.
Data reduction selects the data sets that need to be mined, which reduces the volume of data to be processed.

To obtain quickly and efficiently a data set that better reflects the features of the data source, the study proposes a new data selection method based on density biased sampling. It scans the data set with an effective density estimation function, evaluates the improved density estimation function by formula, derives the sampling probability of every point in the data set from the improved density estimation function, and makes a pass over the data set with this method to obtain the expected sample. Compared with conventional data selection methods based on simple random sampling, the method is more flexible and accurate, and it yields a smaller sample that nevertheless better reflects the features of the data source.

On the problem of data integration, the research proposes an information pattern integration technique based on XML. An XML document is semi-structured data, whereas the data used to describe information are either structured, for example RDB data and EDI messages, or unstructured, for example text documents and fax documents. Studying the transformation between XML documents and data in other formats is essentially studying how to transform data from one structural hierarchy to another, so the key to the transformation is to establish a good mapping structure. In an XML document, a DTD is used to describe the relations among the elements, and these relations indicate the structure of the document. A tree structure is used to recover the document structure hidden in the DTD, and it depicts the structure of XML documents well. This tree is called the element tree.
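As a rough illustration of the element tree idea, the sketch below recovers parent-child relations from the `<!ELEMENT ...>` declarations of a small DTD and lists the root-to-leaf paths, each of which could serve as one column in a mapping to a flat record. The DTD, the element names, and the helper functions are hypothetical and are not taken from the dissertation; the regular-expression parsing only covers simple, non-recursive DTDs.

```python
import re

# Hypothetical DTD for illustration; the element names are assumptions.
DTD = """
<!ELEMENT survey (site+)>
<!ELEMENT site (name, sample*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT sample (depth, ph)>
<!ELEMENT depth (#PCDATA)>
<!ELEMENT ph (#PCDATA)>
"""

def element_children(dtd_text):
    """Map each declared element to the child element names in its content model."""
    children = {}
    for name, model in re.findall(r"<!ELEMENT\s+([\w.-]+)\s+([^>]+)>", dtd_text):
        names = re.findall(r"[A-Za-z_][\w.-]*", model)
        # Drop content-model keywords; what remains are child element names.
        children[name] = [n for n in names if n not in ("PCDATA", "EMPTY", "ANY")]
    return children

def find_root(children):
    """The root is the declared element that never appears as a child."""
    used = {kid for kids in children.values() for kid in kids}
    roots = [name for name in children if name not in used]
    if len(roots) != 1:
        raise ValueError("expected exactly one root element")
    return roots[0]

def leaf_paths(children, node, prefix=""):
    """List root-to-leaf paths of the element tree (assumes a non-recursive DTD)."""
    path = f"{prefix}/{node}" if prefix else node
    kids = children.get(node, [])
    if not kids:
        return [path]
    paths = []
    for kid in kids:
        paths.extend(leaf_paths(children, kid, path))
    return paths

children = element_children(DTD)
root = find_root(children)
paths = leaf_paths(children, root)
for p in paths:
    print(p)
```

Each printed path identifies one leaf element by its position in the tree, which is the kind of structural information a mapping to relational columns or message fields would need.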
Once the DTD element tree has been obtained and the mapping between XML documents and data in the other format has been established, the transformation between them can be carried out. This research applies XML to information integration through the data transformation method based on the element tree, solves the problem of integrating heterogeneous data sources, and accomplishes the task of pattern integration in data integration.

2. Study on the uncertainty of spatial outliers

This work studies the uncertainty factors of spatial data, including randomness, fuzziness, incompleteness, and indeterminacy, and the basic components of spatial data uncertainty, for example positional uncertainty and attribute uncertainty. Conventional methods for studying uncertainty, such as probability statistics and fuzzy sets, are analyzed. The deficiencies of the conventional methods in spatial data processing are pointed out, along with other problems, for example that the uncertainty does not decrease as the amount of data increases.

The cloud model is therefore introduced to analyze spatial data. The method presented breaks through the limitations of conventional methods and integrates the fuzziness and randomness of natural language organically, so that qualitative and quantitative data can be mapped to each other. In this method, cloud generator algorithms are used mainly to address the uncertainty of spatial data.
This method lays the foundation for uncertainty reasoning and establishes a mutual mapping between quantitative and qualitative data. It mainly includes the positive (forward) cloud generator, the inverse (backward) cloud generator, the X conditional cloud generator, the Y conditional cloud generator, and the error of the cloud generators.

The positive cloud generator is an uncertainty conversion model that converts a basic concept described by a linguistic value into numerical values; it is the mapping from qualitative to quantitative data. Cloud droplets are generated according to the characteristics of the cloud, and they form the cloud once enough of them have accumulated. The inverse cloud generator is an uncertainty conversion model that converts numerical values into a linguistic value; it is the mapping from quantitative to qualitative data. It can effectively convert a certain amount of precise data into a concept denoted by the qualitative linguistic value (Ex, En, He), which stands for the whole cloud reflected by the precise data. The more precise data there are corresponding to cloud droplets, the more precise the concept. In this way the positive and inverse cloud generators establish the mutual mapping between quantitative and qualitative data.

The X conditional cloud generator and the Y conditional cloud generator are the basis of uncertainty reasoning with the cloud model. Their outputs are cloud bands: the X conditional cloud generator produces one band, and the Y conditional cloud generator produces two symmetrical bands, with the mathematical expectation of the cloud as the center of symmetry. The higher the certainty, the closer the droplets lie to the cloud center and the denser they are; the lower the certainty, the sparser the droplets and the farther they lie from the center.
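The positive and inverse generators can be sketched with the normal cloud model as it is commonly described in the cloud model literature. The abstract does not state its exact formulas, so the formulas below, the function names, and the example values (Ex = 20, En = 3, He = 0.3) are assumptions for illustration only.

```python
import math
import random

def forward_cloud(ex, en, he, n, rng):
    """Positive (forward) normal cloud generator: concept (Ex, En, He) -> n droplets.

    Each droplet is a pair (x, mu): a numerical value and its certainty degree.
    """
    drops = []
    while len(drops) < n:
        en_i = rng.gauss(en, he)      # entropy perturbed by the hyper-entropy
        if en_i == 0.0:
            continue                  # avoid division by zero; redraw
        x = rng.gauss(ex, abs(en_i))  # droplet position
        mu = math.exp(-((x - ex) ** 2) / (2 * en_i ** 2))  # certainty degree
        drops.append((x, mu))
    return drops

def backward_cloud(xs):
    """Inverse (backward) cloud generator without certainty degrees.

    Estimates (Ex, En, He) from numerical samples only.
    """
    n = len(xs)
    ex = sum(xs) / n
    # Entropy estimated from the first absolute central moment.
    en = math.sqrt(math.pi / 2) * sum(abs(x - ex) for x in xs) / n
    var = sum((x - ex) ** 2 for x in xs) / (n - 1)
    # Hyper-entropy from the variance left over after the entropy is removed.
    he = math.sqrt(max(var - en ** 2, 0.0))
    return ex, en, he

rng = random.Random(0)
drops = forward_cloud(20.0, 3.0, 0.3, 20000, rng)
ex_hat, en_hat, he_hat = backward_cloud([x for x, _ in drops])
print(ex_hat, en_hat, he_hat)
```

Feeding the droplets of the positive generator into the inverse generator recovers values close to the original Ex and En, which is the round trip between qualitative concept and quantitative data described above. The hyper-entropy estimate is noticeably noisier than the other two.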
The density of cloud droplets thus decreases with distance from the center of the cloud band.

Furthermore, the cloud model is extended in this research to two and three dimensions, so that it can describe concepts expressed in multidimensional linguistic values. Through the cloud model, the analysis and processing of uncertain spatial data is realized.

3. Mining algorithms for spatial outlier analysis based on cloud model theory

The concept of the data field is introduced. The relationship among data energy, radiation of data energy, and field intensity is studied, and their impact on the data field is analyzed. The equipotential lines and surfaces, the potential centers, the natural topological clustering, and the clustering diagrams given by different potential centers are studied in particular. The work integrates the cloud model with data field theory: a single data point is taken as a cloud droplet, and every cloud droplet produces energy, so millions of data points form potential energy that is superposed into a data field. Based on this integration, an outlier detection algorithm is presented that improves the conventional PAM and CLARA clustering algorithms.

The main idea of outlier detection based on cloud theory and the data field is as follows. The whole feature space is taken as a potential energy field produced by the data points, and the droplet stands for the certainty of the data point. If a data point has many adjacent points, its potential energy is high; if a data point has little potential energy, it has few adjacent points and is detected as an outlier.
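The low-potential rule can be illustrated with a small sketch. It assumes a Gaussian-like potential function with unit mass per point and an influence factor `sigma`; the abstract does not give the dissertation's potential function, so this form, the function names, and the sample points are assumptions.

```python
import math

def potentials(points, sigma=1.0):
    """Potential at each point: sum of the contributions of all other points.

    Each contribution is exp(-(distance / sigma)^2), with unit mass assumed.
    """
    result = []
    for i, p in enumerate(points):
        total = 0.0
        for j, q in enumerate(points):
            if i != j:
                total += math.exp(-((math.dist(p, q) / sigma) ** 2))
        result.append(total)
    return result

def lowest_potential_outliers(points, sigma=1.0, count=1):
    """Return the indices of the `count` points with the lowest potential."""
    pot = potentials(points, sigma)
    order = sorted(range(len(points)), key=lambda i: pot[i])
    return order[:count]

# Hypothetical data: five points near the origin and one isolated point.
points = [(0.0, 0.0), (0.5, 0.2), (0.3, -0.4), (-0.2, 0.1), (0.1, 0.6), (8.0, 8.0)]
outliers = lowest_potential_outliers(points, sigma=1.0, count=1)
print(outliers)
```

The isolated point receives almost no potential from the others, so it ranks lowest. The choice of `sigma` controls how far each point's influence reaches and would need tuning on real data.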
The natural topological structure shows how the data points aggregate in the space. The data point that lies furthest from the average potential energy, or that has the least potential energy, is considered an isolated point in the space; such points constitute the outliers.

The core of the improved CLARA algorithm rests on the observation that the centers found in a sample can represent the K centers of the entire data set well if the optimal centers are included in the selected sample. The research applies data field theory to the sampling step of CLARA: it finds K potential energy centers according to the distribution function of the potential energy, ensures that the K optimal centers are included in the sample and processed, and obtains a sample S from the data field on which to run the CLARA clustering algorithm. A comparison of the conventional and improved CLARA algorithms on the same data sets shows that the improved algorithm not only includes the optimal center points of the data set in the sample but also reduces the time needed to obtain a high-quality clustering.

The core of the improved PAM algorithm is to select K initial centers that approach the optimal centers, so that the K cluster centers are found quickly and in less time. In the algorithm, a data field is established for the data space, and the potential energy centers of the K natural clusters are found; these essentially represent the centers of gravity of the clusters. The points in and around the K centers are selected as the initial centers of the PAM algorithm, so that they represent or approach the true cluster centers of the data set as closely as possible.
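One way to picture the seeding step of the improved PAM algorithm is the sketch below, which picks K high-potential data points as initial medoids. The Gaussian-like potential, the rule that chosen points must lie more than `3 * sigma` apart, and the function names are all assumptions made for illustration; the abstract does not specify how the dissertation locates the potential energy centers.

```python
import math

def potentials(points, sigma=1.0):
    """Potential at each point, summed over all points (self-term included).

    The self-term adds the same constant to every point, so the ranking is unchanged.
    """
    return [
        sum(math.exp(-((math.dist(p, q) / sigma) ** 2)) for q in points)
        for p in points
    ]

def initial_medoids(points, k, sigma=1.0):
    """Greedily pick k high-potential points as initial medoids.

    Assumption: a candidate is accepted only if it lies more than 3 * sigma
    from every medoid already chosen, so each pick comes from a different
    dense region.
    """
    pot = potentials(points, sigma)
    order = sorted(range(len(points)), key=lambda i: pot[i], reverse=True)
    chosen = []
    for i in order:
        if all(math.dist(points[i], points[j]) > 3 * sigma for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return chosen

# Hypothetical data: indices 0-4 form one cluster, indices 5-8 another.
points = [
    (0.0, 0.0), (0.4, 0.1), (-0.3, 0.2), (0.1, -0.4), (0.2, 0.3),
    (10.0, 10.0), (10.3, 9.8), (9.7, 10.2), (10.1, 10.4),
]
medoids = initial_medoids(points, k=2, sigma=1.0)
print(medoids)
```

This covers only the initialization; a standard PAM swap phase would then start from these medoids, ideally needing fewer iterations than a random start.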
The improved PAM algorithm avoids or reduces the number of iterations needed to find the K cluster centers, reduces the complexity of the algorithm, and improves its effectiveness.

The improved algorithms above illustrate that cloud theory and data fields handle uncertain spatial data well. This part of the study shows that data in a data field, within a limited domain, deliver their energy by radiation; every datum influences the data field, and the data affect one another. Once data field theory has been applied to data mining and integrated with cloud theory, clustering and outlier analysis of uncertain data can be carried out more effectively.

4. Study on the pattern evaluation of spatial outlier mining

The study follows two criteria for evaluating data mining patterns: first, determine whether one data mining method is better than the others; second, determine whether the generated patterns meet actual needs and are convenient for comprehension and decision-making. Drawing on the characteristics of cloud theory and data field theory, the study presents AG-REX for evaluating spatial outlier mining algorithms and an improved Bayesian method for evaluating the usability of spatial outlier mining patterns.

AG-REX extends G-REX in two respects, which make conventional pattern evaluation methods more suitable for spatial outlier mining:
(1) AG-REX can extract rules not only from neural networks but also from decision trees, and it takes the outlier sets as an input parameter in order to sort the outlier sets obtained from different initial centers.
(2) In the comprehensibility analysis, because an outlier is a singularity of the data field, AG-REX produces the decision tree according to the potential energy centers in cloud theory and data field theory.
AG-REX can be used to evaluate the accuracy of mining methods and to find simple rules that are easily accepted. This makes the method effective for evaluating spatial outlier mining patterns and achieves the expected aim. The AG-REX evaluation indicates that the accuracy of the spatial outlier mining algorithm is high.

Because the conventional Bayesian network evaluation method does not consider the uncertainty of spatial data, this study introduces the ideas of the cloud model and the data field, an effective tool for analyzing uncertain data, to improve the Bayesian evaluation method. The improvement starts from the three characteristic values (Exi, Eni, Eei) of the cloud model, which are taken as prior initial values; this better solves the problem of obtaining prior knowledge in the cloud model and improves the usability of the evaluation. The study concludes that the improved Bayesian network evaluation method can be used to evaluate the usability of spatial outlier mining patterns based on cloud theory and data field theory, and that it improves the precision of evaluation.

5. Study on visualization of spatial outlier mining patterns

Clustering based on cloud theory and the data field gathers similar objects into classes and picks out the outliers among them. The clustering and outlier...
Keywords/Search Tags: spatial outlier, data mining, data uncertainty, data field, cloud theory