The Research And Implementation Of Algorithms On Data Preprocessing In DW

Posted on: 2005-12-01
Degree: Master
Type: Thesis
Country: China
Candidate: C H He
GTID: 2168360125450571
Subject: Computer application technology
Abstract/Summary:
Nowadays, e-commerce not only offers customers a convenient way to trade and a wide selection of goods, but also gives business executives the opportunity to understand customers' requirements and purchasing behavior in depth. The data warehouse (DW), a newer technology for data storage and processing, can carry out many kinds of complex analysis in support of strategic decisions. According to W. H. Inmon, a leading architect in the construction of data warehouse systems, "a data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process." Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A DW has three main functions: first, it provides enterprise-level reports and charts; second, it supports multidimensional analysis; third, it is the foundation for data mining and decision support systems (DSS). In business intelligence, a correct and complete DSS must be based on high-quality data and accurate information. That is, we must ensure the accuracy, integrity, consistency, and completeness of the data in the data warehouse. At present, most database vendors, such as Microsoft, Oracle, and IBM, provide their own data processing products. For example, DTS (Data Transformation Services), provided by Microsoft, is a powerful application that integrates many different data sources and then migrates the data into data warehouses and data marts. On the whole, the transformed data is normalized, but the dataset may still contain outliers and duplicate records that refer to the same entity.
To obtain high-quality data that supports correct and complete strategic business decisions, it is essential to find and delete these outliers and duplicates. Many recent outlier detection algorithms use concepts of proximity to find outliers based on their relationship to the rest of the data. However, in high-dimensional space the data is sparse and the notion of proximity loses its meaning. In fact, the sparsity of high-dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions. Consequently, for high-dimensional data, finding meaningful outliers becomes substantially more complex and non-obvious. This thesis discusses new techniques that find outliers by studying the behavior of projections of the data set, and implements outlier detection for high-dimensional data using an evolutionary algorithm. By analyzing the data distribution of various attribute groups, the potential outliers in the dataset can be detected. During the implementation of the algorithm, the probabilities of selection, crossover, and mutation are set. In addition, the optimized but time-consuming crossover step is improved by changing how individuals are selected for crossover. The current population is divided into two sets, the better individuals and the worse individuals. For each crossover, one individual is selected from the better set and the other from the worse set, and the two are crossed to produce new individuals.
However, if neither of the new individuals is closer to the final solution than its parents, the crossover is abandoned, the parent individuals are restored, and two individuals are reselected for crossover. A second optimization concerns mutation: the worst individual is selected for mutation, and the mutated individual is required to be closer to the final solution than its parent...
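The modified crossover and mutation steps described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation, and several things are assumptions made only for illustration: individuals are tuples of bits, a higher fitness value means "closer to the final solution", the population is split into two halves by fitness rank, crossover is one-point, and a crossover is accepted only when at least one child is fitter than both parents. The function names (`split_population`, `guarded_crossover`, `guarded_mutation`), the `max_tries` limit, and the toy fitness function are hypothetical; the abstract does not say how the thesis scores a projection.

```python
import random


def split_population(population, fitness):
    """Rank by fitness (higher is better, an assumption) and split into halves."""
    ranked = sorted(population, key=fitness, reverse=True)
    mid = len(ranked) // 2
    return ranked[:mid], ranked[mid:]


def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length tuples at a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def guarded_crossover(population, fitness, rng, max_tries=20):
    """Cross one 'better' with one 'worse' individual.

    The children are kept only if at least one of them is fitter than both
    parents; otherwise the attempt is discarded (the parents are untouched)
    and a new pair is selected. Returns (p1, p2, c1, c2) or None.
    """
    better, worse = split_population(population, fitness)
    for _ in range(max_tries):
        p1, p2 = rng.choice(better), rng.choice(worse)
        c1, c2 = one_point_crossover(p1, p2, rng)
        if max(fitness(c1), fitness(c2)) > max(fitness(p1), fitness(p2)):
            return p1, p2, c1, c2
    return None  # no improving crossover found within max_tries


def guarded_mutation(population, fitness, rng, max_tries=20):
    """Flip one bit of the worst individual; keep the result only if fitter."""
    worst = min(range(len(population)), key=lambda i: fitness(population[i]))
    parent = population[worst]
    for _ in range(max_tries):
        pos = rng.randrange(len(parent))
        child = parent[:pos] + (1 - parent[pos],) + parent[pos + 1:]
        if fitness(child) > fitness(parent):
            new_population = list(population)
            new_population[worst] = child
            return new_population
    return list(population)  # no improving mutation found; population unchanged


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy fitness: number of 1 bits. A placeholder, not the thesis's measure.
    toy_fitness = sum
    pop = [(1, 1, 1, 0), (1, 1, 0, 0), (0, 0, 0, 1), (0, 0, 0, 0)]
    print(guarded_crossover(pop, toy_fitness, rng))
    print(guarded_mutation(pop, toy_fitness, rng))
```

Rejecting non-improving offspring costs extra fitness evaluations per attempt, which is why the sketch caps the number of retries; whether the thesis uses such a cap is not stated in the abstract.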
Keywords/Search Tags: Implementation