Sales Data Preprocessing Management System Based On The Knowledge Discovery In Database

Posted on: 2011-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: S Sun
Full Text: PDF
GTID: 2178360305954647
Subject: Software engineering
Abstract/Summary:
A large volume of data is generated around the world every day, and it keeps growing. These data are a vast collection of facts and trends in every field, but they are of little use unless experts can analyze the facts and identify the trends and patterns. Extracting useful information from raw data is a challenging task, and even handling the raw data is complicated.

This study presents a system called the Data Preprocessing Management System, which preprocesses data and manages the preprocessed data stored in databases. It is implemented using SQL, a language for developing database applications. Preprocessing of raw data comprises removal of missing values and discretization. Three techniques are used for missing values: replace missing values, delete missing values, and customization. The replace technique replaces all the missing values of an attribute with its mean, the delete technique deletes all the missing values, and customization replaces missing values with a user-defined value. The discretization techniques implemented are Equal Width Discretization, Equal Frequency Discretization, and customized discretization; they are used to discretize the preprocessed data. This research focuses mainly on the statistics of the data, the preprocessing of raw data, the discretization of preprocessed data, and the management of preprocessed data.

Data mining refers to extracting or mining knowledge from large amounts of data. It is also referred to as Knowledge Discovery from Databases (KDD), knowledge extraction, data/pattern analysis, data archaeology, and data dredging. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Data preprocessing transforms the data into a format that can be processed more easily and effectively for the user's purpose.
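The three missing-value techniques named in this abstract can be illustrated with a short Python sketch. It is not the thesis implementation, which uses SQL. The function names, the use of `None` to mark a missing entry, and the reading of "delete" as dropping the missing entries are all assumptions made for illustration. The most-frequent-value fill for nominal attributes follows the description given later in the abstract.

```python
from collections import Counter
from statistics import mean

MISSING = None  # assumption: a missing entry is represented as None


def delete_missing(values):
    """Delete technique: drop every missing entry."""
    return [v for v in values if v is not MISSING]


def replace_missing(values):
    """Replace technique: fill gaps with the attribute mean.

    For a nominal (non-numeric) attribute the most frequent value is used.
    """
    observed = delete_missing(values)
    if not observed:
        return list(values)  # nothing to estimate a fill value from
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is MISSING else v for v in values]


def customize_missing(values, user_value):
    """Customization technique: fill gaps with a user-defined value."""
    return [user_value if v is MISSING else v for v in values]


# Hypothetical sales attribute with two missing entries.
sales = [120.0, None, 80.0, None, 100.0]
deleted = delete_missing(sales)
replaced = replace_missing(sales)
customized = customize_missing(sales, 0.0)
```

For the hypothetical `sales` list, the delete technique keeps three values, the replace technique fills both gaps with the mean of the observed values (100.0), and customization fills them with the chosen 0.0. Each result corresponds to one of the three files the system is described as producing.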
The techniques to preprocess data include data cleaning, data integration, data transformation, and data reduction. Data cleaning routines clean data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. Data integration combines multiple databases, data cubes, or files when the analysis includes data from multiple sources. Data transformation operations include normalization and data aggregation. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same analytical results. Strategies for data reduction include data aggregation, dimension reduction, data compression, and numerosity reduction.

The data preprocessing management system combines preprocessing of data with management of the preprocessed data. Preprocessing includes removal of missing values and discretization. Depending on the technique used, different files are created, and all of them are managed by the management system. Removing missing values creates three different files, which are stored under different names. On the client side, the user can view and download these files and can select any of them as the input file for discretization. The discretization section likewise creates three different files, stored under different names, which the user can view and download. At the end, a list of all the tables created and the techniques applied is displayed to the user in a table. Before updating the tables in the database, the user can see the distribution produced by each technique, compare the techniques, and select the appropriate one to update the table.
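The comparison step described here can be pictured with a small Python sketch that computes statistics and a value distribution for one attribute, so that the outputs of different techniques can be set side by side. The abstract does not list which statistics the system reports, so the fields below, the `summarize` name, and the use of `None` for missing entries are assumptions.

```python
from collections import Counter
from statistics import mean


def summarize(values):
    """Return basic statistics and a value distribution for one attribute.

    Assumption: missing entries are represented as None.
    """
    observed = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(observed),
        "mean": mean(observed) if observed else None,
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
        "distribution": Counter(observed),
    }


# Hypothetical attribute before and after two missing-value techniques.
raw = [120, None, 80, None, 100]
after_delete = [120, 80, 100]
after_mean_fill = [120, 100, 80, 100, 100]

report = {
    "raw": summarize(raw),
    "delete": summarize(after_delete),
    "replace": summarize(after_mean_fill),
}
```

In this example both techniques leave the mean at 100, but the mean fill raises the frequency of the value 100 from one to three. A distribution view makes that kind of difference visible before a table is committed to the database.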
In this system the user can select any file and compare it with other files using the statistics.

Real-world data contain noise, inconsistencies, and missing values. Many techniques are available to preprocess raw data, but very few editors have tried to manage the preprocessed data. Although many features are available through predefined editors, the whole process still requires the user to fully understand the structure and the associated objects in the definition process. This research preprocesses the data and manages the preprocessed data. Preprocessing of raw data is done in two steps: preprocessing of missing values and discretization of the preprocessed data. Three basic techniques are implemented to preprocess missing values, and the EQW (Equal Width), EQF (Equal Frequency), and customized discretizers are implemented to discretize the data. All the tables created and used are managed in this research. The user can download any table and use it for further processing.

The KDD process extracts knowledge from original data. It begins with the original database, which is used throughout the whole KDD process. Before the data undergo data mining, they must be prepared in a preprocessing step that removes or reduces noise and handles missing values. Relevance analysis, for omitting unnecessary and redundant data, and data transformation, for generalizing the data to higher-level concepts, are also needed. Preprocessing takes the most effort and time, almost 80% of the whole project time for knowledge discovery in databases (KDD). The preprocessing step is vital for successful data mining.

The design of the data preprocessing management system is explained in two sections: the conceptual model and the design of the stages. The conceptual model explains the basic details from the business perspective.
This is explained using an entity-relationship diagram in which the major parts are treated as entities and the methods involved as relationships. The second section, the design of the stages, explains each stage in detail.

The last stage of this research is management. In this stage all the raw and preprocessed data are managed by storing them in different tables and giving the user the option to view and download all the data on the client side. The names of all the tables created and the techniques applied to them are displayed so that the user can use these tables for further processing.

In this section the preprocessing is implemented. On this page, three basic techniques for preprocessing missing values are implemented. All the attributes are displayed in a RadioButtonList, as shown in Figure 4.5. When an attribute is selected, its base table statistics are displayed. The three implemented techniques are delete missing values, replace missing values, and customization. Customization replaces all the missing values with a user-defined value. Delete missing values deletes all the missing values. Replace missing values replaces all the missing values with the mean of the attribute if it is numeric, or with the most frequently occurring value if it is nominal. The user can compare the techniques using the statistics and distribution.

The data preprocessing management system transforms raw data into a format that can be processed more easily and effectively for the user's purpose. Using this system, the user can easily compare the techniques through their distributions and statistics. Data and statistics are managed at every stage for a better understanding of the techniques. Of the three discretization techniques, Equal Width is simple to implement but is sensitive to outliers.
Equal Frequency is also simple to implement and does not suffer from the outlier problem, but the bins do not always contain an equal number of values. With customization, the user can select the bin ranges, but the algorithm has to be improved to consider all the possibilities. More work has to be done in data management. ...
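The trade-offs between the three discretizers can be seen in a minimal Python sketch. It is an illustration under stated assumptions, not the thesis code, which is implemented with SQL. The function names are hypothetical, bins are numbered from 0, the input is assumed to be non-empty, and the rank-based Equal Frequency rule is one simple variant among several (it may split tied values across bins).

```python
from bisect import bisect_right


def equal_width_bins(values, k):
    """Equal Width: split [min, max] into k intervals of the same width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]  # all values identical: one bin
    # The maximum would fall into bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Equal Frequency: assign sorted ranks to k bins of near-equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def custom_bins(values, edges):
    """Customized: user-supplied ascending cut points define the bins."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical attribute with one outlier (100).
data = [1, 2, 3, 4, 5, 6, 100]
by_width = equal_width_bins(data, 3)      # [0, 0, 0, 0, 0, 0, 2]
by_freq = equal_frequency_bins(data, 3)   # [0, 0, 0, 1, 1, 2, 2]
by_custom = custom_bins(data, [3, 10])    # [0, 0, 1, 1, 1, 1, 2]
```

With the outlier present, Equal Width places six of the seven values in the first bin and leaves the middle bin empty. Equal Frequency spreads the values out, but seven values cannot be divided evenly into three bins, so the bin sizes are 3, 2, and 2.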
Keywords/Search Tags: Data mining, KDD, Data pre-processing, Discretization