The Study Of Data Pre-Processing On Sciencepaper Online

Posted on:2011-08-06

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Ma

GTID:2178360305457107

Subject:Management Science and Engineering

Abstract/Summary:

With the development of technology and the widespread use of databases, data has become a valuable resource for modern business. How to use this data effectively and extract useful information from it has become a hot topic in many application fields. Data mining depends first of all on data quality: high-quality data improves the accuracy and effectiveness of mining. Most current data mining research, however, concentrates on algorithms. If research focuses only on mining algorithms and neglects data preprocessing, much of the practical value of data mining is lost. Real-world raw data is dirty, incomplete and inconsistent, and little of it can be fed to an algorithm directly. Real-world databases also contain a great deal of meaningless data, which seriously reduces the speed and efficiency of algorithm execution. Preprocessing the raw data has therefore become an indispensable prerequisite of data mining.
This thesis studies data preprocessing based on Sciencepaper Online. It discusses the importance of data preprocessing techniques in data mining and describes the tasks involved. These tasks are mainly attribute preprocessing, data scrubbing, data transformation, data reduction and discretization. Attribute preprocessing mainly refers to attribute construction and attribute deletion. Attribute construction derives new attributes from the original ones so that they benefit the subsequent mining step. Attribute deletion removes redundant attributes and attributes unrelated to the mining objectives. Data scrubbing is carried out mainly by filling in missing values, eliminating noise, and handling inconsistent and duplicated data.
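The scrubbing steps named above (filling in missing values and removing duplicated records) can be sketched roughly as follows. This is only an illustration: the record layout, the field names `paper_id`, `downloads` and `visits`, and the choice of mean imputation are assumptions made for the example and are not taken from the thesis.

```python
from statistics import mean

# Hypothetical records; field names and values are illustrative only.
records = [
    {"paper_id": 1, "downloads": 120, "visits": 300},
    {"paper_id": 2, "downloads": None, "visits": 150},
    {"paper_id": 2, "downloads": None, "visits": 150},  # exact duplicate
    {"paper_id": 3, "downloads": 60, "visits": None},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing_with_mean(rows, field):
    """Return new records in which missing values of `field` are replaced
    by the mean of the known values (an assumed imputation strategy)."""
    known = [row[field] for row in rows if row[field] is not None]
    fill = mean(known)
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]


cleaned = drop_duplicates(records)
for field in ("downloads", "visits"):
    cleaned = fill_missing_with_mean(cleaned, field)
```

Duplicates are removed before imputation so that a repeated record does not bias the mean used to fill the gaps.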
In this thesis, data scrubbing on Sciencepaper Online mainly covers the handling of missing values in the author and literature attributes, together with the processing of inconsistent data and noisy data. Data transformation converts raw data into a form suitable for mining; here it mainly means data generalization, that is, replacing low-level raw data with higher-level concepts. Data reduction is carried out through numerical reduction of attributes, which further compresses the data to reduce the volume that has to be mined. Discretization partitions attribute values into intervals. Some algorithms require discretized input, and the discretization in this thesis is performed mainly for the subsequent data mining algorithm. This thesis applies data preprocessing to the study of Sciencepaper Online for the first time, and it uses cyclical data preprocessing: the various preprocessing tasks form a cycle rather than being independent of each other. Although the tasks follow a certain order, that order is not fixed.
In this thesis, the study of data preprocessing based on Sciencepaper Online mainly uses cluster analysis and principal component analysis. Cluster analysis is widely used in practice; its principle is simple and its process is clear and easy to understand. Clustering segmentation groups attributes with similar characteristics, examines the correlation coefficients among them and then performs numerical reduction. Based on cluster analysis of downloads, visits, number of authors and other attributes, attribute indicators for literature popularity, literature concern and the literature itself are identified. Through clustering segmentation, the attribute dimension is compressed from the original larger number of dimensions to a smaller one.
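The idea of grouping similar attributes by their correlation coefficients and then reducing them can be illustrated with a small sketch. It is a simplified stand-in, not the clustering procedure of the thesis: the greedy grouping rule, the threshold of 0.9 and the sample values are all assumptions chosen for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def group_correlated(columns, threshold=0.9):
    """Greedy grouping: an attribute joins the first group whose first
    member it correlates with (in absolute value) at or above `threshold`;
    otherwise it starts a new group."""
    groups = []
    for name in columns:
        for group in groups:
            if abs(pearson(columns[name], columns[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical attribute values for five papers.
columns = {
    "downloads": [120, 80, 60, 200, 150],
    "visits": [300, 210, 150, 480, 390],
    "num_authors": [1, 4, 2, 3, 1],
}

groups = group_correlated(columns)
# Numerical reduction: keep one representative attribute per group.
reduced = {group[0]: columns[group[0]] for group in groups}
```

Keeping the first member of each group is the simplest form of reduction; a combined indicator per group would be an alternative.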
This thesis also studies principal component analysis in some depth, covering its definition, basic ideas and steps. It describes the advantages of principal component analysis for data preprocessing and introduces its practical use as a method of dimensionality reduction. The thesis applies principal component analysis to selected attribute indicators of literature popularity and identifies comprehensive indicators that affect the study. These comprehensive indicators retain the main information of the original attribute indicators while greatly reducing the amount of data to be mined.
This thesis applies data preprocessing to the study of Sciencepaper Online for the first time, using cyclical data preprocessing on its literature attributes so that the resulting data better suits the subsequent data mining algorithm. It is only a small study of data preprocessing techniques, offered in the hope of being of some help to future preprocessing work. To obtain high-quality data for mining and to serve decision makers and users better, data preprocessing must come first. Data preprocessing has therefore become an indispensable prerequisite for data mining and a necessary task for practitioners.
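The use of principal component analysis to build a comprehensive indicator from several popularity attributes can be sketched as below. This is a dependency-free illustration under stated assumptions: the attribute names and values are hypothetical, only the first principal component is extracted, and power iteration on the correlation matrix is used in place of a full eigendecomposition (in practice a numerical library would normally be used).

```python
from math import sqrt


def standardize(column):
    """Scale a column to zero mean and unit sample standard deviation."""
    n = len(column)
    m = sum(column) / n
    sd = sqrt(sum((v - m) ** 2 for v in column) / (n - 1))
    return [(v - m) / sd for v in column]


def correlation_matrix(z_columns):
    """Correlation matrix computed from already standardized columns."""
    n = len(z_columns[0])
    return [
        [sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z_columns]
        for zi in z_columns
    ]


def first_component(matrix, iterations=200):
    """Power iteration for the dominant eigenvector and eigenvalue of a
    symmetric matrix. Assumes the start vector is not orthogonal to the
    dominant eigenvector, which holds for the positively correlated
    example data used here."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the unit vector `vec`.
    eigenvalue = sum(
        v * sum(m * w for m, w in zip(row, vec))
        for v, row in zip(vec, matrix)
    )
    return vec, eigenvalue


# Hypothetical popularity indicators for five papers.
indicators = {
    "downloads": [120, 80, 60, 200, 150],
    "visits": [300, 210, 150, 480, 390],
    "comments": [12, 9, 5, 22, 14],
}

z_columns = [standardize(col) for col in indicators.values()]
corr = correlation_matrix(z_columns)
loadings, eigenvalue = first_component(corr)

# Share of total variance carried by the first component.
explained = eigenvalue / len(corr)

# Comprehensive indicator: one score per paper instead of three attributes.
scores = [
    sum(w * z[i] for w, z in zip(loadings, z_columns))
    for i in range(len(z_columns[0]))
]
```

Because the three example attributes are strongly correlated, a single score per paper keeps most of their variance, which is the sense in which the comprehensive indicator reduces the data volume.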

Keywords/Search Tags:

Data mining, data preprocessing, cluster analysis, principal component analysis

Related items

1	Application Of Principal Component Analysis And Clustering In Science And Technology Data Analysis
2	Research On Clustering Preprocessing Of Data Resource And Its Application
3	The Application Of Data Mining In Comprehensive Assessment Of National Area
4	The Application Of Cluster Analysis And Principal Component Regression In Industrial Statistics Data
5	Routing Algorithm Base On Principal Component Analysis Theory For WSN
6	The Application Of Cluster Analysis Algorithm In HMIS
7	Based On The Web Server Log Mining Data Preprocessing Technology Research
8	Diagnosis And Improvement Research Of Enterprise Financial Status Based On Data Mining
9	Validation Research On Principal Component Analysis And Cluster Analysis Of Interval Symbolic Data
10	The Application Of Principal Component Analysis And Neural Network In Industrial Economy Data