
Study On Data Quality Assessment Techniques For Telecom Data Mining

Posted on: 2011-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: X H Wang
Full Text: PDF
GTID: 2178360302483910
Subject: Control theory and control engineering
Abstract/Summary:
In recent years, data mining has been widely used in telecommunications as an effective method of knowledge extraction, for example in telephone fee fraud detection, customer segmentation, customer churn prediction, and cross-selling. However, telecom data is usually of poor quality and cannot meet the requirements of data mining, which is why there are so few successful cases of telecom data mining. Poor data quality has become the bottleneck in applying data mining to the field of telecommunications, so data quality should be assessed beforehand to gauge whether mining is feasible and to avoid wasting time and effort.

There are many research results on data quality assessment, but most focus on theoretical frameworks and are seldom tied to specific business backgrounds and applications. To date, no specialized literature discusses data quality assessment oriented to a specific mining subject. Targeting insolvency mining, one of the most common subjects of telecom data mining, and building on an in-depth study of how missing values and outliers affect classification results, this thesis investigates data quality assessment techniques oriented to data mining. The main research work is as follows.

1. For Missing Evaluation, the concept of Class Distribution is proposed to measure the relationship between an input attribute and the target attribute. On that basis, a Class-distribution-based Attribute Weighting Algorithm (CAWA) is presented, which distinguishes the importance of different input attributes. Based on CAWA, an Attribute-weight-based Missing Evaluation Algorithm (AMEA) is presented to carry out Missing Evaluation. The experimental results show that this algorithm reasonably measures the effect of missing values on mining results.

2. For Outlier Evaluation, the characteristics of telecom data are considered, especially the class imbalance of insolvency data. The effect of outliers on classification results in imbalanced datasets is analyzed, and the concept of Outlier Degree (OD) is proposed in combination with the Hyper-graph Outlier Test (HOT) algorithm. On that basis, an Imbalanced Outlier Evaluation Algorithm (IOEA) is presented to carry out Outlier Evaluation. The experimental results show that this algorithm reasonably measures the effect of outliers on mining results.

3. Based on Missing Evaluation and Outlier Evaluation, and taking into account the characteristics of telecom insolvency data mining, a relatively complete data quality assessment system is presented. The system consists of a Missing Evaluation sub-system and an Outlier Evaluation sub-system. Based on the experiments and on the experience of telecom experts, a reference value for the assessment point vector is given. The experimental results show that this reference value provides meaningful guidance for mining feasibility analysis.
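The idea behind the Missing Evaluation contribution, weighting attributes by their relationship to the target and then scoring missing values by those weights, can be illustrated with a minimal sketch. The abstract does not give the definitions used in CAWA or AMEA, so everything below is an assumption made for illustration: the function names `attribute_weights` and `missing_score`, the use of total variation distance as the class-distribution measure, and the toy column names `arrears`, `region`, and `insolvent` are all hypothetical and are not the thesis's actual algorithms.

```python
from collections import Counter, defaultdict


def attribute_weights(rows, attrs, target):
    """Hypothetical stand-in for class-distribution-based weighting.

    An attribute gets a larger weight when the class distribution inside
    each of its values differs more from the overall class distribution.
    Assumes the target is present in every row; None marks a missing value.
    """
    overall = Counter(r[target] for r in rows)
    n = len(rows)
    classes = list(overall)
    raw = {}
    for a in attrs:
        # Class counts per observed value of attribute a.
        groups = defaultdict(Counter)
        for r in rows:
            if r[a] is not None:
                groups[r[a]][r[target]] += 1
        observed = sum(sum(c.values()) for c in groups.values())
        deviation = 0.0
        for counts in groups.values():
            size = sum(counts.values())
            # Total variation distance between this value's class
            # distribution and the overall one (an assumed measure).
            tv = 0.5 * sum(abs(counts[c] / size - overall[c] / n) for c in classes)
            deviation += (size / observed) * tv
        raw[a] = deviation
    total = sum(raw.values())
    if total == 0:
        # No attribute is informative: fall back to equal weights.
        return {a: 1.0 / len(attrs) for a in attrs}
    return {a: raw[a] / total for a in attrs}


def missing_score(rows, weights):
    """Weighted sum of per-attribute missing rates (higher means worse)."""
    n = len(rows)
    return sum(
        w * sum(1 for r in rows if r[a] is None) / n
        for a, w in weights.items()
    )


# Toy data: "arrears" separates the classes, "region" does not.
complete = [
    {"arrears": "high", "region": "N", "insolvent": 1},
    {"arrears": "high", "region": "S", "insolvent": 1},
    {"arrears": "low", "region": "N", "insolvent": 0},
    {"arrears": "low", "region": "S", "insolvent": 0},
]
weights = attribute_weights(complete, ["arrears", "region"], "insolvent")

# Two damaged copies, each missing half of one attribute.
missing_arrears = [dict(r, arrears=None) if i < 2 else r for i, r in enumerate(complete)]
missing_region = [dict(r, region=None) if i < 2 else r for i, r in enumerate(complete)]
```

Under these assumptions, the same missing rate costs more when it falls on an attribute that is strongly related to the target, which is the kind of distinction the abstract attributes to attribute-weight-based missing evaluation.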
Keywords/Search Tags:Telecom, Data Mining, Insolvency, Data Quality Assessment, Missing Value, Imbalanced Data, Outlier Degree