Research On Data Source Selection Technology For Missing Value Filling

Posted on:2021-04-18

Degree:Master

Type:Thesis

Country:China

Candidate:H Z Xie

Full Text:PDF

GTID:2438330602998309

Subject:Computer technology

Abstract/Summary:

With the arrival of the big data era, information science and technology are developing rapidly, and the scale of data is growing quickly. However, data quality problems are widespread in massive data. Data quality is one of the important standards for judging whether data is fit for use, and it is normally evaluated with the metrics of completeness, accuracy, consistency, and freshness. Poor-quality data seriously affects information applications in the big data era: misunderstood information causes considerable inconvenience and can even lead to disastrous consequences. There is therefore an urgent need to address data quality issues, which have become a hot research direction.

Incomplete data is one important aspect of the data quality problem, and ensuring data completeness has become more and more important, because in many scenarios the completeness of a query answer must be guaranteed. Specifically, incomplete data means that the data set does not contain enough information to answer a query; it is divided into missing attribute values and missing tuples. How to fill data based on completeness is one of the basic problems in data quality research. Existing methods fill missing values only by computation or random filling, which cannot fill them accurately, and they do not take data completeness into account. Therefore, this paper proposes a completeness-based strategy that fills missing values using other data sources. Considering tuple completeness and attribute value completeness at the same time, this paper systematically studies how to fill missing values based on completeness. With the rapid development of modern information technology, the number of data sources is also increasing. Accessing too many data sources brings huge overhead, which makes the cost of filling missing values from data sources too high. Therefore, this paper focuses on how to choose suitable data sources.

1. This paper studies a new data source selection strategy for filling missing attribute values. For completeness-based filling of missing attribute values, it proposes a data source selection strategy based on min-hash signatures. First, a gain model for data source selection is defined over the missing attribute values, so as to maximize the gain of the selected data sources. Then the min-hash technique is used to select data sources efficiently from their signatures, without accessing the data sources themselves. An approximate greedy algorithm is designed to solve this NP-hard problem. The designed algorithm is comparable to the traditional greedy algorithm in accuracy and clearly more efficient. Experiments on real and synthetic datasets demonstrate the superiority of the proposed min-hash-based data source selection method in accuracy and efficiency.

2. This paper studies a new data source selection strategy for filling missing tuples. For completeness-based filling of missing tuples, it proposes a data source selection strategy based on a genetic algorithm. First, a gain model for data source selection is defined over the missing tuples, so as to maximize the gain of the selected data sources. Then a strategy is proposed that searches for the optimal data sources with a genetic algorithm, which ensures the completeness of the target data source after filling. The problem is transformed into a 0-1 integer programming problem, and the genetic algorithm repeatedly applies crossover and mutation to select the most suitable data sources for filling. In experiments on real and synthetic datasets, the algorithm searches effectively for an optimal solution and shows good performance and high scalability.
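
The signature-based selection in contribution 1 can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not give the gain model, so the estimated Jaccard similarity between the target's missing cells and the union of the selected sources is used here as a stand-in gain, and all names (minhash_signature, merge, estimate_jaccard, select_sources, the cell keys) are hypothetical. It relies on the standard min-hash property that the signature of a union is the element-wise minimum of the signatures, so candidate unions can be scored without reading the sources.

```python
import hashlib


def minhash_signature(items, num_hashes=128):
    """Min-hash signature of a non-empty set of string keys."""
    signature = []
    for i in range(num_hashes):
        salt = i.to_bytes(8, "little")  # one salted hash function per position
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def merge(sig_a, sig_b):
    """Signature of the union of two sets: element-wise minimum."""
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def select_sources(target_sig, source_sigs, budget):
    """Greedy selection on signatures only (assumed gain: estimated Jaccard)."""
    selected, merged, best_score = [], None, 0.0
    while len(selected) < budget:
        best_name, best_merged = None, None
        for name, sig in source_sigs.items():
            if name in selected:
                continue
            candidate = sig if merged is None else merge(merged, sig)
            score = estimate_jaccard(target_sig, candidate)
            if score > best_score:  # keep only strict improvements
                best_name, best_merged, best_score = name, candidate, score
        if best_name is None:  # no remaining source improves the estimate
            break
        selected.append(best_name)
        merged = best_merged
    return selected


# Hypothetical example: cell keys name the missing (tuple, attribute) cells.
missing_cells = {f"c{i}" for i in range(20)}
sources = {
    "A": {f"c{i}" for i in range(10)},
    "B": {f"c{i}" for i in range(10, 20)},
    "C": {f"x{i}" for i in range(20)},
}
signatures = {name: minhash_signature(cells) for name, cells in sources.items()}
print(sorted(select_sources(minhash_signature(missing_cells), signatures, budget=3)))
```

Only the fixed-length signatures are compared during selection, so the cost of each greedy step depends on the number of hash functions rather than on the size of the sources.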

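The 0-1 formulation in contribution 2 can be sketched in the same spirit. Each bit of a chromosome says whether one data source is selected, and crossover and mutation search for a high-gain selection. The gain used below (missing tuples covered minus a weighted access cost) is an assumption made for illustration, since the abstract does not state the thesis's gain model, and the names fitness, genetic_select, and all parameters are hypothetical.

```python
import random


def fitness(bits, sources, missing, costs, cost_weight):
    """Assumed gain: missing tuples covered minus weighted access cost."""
    covered = set()
    cost = 0.0
    for bit, name in zip(bits, sources):
        if bit:
            covered |= sources[name] & missing
            cost += costs[name]
    return len(covered) - cost_weight * cost


def genetic_select(sources, missing, costs, cost_weight=1.0,
                   pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    """Search 0-1 selection vectors with tournament selection, one-point
    crossover, bit-flip mutation, and elitism."""
    rng = random.Random(seed)
    n = len(sources)

    def score(bits):
        return fitness(bits, sources, missing, costs, cost_weight)

    # Start from the empty selection plus random chromosomes.
    population = [[0] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        elite = max(population, key=score)
        children = [elite[:]]  # elitism: the best chromosome always survives
        while len(children) < pop_size:
            p1 = max(rng.sample(population, 2), key=score)  # tournament of two
            p2 = max(rng.sample(population, 2), key=score)
            cut = rng.randint(1, n - 1) if n > 1 else 0
            child = p1[:cut] + p2[cut:]  # one-point crossover
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    return max(population, key=score)


# Hypothetical example: three sources, four missing tuples.
sources = {"A": {"t1", "t2", "t3"}, "B": {"t3", "t4"}, "C": {"t9"}}
missing = {"t1", "t2", "t3", "t4"}
costs = {"A": 1.0, "B": 1.0, "C": 1.0}
best = genetic_select(sources, missing, costs, cost_weight=0.5)
print(dict(zip(sources, best)))
```

A genetic algorithm is a heuristic, so the returned selection is the best one found within the given number of generations rather than a proven optimum.
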
Keywords/Search Tags:

Data quality, Data completeness, Source selection, Min-hash, Genetic algorithm

Related items

1	Research On A Model Of Data Completeness And Evaluating Algorithms
2	Accuracy and Completeness as Measures of the Quality of Volunteered Point-Feature Geospatial Data and Evaluation of the Effect of Demographics on that Quality
3	Research On Deep Web Data Source Selection Method Based On Sampling
4	Research Of Data Source Selection With Similar Theme In Deep Web Integrated System
5	Data Quality Assessment Model And Quality Propagation For Relational Database
6	Technology For Answering Queries On Incomplete Data
7	Research On Data Source Selection Algorithm For Inconsistency Detection
8	Research On Data Source Selection And Result Cache On Deep Web
9	The Research Of Data Mining Based On Enterprise Data Warehouse Of Telecom
10	Research On Data Source Quality For Sensor Cloud