
Research On Data Source Selection Technology For Missing Value Filling

Posted on: 2021-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: H Z Xie
Full Text: PDF
GTID: 2438330602998309
Subject: Computer technology
Abstract/Summary:
With the arrival of the big data era, information science and technology are developing rapidly, and the volume of data is growing quickly. In massive data, however, quality problems are widespread. Data quality is one of the main standards for judging whether data is fit for use, and it is usually evaluated with the metrics of completeness, accuracy, consistency, and freshness. Poor-quality data seriously affects information applications: misunderstood information causes inconvenience and can even lead to disastrous consequences. Measures to address data quality are therefore urgently needed, and data quality has become an active research direction.

Incomplete data is one of the important aspects of the data quality problem, and ensuring data completeness has become more and more important, because many scenarios require the answer to a query to be complete. Specifically, incomplete data means that the data set does not contain enough information to answer the query; the missing information is either missing attribute values or missing tuples. How to fill data with completeness in mind is one of the basic problems in data quality research. Existing methods fill missing values only by computation or by random filling, which cannot fill them accurately, and they do not take data completeness into account. This paper therefore proposes a completeness-based strategy that uses other data sources to fill missing values. Considering tuple completeness and attribute value completeness together, it systematically studies how to fill missing values based on completeness.

With the rapid development of modern information technology, the number of data sources is also increasing. Accessing too many data sources brings huge overhead, which makes the cost of filling missing values from data sources too high. This paper therefore focuses on how to choose suitable data sources.

1. This paper studies a new data source selection strategy for filling missing attribute values. For completeness-based filling of missing attribute values, it proposes a data source selection strategy based on min-hash signatures. First, we define a gain model for data source selection based on missing attribute values, with the goal of maximizing the gain of the selected data sources. Then the min-hash technique is used to select data sources efficiently from their signatures, without accessing the data sources themselves. An approximate greedy algorithm is designed to solve this NP-hard problem. The designed algorithm is comparable to the traditional greedy algorithm in accuracy and clearly more efficient. Experiments on real and synthetic datasets show the advantage of the proposed min-hash-signature-based selection method in both accuracy and efficiency.

2. This paper studies a new data source selection strategy for filling missing tuples. For completeness-based filling of missing tuples, it proposes a data source selection strategy based on a genetic algorithm. First, we define a gain model for data source selection based on missing tuples, with the goal of maximizing the gain of the selected data sources. Then we propose a strategy that searches for the optimal data sources with a genetic algorithm, which ensures the completeness of the target data source after it is filled. We transform the problem into a 0-1 integer programming problem and use a genetic algorithm, with repeated crossover and mutation, to select the most suitable data sources for filling. In experiments on real and synthetic datasets, the algorithm finds high-quality solutions and shows good performance and high scalability.
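The signature-based selection described in the first contribution can be illustrated with a small sketch. The abstract does not give the thesis's gain model or algorithm, so everything below is an assumption made for illustration: a missing value is identified by a (tuple key, attribute) cell, each source publishes a min-hash signature and the size of its cell set as metadata, the gain of a set of sources is the estimated number of missing cells they cover, and all function names and the signature length are hypothetical. It is a plain greedy loop over min-hash estimates, not the author's approximate greedy algorithm.

```python
import hashlib

NUM_HASHES = 128  # signature length; an illustrative choice, not from the thesis


def _hash(cell, seed):
    # One member of a family of hash functions, derived by salting BLAKE2b.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(repr(cell).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def signature(cells):
    # Min-hash signature of a non-empty set of cells.
    return [min(_hash(c, i) for c in cells) for i in range(NUM_HASHES)]


def jaccard(sig_a, sig_b):
    # Fraction of matching positions estimates |A & B| / |A | B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def merge(sig_a, sig_b):
    # The signature of a union is the element-wise minimum.
    return [min(a, b) for a, b in zip(sig_a, sig_b)]


def union_size(size_a, size_b, j):
    # From J = |A & B| / |A | B| and |A| + |B| = |A | B| + |A & B|.
    return (size_a + size_b) / (1 + j)


def overlap(size_a, size_b, j):
    return j * (size_a + size_b) / (1 + j)


def greedy_select(missing, sources, budget):
    """Pick up to `budget` sources using only signatures and set sizes.

    missing: (signature, size) of the missing cells in the target.
    sources: name -> (signature, size) of the cells each source can supply.
    """
    m_sig, m_size = missing
    chosen, u_sig, u_size, covered = [], None, 0.0, 0.0
    for _ in range(budget):
        best = None
        for name, (s_sig, s_size) in sources.items():
            if name in chosen:
                continue
            if u_sig is None:
                c_sig, c_size = s_sig, float(s_size)
            else:
                c_sig = merge(u_sig, s_sig)
                c_size = union_size(u_size, s_size, jaccard(u_sig, s_sig))
            # Marginal gain: newly covered missing cells (estimated).
            gain = overlap(c_size, m_size, jaccard(c_sig, m_sig)) - covered
            if best is None or gain > best[0]:
                best = (gain, name, c_sig, c_size)
        if best is None or best[0] <= 0:
            break
        gain, name, u_sig, u_size = best
        chosen.append(name)
        covered += gain
    return chosen


# Toy data: 100 missing "phone" cells; A and B together cover all of them,
# while C is large but mostly irrelevant.
missing_cells = {(i, "phone") for i in range(100)}
source_cells = {
    "A": {(i, "phone") for i in range(60)},
    "B": {(i, "phone") for i in range(60, 100)},
    "C": {(i, "phone") for i in range(10)} | {(i, "phone") for i in range(1000, 1200)},
}
missing = (signature(missing_cells), len(missing_cells))
sources = {n: (signature(c), len(c)) for n, c in source_cells.items()}
chosen = greedy_select(missing, sources, budget=2)
print(sorted(chosen))
```

In this sketch the raw cell sets are touched only once, when the signatures are built; the selection loop itself works on fixed-length signatures, which is the property the abstract attributes to the min-hash approach. The estimates are noisy, so a longer signature trades memory for accuracy.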
Keywords/Search Tags: Data quality, Data completeness, Source selection, Min-hash, Genetic algorithm