Research On Missing Data Filling Method Based On Shared Knowledge

Posted on:2022-01-15

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Guo

Full Text:PDF

GTID:2518306527998529

Subject:Computer technology

Abstract/Summary:

Missing data is now an unavoidable factor in dataset quality. It is common across many fields, reduces experimental precision, and limits the effectiveness of experimental research and follow-up work. With the growing popularity of big data research, handling missing data has therefore become a hot topic in data processing. Foreign scholars have carried out many investigations and proposed many far-reaching processing methods. Domestic scholars subsequently began to study the problem as well, but most of their methods are improvements on foreign results. As datasets in various fields grow rapidly, traditional methods have become difficult to apply to large-scale datasets, and conventional approaches such as simple deletion and mean value filling can no longer meet research needs in many fields.

To address these problems, this thesis first introduces the research significance of missing data processing and the current state of research at home and abroad, and then systematically analyzes the causes and classification of missing data. It also analyzes recent processing methods in detail, including their advantages and disadvantages, application fields, and evaluation metrics, with a focus on filling methods such as EM filling and cluster filling. Traditional filling methods for big data use a simple similarity measure that usually considers only the internal relationships within the original dataset, so the filled data is easily restricted by that dataset. The filled data thus loses its original characteristics, which leads to biased filling results.

In response, this thesis proposes a new concept, "shared knowledge." Based on shared knowledge, the method first builds a sharing relationship between incomplete datasets and heterogeneous, similar complete datasets, and then establishes a shared information system. It also proposes a new similarity metric to build similarity relationships among different objects, so that objects in the incomplete dataset are filled using objects from heterogeneous, similar complete datasets. Two similar datasets related to the World Happiness Index were collected from different platforms, and a simulation experiment was carried out to demonstrate the effectiveness of the proposed method. The experimental results show that the proposed shared similarity measure outperforms the traditional numerical similarity metric and is better suited to similarity measurement on today's large-scale data. Compared with other traditional filling algorithms, the proposed method stably keeps the filling accuracy for missing values above 0.85 and the root mean square error below 0.15, which fully retains the objectivity of the filled values and gives a better filling effect. By discussing traditional methods, the thesis proposes a new idea for missing data processing and offers a new direction for the field. The results also show that the method can better handle large-scale missing data in different fields.
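The general idea of filling an incomplete dataset from a similar complete one can be illustrated with a minimal Python sketch. The abstract does not define the thesis's shared similarity metric, so this sketch substitutes a plain root-mean-square distance over the attributes two records have in common. The function names (`shared_distance`, `fill_from_donors`, `rmse`), the attribute names, and the sample values are all hypothetical and are not taken from the thesis.

```python
import math

def shared_distance(a, b, keys):
    """Root-mean-square distance over attributes observed in both records.

    Stand-in for the thesis's similarity metric, which the abstract
    does not specify. Returns infinity if no attribute is shared.
    """
    common = [k for k in keys if a.get(k) is not None and b.get(k) is not None]
    if not common:
        return math.inf
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in common) / len(common))

def fill_from_donors(incomplete, donors, keys):
    """Fill each missing attribute from the closest record in a complete dataset."""
    filled = []
    for record in incomplete:
        out = dict(record)
        missing = [k for k in keys if record.get(k) is None]
        if missing:
            # Pick the complete record most similar on the observed attributes.
            best = min(donors, key=lambda d: shared_distance(record, d, keys))
            for k in missing:
                out[k] = best[k]
        filled.append(out)
    return filled

def rmse(truth, predicted):
    """Root mean square error between true and filled values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)
    )

# Hypothetical, normalized happiness-style attributes (illustrative values only).
keys = ["gdp", "health", "freedom"]
donors = [
    {"gdp": 0.9, "health": 0.8, "freedom": 0.7},
    {"gdp": 0.2, "health": 0.3, "freedom": 0.4},
]
incomplete = [
    {"gdp": 0.85, "health": None, "freedom": 0.75},
    {"gdp": 0.25, "health": 0.35, "freedom": None},
]

filled = fill_from_donors(incomplete, donors, keys)
# Compare the filled values against assumed ground truth for the two gaps.
error = rmse([0.82, 0.45], [filled[0]["health"], filled[1]["freedom"]])
```

In this sketch each gap is copied from the single nearest complete record. A weighted average over several near neighbours would be a natural variation, but whether the thesis does either is not stated in the abstract.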

Keywords/Search Tags:

large-scale data, missing data, shared knowledge, heterogeneity, similarity, data filling

Related items

1	Research On Improvement ELM Based Filling Approach Of Missing Data
2	Research On Hybrid Algorithm Of Slope One Based On Predicating And Filling Missing-Data By Iterated
3	Research On Missing Data Recovery In Large-scale, Sparse Datacenter Traces
4	Research On Fault Diagnosis And Data Filling Algorithm Based On Deep Learning
5	Robust H_∞ Control For Uncertain Systems With Both Measurement Data And Control Data Missing
6	Some Studies On Subsampling And Variable Selection In Large-scale Data
7	A Study On Supervised (Transfer Learning) Clustering For Large Scale Data
8	Design And Implementation On Memory Management Module For Large-scale Seismic Data
9	Research And Implementation Of Data Cleansing Based On Clustering Algorithm
10	Research On Key Technologies For 3D Visualization Of Large-scale GIS Data