Research On Sorted-neighborhood Method And Its Application In Chinese Data Cleaning

Posted on:2019-07-06

Degree:Master

Type:Thesis

Country:China

Candidate:P G Zhang

GTID:2428330566974196

Subject:Software engineering

Abstract/Summary:

With the development of Internet technology, data analysis plays an important role in all walks of life. Obtaining a complete and stable data source has gradually become a focus of attention in data analysis. Dirty data, including redundant, missing, uncertain, and inconsistent values, directly affects the accuracy of subsequent data mining and the correctness of decision-making, so the importance of data cleaning is self-evident. Because Chinese is the second largest language in the world, Chinese data cleaning has attracted wide attention from scholars. This thesis studies Chinese data cleaning based on the Sorted-neighborhood Method.

Data cleaning covers many tasks, of which repeated value cleaning is an important and challenging one. Early research on Chinese repeated value cleaning directly applied methods that had worked well for English. However, because of differences in semantics and usage, this approach exposes two main problems: the traditional cleaning algorithm cannot adapt to the Chinese semantic environment, and it cannot effectively handle common Chinese homophones and synonyms, so the final cleaning results deviate from the raw data. To address these shortcomings, this thesis proposes a Chinese data cleaning algorithm based on the Sorted-neighborhood Method.

The traditional Sorted-neighborhood Method is first applied directly to Chinese data cleaning, and its accuracy turns out to be much lower than on English data. Analysis shows that Chinese semantics is based on words, whereas the traditional Sorted-neighborhood Method cannot calculate similarity in units of words; it also cannot effectively judge the similarity between synonyms. The proposed improvement is as follows: when edit distance is used to calculate similarity, Chinese word segmentation changes the unit of comparison from a single Chinese character to a word, which fits the Chinese semantic environment. On this basis, a synonym lexicon is introduced into the similarity calculation and serves as the standard for deciding whether two words are synonyms.

The experimental results show that the improved Sorted-neighborhood Method calculates similarity in units of words, reflecting the fact that Chinese semantics is based on words. This reduces the number of comparisons during calculation, shortens the running time of the algorithm, and paves the way for comparing synonyms. The accuracy of the improved algorithm is higher than that of the traditional Sorted-neighborhood Method, and the expected effect is achieved.
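
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the toy synonym lexicon `SYNONYM_GROUP`, the sort key, the window size, and the similarity threshold are all assumptions made for illustration. Records are assumed to be already segmented into words by some Chinese word segmenter. The sketch computes edit distance over words instead of characters, treats two words in the same synonym group as equal, and compares each record only with its neighbors inside a sliding window over the sorted list.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy synonym lexicon: words sharing a group id count as synonyms.
SYNONYM_GROUP: Dict[str, int] = {"电脑": 1, "计算机": 1, "公司": 2, "企业": 2}


def same_word(a: str, b: str) -> bool:
    """Two units match if they are identical or listed in the same synonym group."""
    if a == b:
        return True
    group = SYNONYM_GROUP.get(a)
    return group is not None and group == SYNONYM_GROUP.get(b)


def word_edit_distance(x: Sequence[str], y: Sequence[str]) -> int:
    """Levenshtein distance where each unit is a whole token (a word)."""
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, 1):
        cur = [i]
        for j, wy in enumerate(y, 1):
            cost = 0 if same_word(wx, wy) else 1
            cur.append(min(prev[j] + 1,          # delete wx
                           cur[j - 1] + 1,       # insert wy
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(x: Sequence[str], y: Sequence[str]) -> float:
    """Normalize the distance to [0, 1]; 1.0 means the records match."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(x, y) / longest


def sorted_neighborhood(records: List[List[str]],
                        window: int = 3,
                        threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Return index pairs judged to be duplicates.

    Records are sorted by an assumed key (the concatenated text), and each
    record is compared only with the next `window - 1` records in that order.
    """
    order = sorted(range(len(records)), key=lambda i: "".join(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


records = [
    ["北京", "电脑", "公司"],
    ["上海", "贸易", "公司"],
    ["北京", "计算机", "公司"],
]
duplicates = sorted_neighborhood(records, window=2, threshold=0.8)
```

In this toy data the first and third records differ only by the synonym pair 电脑 / 计算机, so their word-level distance is zero, while a character-level comparison of the same two strings would count three edits. Note that a sort key built from raw text can place synonymous records far apart, so the choice of key and window size matters in practice.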

Keywords/Search Tags:

Data cleaning, Repeated value cleaning, Sorted-neighborhood Method, Edit distance

Related items

1	Research On Related Algorithms For Chinese Repeated Record Cleaning
2	Research On Data Cleaning Of Website Based On Hadoop Architecture
3	Research And Implementation Of Web Data Storage And Data Cleaning Technology Based On XML
4	Research On Technologies Of Duplicate Record Data Cleaning In Big Data Environment
5	Research On Data Cleaning Based On Science And Technology Innovation Big Data Public Platform
6	Research On Data Cleaning Technology With The Design And Implementation Of Data Cleaning Framework
7	Application Research Of Fine Cleaning Technology In Manufacturing Of TFT-LCD
8	The Research Of Data Cleaning For Data House And Data Mining
9	Research Of Methods Of Data Cleaning For Hotel Entity Based On Edit Distance And Conditional Functional Dependencies
10	Research On Mechanisms Of Laser Rust Removal And Manufacture Of Laser Cleaning Devices