
Research On Technologies Of Duplicate Record Data Cleaning In Big Data Environment

Posted on: 2020-10-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Li
Full Text: PDF
GTID: 2428330590495829
Subject: Electronic and communication engineering

Abstract/Summary:
With the rapid development of the Internet and the mobile Internet, data volumes are growing at an extremely fast pace, and the world has entered the era of big data. Big data carries great value, and people increasingly hope to extract useful information from massive data for management, decision-making and regulation. Data mining generally assumes that the data are "clean" and consistent; in reality, however, collected data are often redundant, incomplete, erroneous and inconsistent. These problems lower data quality, seriously distort the results of data mining and thus hinder sound decision-making. Preprocessing the collected big data to improve its quality is therefore of great importance for data mining.

In big data preprocessing, data cleaning is one of the main means of ensuring data quality. Within data cleaning, the cleaning of similar duplicate records is widely used for deduplication, removing large amounts of redundant data and playing a vital role in improving data quality. Similarity detection is the foundation of this process. Current research focuses mainly on the literal similarity of strings; although some results have been achieved, the accuracy of existing methods is not high, while research on semantic word similarity detection is comparatively scarce and the related methods still have many shortcomings. Studying high-accuracy similarity detection methods is therefore of great significance for improving data quality.

This thesis aims to improve the accuracy of data similarity detection. For Chinese and English text in big data, it studies the literal string similarity detection method and the semantic word similarity detection method in depth. The main contents and innovations are as follows.

For literal string similarity detection, an improved edit-distance-based method is proposed. On top of the classical edit-distance similarity measure, the new method also considers the influence of common subsequences and common substrings on similarity: it combines the edit distance, the longest common subsequence and the longest common substring into a new string similarity expression. Experimental results show that the new measure yields a more moderate standard deviation and range, so the computed similarities are more reasonable, and that the method is more accurate, more flexible and of good practical value. A code sketch of this idea follows.
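Below is a minimal sketch, in Python, of such a combined string-similarity measure. The thesis's exact combination expression and weights are not reproduced here; the equal weighting of the three normalized components, the normalization by the longer string length, and the function names are illustrative assumptions.

```python
# Sketch of a combined string similarity: edit distance + longest common
# subsequence + longest common substring. Weights and normalization are
# illustrative assumptions, not the thesis's exact formula.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (not necessarily contiguous)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest common contiguous substring."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def combined_similarity(a: str, b: str,
                        w1: float = 1/3, w2: float = 1/3, w3: float = 1/3) -> float:
    """Weighted combination of three normalized component similarities.
    The weights w1..w3 are illustrative, not the thesis's values."""
    if not a and not b:
        return 1.0
    longest = max(len(a), len(b))
    sim_ed = 1 - edit_distance(a, b) / longest
    sim_lcs = lcs_length(a, b) / longest
    sim_sub = longest_common_substring(a, b) / longest
    return w1 * sim_ed + w2 * sim_lcs + w3 * sim_sub
```

Normalizing each component by the length of the longer string keeps the combined score in [0, 1], so it can be compared directly against a deduplication threshold when deciding whether two records are similar duplicates.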
For semantic word similarity detection, an improved HowNet-based method is proposed. It addresses weaknesses of the commonly used HowNet-based approach: on the reachable path between two sememes in the same sememe tree, the traditional method does not adequately consider the influence of sememe node density on the sememe distance, nor the relative importance of sememe depth versus sememe density, so the computed similarities can be inaccurate and the method's applicability is limited. To this end, the new method defines a new edge weight function between nodes: the density of every sememe node along the sememe path is introduced into the edge weight, and a weighting factor is used to balance the influence of sememe depth and sememe density on the sememe distance, thereby improving the accuracy of the similarity calculation. Experimental results show that the new method improves the accuracy of word semantic similarity calculation more effectively than existing methods and is more practical. A sketch of a depth- and density-aware sememe distance follows.
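The sketch below illustrates, under stated assumptions, how an edge weight depending on both sememe depth and node density can be summed along the path between two sememes and then mapped to a similarity. The SememeNode class, the definition of density as the number of children of the parent node, the edge_weight formula with its beta factor, and the alpha/(distance + alpha) mapping are all illustrative assumptions, not the thesis's exact definitions.

```python
# Sketch of a depth- and density-aware sememe distance on a sememe tree.
# Tree structure, density definition, and weight formula are assumptions.

from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality/hash so nodes can go in sets
class SememeNode:
    name: str
    parent: "SememeNode | None" = field(default=None, repr=False)
    children: list = field(default_factory=list, repr=False)
    depth: int = 0

    def add_child(self, name: str) -> "SememeNode":
        child = SememeNode(name, parent=self, depth=self.depth + 1)
        self.children.append(child)
        return child

def path_to_root(node: SememeNode) -> list:
    """All ancestors of `node`, including the node itself, up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def edge_weight(node: SememeNode, beta: float = 0.5) -> float:
    """Weight of the edge from `node` to its parent.

    Deeper edges and edges under dense parents (many children) are assumed
    to carry finer semantic distinctions, so they receive smaller weights.
    `beta` balances the influence of depth against density."""
    density = max(len(node.parent.children), 1) if node.parent else 1
    return beta * (1.0 / node.depth) + (1 - beta) * (1.0 / density)

def sememe_distance(a: SememeNode, b: SememeNode, beta: float = 0.5) -> float:
    """Sum of edge weights along the path a -> lowest common ancestor -> b."""
    ancestors_a = set(path_to_root(a))
    lca = b
    while lca not in ancestors_a:   # climb from b until we hit the LCA
        lca = lca.parent
    dist = 0.0
    for start in (a, b):
        node = start
        while node is not lca:
            dist += edge_weight(node, beta)
            node = node.parent
    return dist

def sememe_similarity(a: SememeNode, b: SememeNode, alpha: float = 1.6) -> float:
    """Distance-to-similarity mapping commonly used with HowNet-style methods."""
    return alpha / (sememe_distance(a, b) + alpha)
```

With this weighting, two sememes hanging deep under the same dense parent come out closer than two shallow sememes on sparse, separate branches, which is the intuition the density- and depth-aware edge weight is meant to capture.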
Keywords/Search Tags: Big data, data cleaning, similarity detection, edit distance, string similarity, HowNet, semantic similarity of words