
Research On Learning-based Entity Parsing Methods In Data Warehouses

Posted on: 2018-02-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y H Fan
Full Text: PDF
GTID: 2358330518460439
Subject: Computer application technology
Abstract/Summary:
Entity resolution is a redundancy-identification technique for data quality management in data warehouses. As data volumes continue to grow, the low efficiency and low accuracy of traditional entity resolution methods become increasingly pronounced. This thesis surveys the causes and impact of data quality problems in data warehouses, analyzes the related theory of data warehouses and data quality, and reviews work in China and overseas together with the main research methods of entity resolution. It focuses on the principle, basic model, module design, and evaluation criteria of entity resolution algorithms for massive data. To maximize recognition accuracy and minimize time complexity, a learning-based parallel entity resolution algorithm is studied on the data sources of a tobacco group's data center and validated by experiments. The main research contents are as follows:

(1) The Canopy threshold is determined from the similarity of key attributes in each tuple, and Canopy clustering is applied to the massive entity set as an initial blocking pass. Tuples fall into overlapping subsets, which increases the fault tolerance of the algorithm.

(2) After the data is blocked, candidate sets of similar entities must be compared, so tuple similarity is computed by combining position coding with the TF-IDF algorithm. Position coding makes abbreviated words easier to identify, while TF-IDF assigns higher weights to the words in an attribute string that can distinguish between classes, though it is insensitive to character order. Combining the two techniques yields the feature vectors used for matching.

(3) Because the relationship between tuple similarity and attribute similarity is nonlinear, a neural network, which can approximate nonlinear functions with arbitrary precision, is used to learn weights, thresholds, and other parameters dynamically from the intrinsic relationships between attributes and to decide whether two tuples match. An ant colony optimization algorithm is applied to mitigate the slow convergence and the tendency toward local optima in neural network training. This compensates for the deficiency of traditional entity matching, which judges whether tuples belong to the same entity by comparing a weighted sum of attribute similarities against a manually chosen threshold.

(4) Finally, the analysis of massive entities is parallelized on the Hadoop infrastructure, and the methods and framework are evaluated on supplier data from the data center. Compared with traditional entity resolution algorithms on precision, recall, and F1, the learning-based algorithm achieves higher recognition accuracy, and recognition efficiency also improves greatly as the number of nodes increases.
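The Canopy blocking step in (1) can be illustrated with a minimal sketch. The thresholds, the `key` attribute name, and the use of Jaccard token overlap as the similarity measure are assumptions for illustration; the thesis derives its thresholds from key-attribute similarity and does not specify this exact measure.

```python
import random

def jaccard(a, b):
    """Token-overlap similarity of two attribute strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def canopy_blocking(records, key, t_loose=0.3, t_tight=0.7, seed=0):
    """Group records into overlapping canopies by a key attribute.

    Since we use similarity (not distance), a record joins a canopy when
    sim >= t_loose and leaves the candidate pool when sim >= t_tight.
    Canopies may overlap, giving the fault tolerance the abstract describes:
    a tuple missed in one canopy can still be matched in another.
    """
    rng = random.Random(seed)
    pool = list(range(len(records)))
    canopies = []
    while pool:
        center = pool[rng.randrange(len(pool))]
        sims = {i: jaccard(records[i][key], records[center][key]) for i in pool}
        canopies.append([i for i in pool if sims[i] >= t_loose])
        pool = [i for i in pool if sims[i] < t_tight]  # center always removed
    return canopies
```

Only tuples sharing a canopy are compared pairwise afterwards, which is what reduces the quadratic comparison cost on massive data.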
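The TF-IDF weighting in (2) can be sketched as follows; the tokenization, the `log(N/df)+1` IDF variant, and cosine as the comparison function are illustrative assumptions, and the position-coding component is omitted here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight tokens by TF-IDF so class-distinguishing words count more."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency: one count per document
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    return [{t: c * idf[t] for t, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    """Cosine similarity of two sparse TF-IDF vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Because the vectors are bags of weighted tokens, this measure is insensitive to character order, which is exactly the weakness the thesis pairs with position coding.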
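The learned matcher in (3) replaces a hand-set threshold with parameters fitted from labeled pairs. As a simplified stand-in for the thesis's ACO-trained neural network, the sketch below trains a single logistic unit by gradient descent on per-attribute similarity vectors; the learning rate, epoch count, and toy feature layout are assumptions.

```python
import math
import random

def train_matcher(pairs, labels, epochs=500, lr=0.5, seed=0):
    """Fit weights and bias of a logistic match classifier.

    pairs  : list of per-attribute similarity vectors, e.g. [0.9, 0.8]
    labels : 1 for a true match, 0 for a non-match
    """
    rng = random.Random(seed)
    n = len(pairs[0])
    w = [rng.uniform(-0.1, 0.1) for _ in range(n)]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(pairs, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Return True if the pair is classified as a match."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5
```

The decision boundary (weights and bias) is thus learned from data rather than fixed by an artificial threshold, which is the deficiency of traditional matching that point (3) addresses; the thesis additionally uses ant colony optimization to escape local optima during training.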
Keywords/Search Tags:Data Warehouse, Data Quality, Entity Resolution, Autonomous Learning, Parallel Computing