Implementation of an Entity Recognition System Based on the Hadoop Platform

Posted on: 2013-02-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y L Bi
Full Text: PDF
GTID: 2218330374954330
Subject: Computer application technology
Abstract/Summary:
With the development of science and technology in recent years, people accumulate and abstract more and more data from production and other activities, and traditional information systems are no longer fit for handling and computing on the existing data. Research indicates that the physical performance of computers has basically reached its maximum and that Moore's Law is gradually losing effectiveness. Researchers have proposed many kinds of solutions to this problem. One is to improve computer performance by changing the CPU: for example, existing computers have been extended to multi-core processors, which has succeeded in practical applications. But this still does not meet the needs of computing on massive data. The concept of commercial cloud computing was first proposed by Google in 2007, which further set in motion the study of cloud computing in the computing field. Existing cloud computing systems organize a group of inexpensive computers that cooperate and interconnect, and the performance achieved is the same as that of an expensive super-server.

This paper addresses data quality technology for cloud computing. Using the Hadoop platform's implementation of the MapReduce programming model, it designs and implements an entity recognition system for large data sets that uses conditional functional dependency constraints for data filtering. The main content and contributions are as follows:

(1) Analysis and study of cloud computing. Because there is little work on data quality in cloud computing, this paper works on the entity recognition problem based on detecting inconsistencies with conditional functional dependencies in a cloud environment.

(2) Analysis of the MapReduce framework, the Hadoop platform, and the Hadoop Distributed File System. Through a study of job scheduling on the Hadoop platform, this paper proposes a method for merging input jobs through a mechanism for sharing the scan of input data and a mechanism for sharing map output. Using this method, it gives a solution for checking conditional functional dependencies on large datasets. The proposed algorithms reduce the number of input jobs, which can decrease the cost of reading and scanning input data, and cut down on the intermediate data output by the map phase.

(3) Implementation of a Hadoop platform in a virtual environment, and detection of inconsistencies in real data about genes and proteins using the algorithms presented in this paper. The results show that this is an effective method for checking dirty data in cloud computing.
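The idea of checking several conditional functional dependencies (CFDs) in one scan of the input can be illustrated with a small sketch. This is not the thesis's algorithm or Hadoop code: it is a plain-Python simulation of the map, shuffle, and reduce phases, and the attribute names, the two CFDs, the `_` wildcard convention, and the toy records are all assumptions made up for illustration. Each record is read once, the map step emits a key for every CFD whose pattern it matches, and the reduce step reports a group as inconsistent when it holds more than one distinct right-hand-side value.

```python
from collections import defaultdict

# Hypothetical CFDs: (name, LHS attributes, RHS attribute, pattern on the LHS).
# A pattern value of "_" is a wildcard that matches any value.
CFDS = [
    ("cfd1", ("organism", "gene"), "chromosome", {"organism": "human", "gene": "_"}),
    ("cfd2", ("protein",), "gene", {"protein": "_"}),
]

# Made-up toy records; the identifiers are placeholders, not real data.
RECORDS = [
    {"organism": "human", "gene": "G1", "protein": "P1", "chromosome": "17"},
    {"organism": "human", "gene": "G1", "protein": "P1", "chromosome": "11"},
    {"organism": "mouse", "gene": "G2", "protein": "P2", "chromosome": "11"},
    {"organism": "mouse", "gene": "G2", "protein": "P2", "chromosome": "10"},
    {"organism": "human", "gene": "G3", "protein": "P3", "chromosome": "17"},
    {"organism": "human", "gene": "G4", "protein": "P3", "chromosome": "13"},
]


def map_record(record):
    """Emit one (key, value) pair per CFD whose pattern the record matches."""
    for name, lhs, rhs, pattern in CFDS:
        if all(pattern[attr] in ("_", record[attr]) for attr in lhs):
            key = (name, tuple(record[attr] for attr in lhs))
            yield key, record[rhs]


def reduce_group(key, values):
    """Report the group if its records disagree on the RHS attribute."""
    distinct = sorted(set(values))
    if len(distinct) > 1:
        yield key, distinct


def run(records):
    """Simulate map, shuffle, and reduce with a single pass over the input."""
    groups = defaultdict(list)
    for record in records:  # one shared scan serves every CFD
        for key, value in map_record(record):
            groups[key].append(value)  # shuffle: group values by key
    violations = {}
    for key, values in groups.items():
        for violated_key, distinct in reduce_group(key, values):
            violations[violated_key] = distinct
    return violations


violations = run(RECORDS)
for key, values in sorted(violations.items()):
    print(key, values)
```

In this sketch the mouse records disagree on `chromosome` but are not reported, because the hypothetical `cfd1` applies only to records whose `organism` is `human`; that conditioning is what distinguishes a CFD from an ordinary functional dependency.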
Keywords/Search Tags: entity recognition, data quality, Hadoop, cloud computing, conditional functional dependency