The Research And Application Of Data Fusion Technology In Cloud

Posted on: 2016-05-02
Degree: Master
Type: Thesis
Country: China
Candidate: S P Pang
Full Text: PDF
GTID: 2308330473466206
Subject: Computer application technology
Abstract/Summary:
With the rapid development of network technology, the volume of unstructured and semi-structured data keeps increasing. An IDC report indicates that unstructured data accounts for 80% of enterprise data and grows by 60% annually. If structured data records the development and transaction activity of an enterprise, unstructured data is the key to improving its competitiveness, so the study of unstructured data is urgent. With large amounts of data now held in the cloud, companies hope to find a suitable data processing mode through big data analysis and forecasting platforms. The prerequisite for this is data fusion, which can greatly improve data quality and lay a solid foundation for subsequent data analysis and mining.

However, unstructured data takes many forms, is fragmented, and cannot readily interoperate. This leaves the following problems to be solved in data fusion research:
(1) Different data sources describe the same data differently. The differences include attributes listed in a different order, spelling errors in the same attribute, and inconsistent choices of which attributes to describe.
(2) Data may be incomplete, obsolete, incorrect, or in conflict with false values. To ensure the accuracy of the analyzed data, data fusion must resolve conflicts among data from multiple sources.

Data fusion both assures data quality and is a precondition for data analysis. This thesis investigates several key problems of data fusion.
The main work and contributions can be summarized as follows:
(1) To handle missing entity description attributes and the many variants of an entity, this thesis proposes a learning-based entity coreference method on Hadoop, which improves the efficiency and accuracy of entity resolution. Current state-of-the-art approaches are learning-based and treat entity resolution as a classification problem in which each entity pair is classified as either a match or a non-match. To this end, a suitable classifier is learned from labeled examples of matching and non-matching pairs. The pairwise similarity values (one for each matcher) serve as features for the classification. Each matcher must be evaluated on every input entity pair, so in general entity coreference has to process the Cartesian product of the data sources R and S. To address this cost and reduce the time needed to train and apply the classifier, we use MapReduce to parallelize the pairwise similarity computation.
(2) To handle data that is incomplete, obsolete, incorrect, or in conflict with false values, this thesis proposes a Bayesian data conflict resolution method on MapReduce, which effectively resolves data conflicts during data integration. We combine a truth finder with a machine learning strategy, and improve accuracy by combining data quality assessment with Bayesian inference. Experiments on multiple data sets show that the method resolves data conflicts with high efficiency and accuracy.
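The pairwise feature construction described in (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the two matchers (`edit_similarity`, `token_jaccard`), the function names `map_pair` and `pairwise_features`, and the sample records in `R` and `S` are all hypothetical, and the built-in `map` stands in for a real MapReduce map phase.

```python
from difflib import SequenceMatcher
from itertools import product


def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for one matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens; a second matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical matcher set: one feature per matcher, as in the abstract.
MATCHERS = (edit_similarity, token_jaccard)


def map_pair(pair):
    """Map step: turn one (r, s) pair into a keyed feature vector."""
    (r_id, r_text), (s_id, s_text) = pair
    features = tuple(m(r_text, s_text) for m in MATCHERS)
    return (r_id, s_id), features


def pairwise_features(R, S):
    """Apply the map step to every pair in the Cartesian product R x S."""
    return dict(map(map_pair, product(R.items(), S.items())))


# Made-up example records, for illustration only.
R = {"r1": "Apple iPhone 6 16GB", "r2": "Samsung Galaxy S6"}
S = {"s1": "apple iphone 6 16gb", "s2": "Galaxy S6 by Samsung"}
features = pairwise_features(R, S)
```

Each call to `map_pair` depends only on its own pair, which is what makes the similarity computation easy to distribute across mappers. The resulting feature vectors, together with match/non-match labels, would then be the input to whatever classifier is trained.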
Keywords/Search Tags:data fusion, entity resolution, data conflict resolution, cloud