
Key Technologies Research On Personalized Web Business Information Fusion System

Posted on: 2011-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: W B Su
Full Text: PDF
GTID: 2198330332978554
Subject: Computer application technology
Abstract/Summary:
With the continuous development of Internet technology, users are no longer content with simple information retrieval; they need technology that can extract the information they are interested in from the results current retrieval technology returns. Information fusion systems aim to help users with both information retrieval and information extraction, and have become a current research focus with a wide range of applications. The key technologies of Web information fusion are information crawling, information extraction, data cleaning, information retrieval, and information storage. In conjunction with a major science and technology project undertaken by the author, this paper focuses on information extraction and data cleaning.

First, the paper introduces the research background of the information fusion module of the Personal Information Push Services Project and points out the existing problems of robustness and data quality in the two key technologies: low matching accuracy and low extraction efficiency in information extraction, and low data quality in data cleaning. This section closes with the main work and structure of the paper.

The second part describes the key technologies, models, and standards used in the information fusion project. Because the project must process huge volumes of Web information, the system uses the distributed processing framework Hadoop, which is introduced here. This part also surveys the current state of research on information extraction and data cleaning at home and abroad.

The third part proposes a dynamic Anchor-Hop model, built on the Anchor-Hop model, to address the low efficiency and low matching accuracy of the original model, which locates anchor points by content and attributes.
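To make the anchor-based idea concrete, the sketch below illustrates extraction in the spirit of the Anchor-Hop model: an "anchor" node in the page tree is located by its text content, and the target value is reached by hopping a fixed number of sibling steps. The page snippet, label text, hop count, and function name are invented for illustration; the thesis's dynamic model additionally uses attributes to locate anchors, which this minimal sketch does not attempt.

```python
# Hypothetical sketch of anchor-plus-hop extraction from a page fragment.
# The snippet, labels, and hop count are illustrative, not from the thesis.
import xml.etree.ElementTree as ET

SNIPPET = """
<tr>
  <td class="label">Price:</td>
  <td class="value">128.00</td>
</tr>
"""

def extract_by_anchor(xhtml: str, anchor_text: str, hops: int = 1) -> str:
    """Find the element whose text equals `anchor_text`, then move
    `hops` siblings to the right and return that element's text."""
    root = ET.fromstring(xhtml)
    for parent in root.iter():
        children = list(parent)
        for idx, child in enumerate(children):
            if (child.text or "").strip() == anchor_text:
                target = children[idx + hops]  # hop to the target node
                return (target.text or "").strip()
    raise LookupError(f"anchor {anchor_text!r} not found")

print(extract_by_anchor(SNIPPET, "Price:"))  # → 128.00
```

The appeal of anchoring is that label text ("Price:") tends to stay stable across template changes even when the surrounding markup shifts, so locating the anchor first and hopping a short, fixed path is more robust than an absolute path from the document root.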
In extraction experiments, it is 30% faster than the Anchor-Hop model, and its extraction accuracy is also higher. For the data cleaning problem, the paper first reviews existing research on data validation, determines the validation method used in the information fusion system, and proposes feeding validation results back to the extraction system to improve its reliability. Second, it focuses on the SNM algorithm for eliminating duplicate records and a field-matching algorithm based on edit distance. On the basis of these two algorithms, the paper proposes an improved algorithm, SSNM: the key string is first split into word segments; the segments are sorted and merged into a new string; all records are then sorted on this new string; and the SNM algorithm is applied to detect duplicates. When computing the similarity of two records, the edit distance is calculated on the new strings. Experimental results show that SSNM achieves a better recall rate than SNM. This part ends with a detailed description of the design and implementation of the SSNM algorithm on the Hadoop framework.

Finally, the paper introduces the overall framework of the information fusion system, the detailed architecture of each sub-module, and the key technologies used in implementing each sub-module.
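The SSNM procedure described above can be sketched in a few lines: the key is split into words, the words are sorted and rejoined, records are sorted on the rebuilt key, and a fixed-size sliding window compares neighbors by edit-distance similarity. This is a single-machine sketch under assumed names (`ssnm_duplicates`, `normalize_key`, the window size and threshold), not the thesis's Hadoop implementation.

```python
# Minimal single-machine sketch of the SSNM idea; function names,
# window size, and threshold are illustrative assumptions.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def normalize_key(key: str) -> str:
    """Split the key into words, sort them, and merge into a new string,
    so reordered fields ('Su Wenbo Beijing' vs 'Beijing Su Wenbo') align."""
    return " ".join(sorted(key.split()))

def ssnm_duplicates(records, window=4, threshold=0.8):
    """Return index pairs of likely duplicates among `records` (key strings)."""
    keyed = sorted(enumerate(records), key=lambda ir: normalize_key(ir[1]))
    pairs = []
    for w in range(len(keyed)):
        # only compare each record with its neighbors inside the window
        for v in range(max(0, w - window + 1), w):
            i, a = keyed[v]
            j, b = keyed[w]
            na, nb = normalize_key(a), normalize_key(b)
            sim = 1 - edit_distance(na, nb) / max(len(na), len(nb), 1)
            if sim >= threshold:
                pairs.append(tuple(sorted((i, j))))
    return pairs

dups = ssnm_duplicates(["Su Wenbo Beijing", "Beijing Su Wenbo",
                        "Zhang Li Shanghai"])
print(dups)  # → [(0, 1)]
```

Sorting on the word-sorted key is what lifts recall over plain SNM: duplicates whose fields appear in different orders sort far apart under the raw key, so SNM's window never compares them, whereas under the normalized key they become neighbors.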
Keywords/Search Tags: Information Extraction, Anchor-Hop Model, SNM Algorithm, Similar Duplicate Data, Data Cleaning