
Research And Implementation Of Subject-Oriented Structured Data Integration On Multiple Web Sources

Posted on: 2012-10-29
Degree: Master
Type: Thesis
Country: China
Candidate: Q Yu
GTID: 2268330425991585
Subject: Computer software and theory
Abstract/Summary:
With the continued development of the Internet, Web data has gradually become a focus of attention. The Web holds a large amount of valuable data, including structured data. Structured data here means data that is extracted from Web pages by specific rules and stored in a database. How to integrate structured data from multiple Web data sources has therefore become a hot topic. Structured data on the Web is a kind of Web data, so it has the features of Web data; it is stored in a database, so it has the features of database data; and it concerns a particular domain, so it has the features of that domain. These features make data integration difficult, and the purpose of this thesis is to address those difficulties.

Based on a study of Web data integration and of the features of structured data on the Web, the thesis proposes a new method of Web data integration that uses a domain knowledge base. A domain knowledge base provides a firm foundation for sharing domain knowledge and for knowledge-based reasoning. The thesis focuses on the following aspects.

First, according to the features of structured data from multiple Web data sources, the thesis presents methods for building a domain knowledge base and builds a core domain knowledge base. Using the mobile phone domain as an example, the thesis applies sampling and statistical methods to obtain the core domain knowledge without professional guidance from experts. To solve the problem of semantic heterogeneity in mobile phone data from multiple Web data sources, the thesis proposes an ordered prefix tree-based clustering method and uses it to mine the relationships between items of domain knowledge.
On the basis of the domain knowledge and the relationships between items of domain knowledge, the thesis constructs an ontology-based hierarchical categorization system for the domain knowledge base.

Furthermore, the thesis implements the integration of structured data from multiple Web sources on top of the domain knowledge base. Using mobile phone data integration as an example, the integration process includes data loading, data preprocessing, entity identification, duplicate entity merging, and entity output. For duplicate entity merging, the thesis proposes word segmentation and word composition methods for merging text data, defines synonyms and antonyms, improves the Jaccard coefficient for calculating text similarity, and proposes two methods for merging duplicate entities: a Web data resource-based merging method and a similarity-based merging method.

Finally, in view of the features of Web data, the thesis proposes that the problem of massive-scale Web data integration can be handled on the MapReduce framework.

A prototype system has been implemented and experiments have been carried out; the experimental results demonstrate the effectiveness and superiority of the proposed approach.
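As an illustration of the text similarity step mentioned above, the following is a minimal sketch of the standard Jaccard coefficient, |A ∩ B| / |A ∪ B|, computed over token sets after mapping synonyms to a common form. The abstract does not specify how the thesis improves the coefficient, so this is not the thesis's method; the whitespace tokenizer, the `jaccard` function, and the `SYNONYMS` table are assumptions made only for this example.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real word segmentation)."""
    return text.lower().split()


def jaccard(a, b, synonyms=None):
    """Plain Jaccard coefficient over token sets, with optional synonym mapping.

    `synonyms` is a hypothetical dict mapping a token to its canonical form.
    """
    synonyms = synonyms or {}
    set_a = {synonyms.get(t, t) for t in tokenize(a)}
    set_b = {synonyms.get(t, t) for t in tokenize(b)}
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical synonym table for the mobile phone domain.
SYNONYMS = {"cellphone": "phone", "handset": "phone"}

left = "Nokia N9 handset black"
right = "nokia n9 cellphone black"
print(jaccard(left, right))            # 3 shared tokens out of 5 distinct
print(jaccard(left, right, SYNONYMS))  # all 4 tokens match after mapping
```

Mapping synonyms before comparison raises the score for records that describe the same entity in different words, which is the kind of case duplicate entity merging has to handle.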
Keywords/Search Tags: Hadoop, MapReduce, Cassandra, Ontology, Domain Knowledge Base, Data Integration, Multiple Data Sources