Research On The Issues Of Semantic Annotation Based Automatic Metadata Construction

Posted on: 2011-12-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H X Liu
Full Text: PDF
GTID: 1118360305999629
Subject: Systems analysis and integration
Abstract/Summary:
Metadata has been widely used in information retrieval, information integration, information sharing, and related areas as an important means of coping with the network information "explosion". There is no question that the quality of the metadata itself determines the ultimate success or failure of metadata application services. To improve metadata quality, academia and industry have pursued research mainly along three lines. First, metadata standards: establishing a unified metadata standard helps ensure consistency and integrity and enables standardized interaction, a point widely recognized by researchers. Second, metadata construction and management: work on metadata schema discovery, schema transformation, control strategies, administration mechanisms, and many other aspects is well under way. Third, metadata quality assessment: academic discussion has focused on evaluation indicators, evaluation methods, evaluation use cases, and so on. From the current literature, we found that existing research mostly starts from manually created metadata and concentrates on the effectiveness and convenience of creation tools. From the standpoint of metadata creators and users, however, this gives rise to the following problems. Creators face a large number of data sets in diverse forms and must spend considerable effort until they understand the contents of those data sets in depth.
This is cumbersome, heavy work. In addition, different creators understand the data differently, which can make the resulting metadata ambiguous. Users, for their part, need a correct understanding of the predefined metadata; otherwise a knowledge "gap" opens between creators and users, and users cannot effectively query information on demand.

To solve these problems and build high-quality metadata services, this paper presents a method that builds metadata automatically from the semantic annotation already present in data sets. The method builds metadata efficiently and borrows the idea of knowledge sharing: it explores the feasibility of eliminating the subjective perception "gap" by using semantic annotation from multiple angles, together with strategies for identifying metadata in different structure views. On this basis, the paper further studies the heterogeneity of metadata schemas and proposes a schema matching method for their semantic integration. To validate its applicability, the paper proposes a metadata query method that improves precision and limits the loss of results caused by low recall. The experimental designs and test data sets are drawn from the field of archive information resources, chosen for its unique value and its important position among basic information resources [1]. Specifically, our studies cover the following aspects:

(1) We propose SAMC, a method for automatically constructing metadata, based on an analysis of the two main metadata extraction approaches: template-based and machine learning-based.
This method overcomes the shortcomings of those two approaches: it not only identifies metadata effectively from existing semantic annotation but also combines statistical theory with the structural features and visual layout characteristics of information, which underpins the performance of SAMC. As a result, the method has higher precision and greater expressive power, and can meet the requirements of building high-quality metadata.

(2) We propose algorithms for identifying metadata in different layout patterns. To make the method more feasible, the paper takes the differences between structure views into account, focuses on the distinct characteristics of the summary-detail, iterative, and integrated sequence patterns, and designs corresponding metadata identification algorithms. The algorithms exploit the hierarchy of tree structures, the order of linear structures, and information characteristics such as frequency distribution, and achieve good results in metadata identification.

(3) We put forward PISMatching, a schema matching method for attribute-level integration of metadata schemas. Compared with related work, this research faces new issues: enriching the semantics of metadata schemas and merging metadata schemas from multiple data sources. The paper combines ontology with a thesaurus and concept similarity to integrate their respective advantages, and performs better in terms of implementation difficulty, complexity, semantic richness, and so on. Ontology provides strong domain context for improving matching accuracy, while concept similarity based on related information and probability provides a new metric for schema matching: it can uncover positively correlated properties to form potential property groups while also retaining groups of synonymous properties.
In the concrete design, the paper pays more attention to the ranking of matches than to the gaps between calculated values, which is more meaningful in practical applications. It also emphasizes capturing the available information and reducing dependence on any specific schema, which makes the approach more flexible, more scalable, and more widely usable.

(4) We propose MFCQuery, a metadata query method that measures field context. To further improve precision and recall over traditional methods, MFCQuery extends them in two ways. First, it builds a similarity matrix between the user query and the metadata field contexts using the vector space model, and determines the real query intent from the similarity between field context and user query, which improves recall. Second, because some users cannot supply the necessary metadata fields in a query, possibly for lack of sufficient background knowledge, it matches the most relevant target field and uses it to restrict the query, which improves precision. The method thus ensures high precision while further enhancing recall.

(5) We refine the evaluation of metadata. All the work in this paper aims to improve metadata quality so that metadata can play a greater role in specific applications; the paper therefore selects the archive information domain as the target application for its experiments. We believe that metadata quality cannot be reflected simply by classic information technology indicators such as recall and precision. The paper therefore refines the evaluation indicators and evaluates objects with different characteristics in a more fine-grained way, which reveals the impact of different methods on metadata quality at a more detailed level.

In summary, this paper makes an in-depth study of metadata-related technologies from the above aspects, using rules, statistics, probability, and other methods.
It addresses key issues in metadata construction and improves the precision and recall of metadata generation; it enhances the ability to integrate different metadata schemas; and it improves the performance of users' active queries, raising both recall and precision. Through these efforts, we achieved a series of research results.
Keywords/Search Tags: Metadata, Data Management, Information Extraction, Schema Matching, Information Retrieval
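
The field-matching idea behind MFCQuery, item (4) of the abstract, can be illustrated with a minimal Python sketch. It assumes raw term-frequency vectors and cosine similarity as the vector space model; the abstract does not state the weighting scheme, and the function names, field names, and context strings below are hypothetical illustrations, not taken from the dissertation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_fields(query, field_contexts):
    """Rank metadata fields by similarity between the query and each field's context.

    field_contexts maps a field name to a text describing that field
    (assumed here to be available; how it is gathered is not specified).
    """
    q = Counter(tokenize(query))
    scores = {
        name: cosine(q, Counter(tokenize(context)))
        for name, context in field_contexts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical archive metadata fields and their contexts.
field_contexts = {
    "title": "title name heading of the archive document",
    "creator": "author creator person who wrote the document",
    "date": "date year time when the document was filed",
}

ranking = rank_fields("who wrote this report", field_contexts)
best_field = ranking[0][0]  # field used to restrict the query
```

Taking the top-ranked field to restrict the query corresponds to the precision-oriented step described in the abstract; keeping the full ranking rather than only the best score reflects its emphasis on ranking over absolute similarity values.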