
Research On Semantic Annotation For Domain Documents

Posted on: 2010-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: L H Sha
GTID: 2178360272496060
Subject: Computer application technology
Abstract/Summary:
The Semantic Web adds formal semantics (metadata, knowledge) to web content so that it can be accessed and managed more efficiently. Semantic retrieval means that a retrieval system can understand users' intent and identify the semantics of web documents, and can therefore return the information users actually care about. For both the Semantic Web and semantic retrieval, the foundation is adding semantic information to web documents. Semantic Annotation is a technology for adding semantics to web documents. This semantic information is acquired on the basis of an ontology, so the accuracy of Semantic Annotation depends on the maturity of the ontology. However, the real world contains a vast range of concepts and vocabulary, so building an ontology that covers all of them is difficult, if not impossible. Because research on Semantic Annotation requires ontology support, the goal should be narrowed to a single domain: Semantic Annotation for domain documents.

Many organizations and individuals in China and abroad are engaged in research on Semantic Annotation and have produced annotation algorithms such as LP2, IASA and C-PANKOW. However, most of these methods were proposed for Western-language documents and usually require a semantic disambiguation step. They are more or less problematic when applied to Chinese documents of a specific domain.

This paper analyses the characteristics of domain documents and proposes a Semantic Annotation method for domain documents based on these characteristics. It mainly discusses new instance identification, triple relationship identification and subject description of web documents. It proposes identifying new instances with predefined rules, and the OCRNIP method (Only Check the Relations between Nearby Instances and Properties).
From the standpoint of practical application, this paper also studies a method for describing the subject of a web document. Finally, the proposed method, SAMDD, is applied to the corn domain, and DDSAS, a prototype system for Semantic Annotation, is developed. The annotation results are also analysed.

Domain documents usually share some typical characteristics, so a Semantic Annotation method suited to them can be built on these features. After analysing many domain documents, we conclude that they typically have the following characteristics: domain sentences are abundant, professional terminology is frequent, semantic ambiguities are few, and domain vocabulary tends to be used in certain patterns. It follows that semantic disambiguation can be omitted when annotating domain documents, and that rules for identifying vocabulary to be annotated can be derived from the patterns in which domain vocabulary is used. Based on this analysis, this paper proposes a Semantic Annotation method for domain documents (SAMDD). Its main idea is to identify instances and relationships using predefined rules. SAMDD has eight steps: (1) remove HTML tags and extract the content of the web document; (2) generate a user-defined dictionary from the ontology; (3) perform Chinese word segmentation with the help of the user-defined dictionary; (4) recognize concepts, properties and instances; (5) identify new instances based on predefined rules; (6) update the user-defined dictionary; (7) match triple relationships and generate RDF documents; (8) describe the subject of the document.

Because domain vocabulary is always used in certain patterns, a predefined rule set can be constructed by extracting these patterns to help identify new instances.
The newly recognized instances are not only annotated in the documents but also added to the user-defined dictionary, which enriches the dictionary's vocabulary and saves time the next time new instances are identified.

For matching 'resource-property-property value' triple relationships, this paper proposes the OCRNIP method (Only Check the Relations between Nearby Instances and Properties). The method first searches the document for an instance, then searches for properties starting from the end of the recognized instance, matching the relationship between that instance and each property until the next instance is encountered. OCRNIP overcomes the mismatch problem of methods that match resources and properties against each other.

To express the subjects of web documents more accurately and intuitively, this paper proposes a subject description method for web documents, SDMWD. The method assigns a weight to each vocabulary item that appears in the RDF file and initializes it with the frequency of that item in the document. It then uses the relationships between vocabulary items defined in the ontology to expand the set of items that can represent the document's subject and to update each item's weight. Finally, it ranks the vocabulary items by weight to express the degree to which each represents the document's subject.

To check the effectiveness of SAMDD, a corn ontology was constructed and the Semantic Annotation prototype system DDSAS was developed by applying SAMDD to the corn domain. Fifty documents from "China Corn Net" were annotated with DDSAS. This paper uses precision, recall and F1-measure to assess the annotation results from two aspects: new instance identification and triple relationship matching.
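As a small arithmetic sketch of the measures named here: the F1-measure is the harmonic mean of precision and recall. The inputs below are the precision and recall percentages reported in this abstract; the function name `f1_measure` is an assumption for illustration and is not taken from DDSAS.

```python
def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit in and out)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both measures are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in this abstract.
f1_new_instances = f1_measure(70.00, 44.49)
f1_triples = f1_measure(94.12, 63.29)

print(f"{f1_new_instances:.2f}")  # 54.40
print(f"{f1_triples:.2f}")        # 75.69
```

Both values agree with the F1 figures the abstract reports, so the three measures are internally consistent.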
An experiment comparing SAMDD with C-PANKOW shows that the precision, recall and F1-measure of SAMDD for new instance identification are 70.00%, 44.49% and 54.40% respectively, better than those of C-PANKOW. The assessment of triple relationship identification shows a precision of 94.12%, a recall of 63.29% and an F1-measure of 75.69%, which can satisfy practical needs.

The SAMDD method proposed in this paper can annotate RDF triples and describe the subject of a web document. A retrieval system can therefore sort returned results by the vocabulary items and weights annotated in the RDF documents, which greatly simplifies result ranking. However, some problems remain. The predefined rule set is not yet complete because only a limited number of documents were consulted; more documents need to be analysed to find more rules. In addition, the instances and properties defined in the ontology are not rich enough and need to be updated constantly, but adding information to the ontology by hand is clearly not ideal. Expanding the ontology's knowledge during Semantic Annotation with the help of "Ontology Learning" is the goal of future work.
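The OCRNIP matching step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes the tokens are already segmented and tagged, that the ontology is reduced to a set of allowed (instance, property) pairs, and that a property's value is the next token tagged "value". The names `ocrnip_match`, `allowed`, and the sample data are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (text, tag); tag is "instance", "property", "value" or "other"
Triple = Tuple[str, str, str]


def ocrnip_match(tokens: List[Token], allowed: Set[Tuple[str, str]]) -> List[Triple]:
    """Pair each property only with the nearest preceding instance."""
    triples: List[Triple] = []
    current_instance: Optional[str] = None
    pending_property: Optional[str] = None
    for text, tag in tokens:
        if tag == "instance":
            # A new instance ends the search scope of the previous one.
            current_instance = text
            pending_property = None
        elif tag == "property" and current_instance is not None:
            # Keep the property only if the (assumed) ontology allows it.
            ok = (current_instance, text) in allowed
            pending_property = text if ok else None
        elif tag == "value" and pending_property is not None:
            triples.append((current_instance, pending_property, text))
            pending_property = None
    return triples


# Hypothetical sample data, not taken from the thesis corpus.
allowed = {
    ("VarietyA", "growth_period"),
    ("VarietyA", "plant_height"),
    ("VarietyB", "growth_period"),
}
tokens = [
    ("VarietyA", "instance"), ("has", "other"),
    ("growth_period", "property"), ("100 days", "value"),
    ("plant_height", "property"), ("250 cm", "value"),
    ("VarietyB", "instance"),
    ("growth_period", "property"), ("95 days", "value"),
]
result = ocrnip_match(tokens, allowed)
for triple in result:
    print(triple)
```

Because a property is only ever checked against the most recent instance, the second `growth_period` attaches to `VarietyB` and never to `VarietyA`, which is the kind of mismatch the nearby-only check is meant to avoid.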
Keywords/Search Tags: Domain Ontology, Predefined Rules, Semantic Annotation, Subject Description