Research On Semantic Annotation For Domain-Specific Web Pages

Posted on: 2012-06-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Jing
GTID: 1118330335450234
Subject: Computer application technology
Abstract/Summary:
The rapid development of web technology has brought explosive growth in web resources and made the World Wide Web the largest information repository in the world. Although the web provides vast amounts of information, it has increasingly exposed a serious problem: information overload. Information is abundant while the means of acquiring it are relatively scarce, which makes it difficult for people to obtain valid knowledge. To tackle this problem, people use web information retrieval technology (for example, search engines) and automated agents based on information extraction. However, the lack of machine-understandable semantics in web content limits the efficiency of such software. The vision of the semantic web is to make web content machine-understandable. Achieving this vision will enable machines to make full use of the semantic information in web pages and to meet users' demands for knowledge effectively. Realizing it requires a large amount of web content that carries semantic metadata, but existing web pages contain very little. Adding semantic metadata to web pages is the subject of semantic annotation research. This research helps to narrow the gap between the current web and the semantic web and to realize the semantic web vision sooner, to improve the performance of search engines and bridge the knowledge gap between users and search engines during search, and to lower the development cost of automated agents while increasing their robustness and intelligence.
The thesis is financially supported by the Major Research Program of the National Natural Science Foundation of China under grant No. 60496321.
Based on an in-depth analysis of related research and existing methods, this thesis draws on a range of computer science theories and methods, including the semantic web, ontology engineering, natural language processing, machine learning and web mining, to study semantic annotation for domain-specific web pages. The results have been applied in the prototype system CRAB. The main research results and technical contributions of this thesis are as follows.
The thesis introduces and analyzes the state of the art of semantic annotation research and its related techniques. By comparing the current web with the vision of the semantic web, it points out the urgency and importance of research on semantic annotation. Starting from an analysis and definition of the concepts of semantics, annotation and semantic annotation, it describes the categories and development of annotation and reviews work related to semantic annotation. Ontology and ontology engineering, which are closely related to semantic annotation, are also introduced in depth. Together these form the groundwork for the subsequent research.
Building on existing ontology engineering methods, the thesis presents a four-phase method for constructing the domain ontology. The method is driven by research requirements and supports research groups working in a decentralized environment. The building process is divided into four phases: (1) building together, (2) local adaptation, (3) analysis and revision, and (4) release and update. The last three phases are performed in iterative cycles. After each cycle, a newer version of the domain ontology is released and the prototype of the domain ontology evolves. The method suits scenarios where users' needs change frequently and facilitates the rapid development of ontologies.
HowNet is an important knowledge base of common sense.
However, the free edition of HowNet lacks a programming interface, which makes it hard for researchers to use efficiently. The thesis therefore gives a technical solution for obtaining the interface, a valuable exploration in the reverse engineering of binary code. By analyzing the assembly code statically and tracing it dynamically, the thesis successfully extracts the function interface of HowNet and generates header files and libraries according to the function calling conventions. This work makes two contributions. First, it provides a programming interface for the HowNet software and facilitates research related to HowNet. Second, it is a useful reference example of how to make full use of legacy binary code in research, and especially of how to reuse binary code whose programming interface is undocumented.
Noting the similarity between two forms of knowledge representation, natural language sentences and RDF representations, the thesis proposes a methodological framework for the semantic annotation of Chinese web pages. The framework is guided by the domain ontology and employs statistical methods and natural language processing (NLP) technology. It comprises three phases: data preparation, identification and grouping.
In the data preparation phase, a focused crawler builds the repository of domain-specific web pages. The domain lexicon is constructed with a feature selection technique that extracts high-frequency, domain-relevant words from the repository. The words of the domain lexicon are then manually labeled with types corresponding to the concepts or properties of the domain ontology, which yields the type tagging gazetteer. In the identification phase, the thesis proposes an explicit property type tagging algorithm (EPTT).
Tagging types are divided into two kinds: ontology types and general types. The algorithm uses both rules and gazetteers to recognize the instances and properties in the text. Compared with conventional named entity recognition methods, it simplifies further processing by explicitly tagging words of property type. In the grouping phase, the thesis groups the words of each sentence according to their dependency relationships, introduces the concepts of dependency tree and dependency forest, and gives two algorithms: relation extraction based on the dependency tree (DTRE) and relation extraction based on the dependency forest (DFRE). The DTRE algorithm first uses NLP techniques to parse a given sentence and obtain the dependency relationships among its words, then constructs the dependency tree from them and generates Grammar Relation Triples (abbreviated grt). By combining the domain ontology with the type tagging results, the algorithm validates the grts. Each valid grt is converted into a knowledge triple (an RDF statement) that corresponds to the domain ontology, which completes the mapping from the natural language sentence to its RDF representation. The DFRE algorithm improves on DTRE and is designed mainly to handle long Chinese sentences. It decomposes a long sentence into clauses and constructs the dependency tree of each clause. After the dependency trees are combined into a dependency forest, the DTRE algorithm is called to perform relation extraction. The experimental results show that both methods are significantly more effective than a semantic annotation method based on the subject-verb-object grammatical relationship. In addition, an active learning approach based on an influence formula is presented to improve annotation performance.
The influence formula is defined in terms of two aspects: the difficulty of annotating a triple, and the influence that annotating it has on the other triples in the collection.
Noting that some sentence patterns occur frequently in domain articles, the thesis presents a semantic annotation method based on mining the frequent feature patterns of sentences. Following the theory of sequential pattern mining, the thesis defines the feature itemset, the feature item and the feature sequence, which are used in mining the frequent feature patterns of sentences. By defining feature items as word types and feature sequences as strings of type identifiers, a semantic abstraction of the original sentences is obtained. Based on these definitions, a methodological framework is proposed that consists of three phases: data preprocessing, pattern mining and rule processing.
In the data preprocessing phase, the thesis first extracts the words of property type from the type tagging gazetteer to build the feature word list. Using a defined formula for calculating the feature strength of a sentence, the feature sentences, whose feature strengths exceed a predefined threshold, are extracted from the whole sentence space. The corresponding feature sequence database is then constructed with the feature sequence generation algorithm.
In the pattern mining phase, the feature sequence database is processed by the proposed suffix-array-based sequential pattern mining algorithm to obtain the frequent feature patterns. This mining algorithm makes full use of the advantages of suffix arrays in processing long sequences.
The core idea is to convert the calculation of the supports of feature patterns in the feature sequence database into the calculation of the document frequencies of those patterns across the sequence documents.
In the rule processing phase, the thesis writes annotation rules according to the mined feature patterns and applies them to semantic annotation. The experimental results show that this method handles some domain-specific sentences effectively and avoids errors caused by the parser, which improves the precision of the annotation. Combining this method with the DFRE method significantly improves the performance of semantic annotation.
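The idea of treating pattern support as document frequency can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give. It is a minimal Python illustration under stated assumptions: each feature sequence is treated as one "document", patterns are contiguous runs of type identifiers, and the type identifiers (`DRUG`, `TREATS`, and so on) and all function names are hypothetical. For brevity the suffix array stores the suffixes themselves, where a practical implementation would store offsets.

```python
def build_suffix_array(sequences):
    """Sort every suffix of every feature sequence, keeping its document id."""
    suffixes = []
    for doc_id, seq in enumerate(sequences):
        items = tuple(seq)
        for start in range(len(items)):
            suffixes.append((items[start:], doc_id))
    suffixes.sort()
    return suffixes


def _prefix_range(suffix_array, pattern):
    """Binary-search the block of suffixes that start with `pattern`."""
    k = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose k-prefix is >= pattern
        mid = (lo + hi) // 2
        if suffix_array[mid][0][:k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose k-prefix is > pattern
        mid = (lo + hi) // 2
        if suffix_array[mid][0][:k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def document_frequency(suffix_array, pattern):
    """Support of a pattern = number of distinct sequences containing it."""
    first, last = _prefix_range(suffix_array, tuple(pattern))
    return len({doc_id for _, doc_id in suffix_array[first:last]})


def frequent_patterns(sequences, min_support, max_len=3):
    """Return contiguous patterns whose document frequency meets min_support."""
    sa = build_suffix_array(sequences)
    candidates = {
        suffix[:k]
        for suffix, _ in sa
        for k in range(1, min(len(suffix), max_len) + 1)
    }
    result = {}
    for pattern in sorted(candidates):
        df = document_frequency(sa, pattern)
        if df >= min_support:
            result[pattern] = df
    return result


# Hypothetical feature sequences: one list of type identifiers per sentence.
sequences = [
    ["DRUG", "TREATS", "DISEASE"],
    ["DRUG", "TREATS", "SYMPTOM"],
    ["DISEASE", "CAUSES", "SYMPTOM"],
]
patterns = frequent_patterns(sequences, min_support=2)
```

Because all suffixes that begin with a given pattern are adjacent in the sorted array, counting the distinct document ids in that block gives the pattern's support without rescanning the sequences. General sequential pattern mining may also allow gaps between items, which this contiguous sketch does not cover.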
Keywords/Search Tags:Semantic Web, Ontology, Natural Language Processing, Dependency Relationship, Sequential Pattern, Suffix Array, Reverse Engineering