Automatic Knowledge Extraction From The Chinese Natural Language Web Documents And Knowledge Consolidation

Posted on: 2009-10-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Y Che
GTID: 1118360272476437
Subject: Computer software and theory
Abstract/Summary:
The Web is the largest and richest information repository available today. However, most information on the Web carries only layout-related syntactic markup and is readable only by humans, so computers cannot search and use it automatically and efficiently on people's behalf. The Semantic Web is an extension of the Web: it attaches semantic metadata to Web information so that computers can "understand" and process that information automatically. One of the biggest challenges in realizing the Semantic Web is the availability of semantic content. This can be addressed by adding semantic annotations to information that already exists on the Web and by generating new information directly together with its semantic annotations.

Automatic knowledge extraction methods recognize and extract, from Web documents, the factual knowledge that matches an ontology. This factual knowledge can be used to implement knowledge-based services, such as a semantic-based intelligent search engine that offers users convenient and accurate information retrieval, and it also supplies the semantic content needed to realize the Semantic Web.

Most existing knowledge extraction methods handle only English Web documents. With the rapid growth in the number of Chinese Web users and Chinese Web resources, research on automatic knowledge extraction from Chinese Web documents has good prospects. However, because of the characteristics of the Chinese language, Chinese natural language documents are very difficult to analyze and understand efficiently, and existing knowledge extraction methods for English cannot be applied to Chinese directly.
Developing a method that can automatically extract knowledge from Chinese natural language documents is therefore both challenging and meaningful.

Based on an analysis of related research and existing methods, this thesis studies domain ontology definition, automatic knowledge extraction from Chinese natural language Web documents, knowledge consolidation, and semantic-based intelligent search engines. The main results are as follows:

(1) The thesis introduces and analyzes the state of the art in knowledge extraction and knowledge consolidation. It classifies existing knowledge extraction methods by the type of document they target and by their degree of automation, analyzes the distinctive characteristics of Chinese and the difficulties they pose for analysis and understanding, and summarizes the open problems in knowledge extraction and knowledge consolidation.

(2) The thesis presents a domain ontology definition method that can describe N-ary relations. After a thorough analysis of the content of Chinese natural language Web documents, the thesis points out that they contain not only simple factual knowledge about binary relations between two entities, or between an entity and a value, but also a great deal of complex factual knowledge about N-ary relations among multiple entities and values. However, existing ontology definition methods offer no systematic way to define this kind of knowledge, and existing knowledge extraction methods do not extract it.
To solve this problem, the thesis presents a systematic domain ontology definition method that uses Aggregated Knowledge Concepts to encapsulate such N-ary relations and stresses that ontology concepts should be assigned appropriate property restrictions. This method not only characterizes the domain knowledge comprehensively but also gives strong support to automatic knowledge extraction and knowledge consolidation: during extraction it helps recognize properties and instances and check the validity and completeness of the knowledge, and during consolidation it helps eliminate contradiction and redundancy and merge the knowledge.

(3) The thesis presents an automatic knowledge extraction method for Chinese natural language Web documents. The extraction process consists of three steps: knowledge triple element recognition, knowledge triple composition, and knowledge cleaning.

After analyzing and summarizing existing methods for recognizing triple elements, the thesis points out that most of them either depend on large-scale linguistic databases or synonym tables, or can recognize only those elements that correspond directly to words in the text. However, existing general-purpose Chinese linguistic databases cannot provide accurate interpretations of domain-specific words, and building large-scale linguistic databases or synonym tables is so labor-intensive and time-consuming as to be impractical. Moreover, the elements of the knowledge implied by a document's content may have no literal counterpart among its words. To solve this problem, the thesis presents an ontology theme-based property recognition method and an ontology property restriction-based triple element recognition method.
Compared with existing methods, these methods have two main advantages. First, they need no large-scale linguistic databases or synonym tables. Second, they can infer elements that are only implied by the content, using the elements that appear explicitly in the content together with the domain ontology. The ontology theme-based method suits content with an obvious descriptive theme, and the ontology property restriction-based method suits ordinary Chinese natural language Web documents.

After analyzing the problems of knowledge triple composition for Chinese natural language Web documents, the thesis shows that it is very difficult to group the recognized ontology resources into triples that correctly represent the document's meaning. It presents two knowledge triple composition methods, one based on heuristic rules and one based on syntactic analysis. The syntactic analysis-based method looks for useful syntactic relations among words on the basis of the sentence's syntactic structure and the dependency relations between words. It also uses heuristic rules to handle omitted sentence components and reference resolution. Experiments show that this method achieves higher precision than the heuristic rules-based method and is suitable for ordinary Chinese natural language documents.

Because the triple element recognition and triple composition methods are imperfect and Web information is complex, the factual knowledge initially extracted from Web documents may be invalid or incomplete. The thesis presents an ontology property restriction-based knowledge cleaning method.
This method identifies and deletes invalid and incomplete factual knowledge that does not conform to the domain ontology, which safeguards the quality of the knowledge in the knowledge base (KB) and of the services built on it.

Experiments show that this automatic knowledge extraction method works well for Chinese natural language Web documents even without the support of large-scale linguistic databases or synonym tables, and that it can handle the complex aggregated knowledge about N-ary relations in the documents. Its precision, recall, and F1 measure are 87.26%, 58.82%, and 70.27% respectively, better than other related work. More importantly, the method is portable: it can be applied to a different domain as long as the corresponding domain ontology is provided.

(4) The thesis studies methods for knowledge consolidation. Knowledge consolidation comprises identifying and unifying equivalent instances and recognizing and handling redundant and contradictory knowledge. The thesis presents an ontology property restriction-based knowledge consolidation method. The method determines the key property set of each concept from the domain ontology and identifies equivalent instances by comparing the values of all key properties. This approach to recognizing equivalent instances is simple and intuitive and suits ordinary domain ontologies. The method also defines redundant and contradictory knowledge, provides ways to recognize and process it, and can detect semantically redundant or contradictory knowledge on the basis of the equivalent instances. It thereby keeps the KB consistent after new factual knowledge is merged in.

(5) The thesis designs and develops a semantic-based intelligent search engine system, CRAB.
After analyzing the main shortcomings of traditional search engines, the thesis points out that keyword-based querying, with results returned as a list of links to Web pages, cannot satisfy users' need to find information more accurately and conveniently. Based on the automatic knowledge extraction and knowledge consolidation methods presented, the thesis proposes and implements CRAB, a semantic-based intelligent search engine system. Compared with a traditional search engine, this system can automatically extract the factual knowledge that matches the domain ontology from domain-related Chinese Web documents and merge it into the domain ontology KB; lets users enter queries in a natural language-like manner; searches the KB for factual knowledge that is semantically related to the user's query; and generates a report that presents the query result directly, combining associated text and graphs. The system enables users to obtain thorough, direct, correct, and visual information conveniently. Its success also demonstrates the effectiveness of the underlying methods. The research results of this thesis, including automatic knowledge extraction from Chinese natural language Web documents, knowledge consolidation, and the semantic-based intelligent search engine, will enrich and advance research in the related areas in both theory and technology.
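
The consolidation step in item (4), identifying equivalent instances by comparing the values of all key properties, might be sketched roughly as follows. This is only an illustration and not the thesis's implementation: the `KEY_PROPERTIES` table, the `Person` concept, the property names, and the sample data are all hypothetical, and the sketch assumes single-valued properties, treating a repeated identical value as redundant and differing values as contradictory. The thesis's own definitions of redundancy and contradiction may be more refined.

```python
from collections import defaultdict

# Hypothetical key property sets per concept. In the thesis these are
# determined from the domain ontology; here they are hard-coded.
KEY_PROPERTIES = {"Person": ("name", "birth_date")}

# Hypothetical extracted instances (assumed single-valued properties).
EXAMPLE = [
    {"concept": "Person", "name": "Li Ming", "birth_date": "1980-01-01", "employer": "A"},
    {"concept": "Person", "name": "Li Ming", "birth_date": "1980-01-01", "employer": "B"},
    {"concept": "Person", "name": "Li Ming", "birth_date": "1975-05-05", "employer": "A"},
]


def key_of(instance):
    """Return (concept, key values...) or None if a key property is missing."""
    props = KEY_PROPERTIES.get(instance["concept"])
    if props is None:
        return None
    values = tuple(instance.get(p) for p in props)
    if any(v is None for v in values):
        return None
    return (instance["concept"],) + values


def group_equivalent(instances):
    """Group instances whose key property values all match."""
    groups = defaultdict(list)
    unmatched = []  # instances that cannot be compared on their keys
    for inst in instances:
        k = key_of(inst)
        if k is None:
            unmatched.append(inst)
        else:
            groups[k].append(inst)
    return list(groups.values()), unmatched


def merge(group):
    """Merge one group of equivalent instances.

    Identical repeated values are dropped as redundant; differing values
    for the same property are reported as conflicts, not resolved.
    """
    merged, conflicts = {}, {}
    for inst in group:
        for prop, value in inst.items():
            if prop not in merged:
                merged[prop] = value
            elif merged[prop] != value:
                conflicts.setdefault(prop, {merged[prop]}).add(value)
    return merged, conflicts


groups, unmatched = group_equivalent(EXAMPLE)
merged, conflicts = merge(groups[0])
```

With the sample data, the first two records fall into one group because their key values agree, and their differing `employer` values are reported as a conflict for later handling; the third record forms its own group.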
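
As a quick consistency check on the evaluation figures reported in item (3), the F1 measure can be recomputed from the stated precision and recall, assuming the standard definition of F1 as the harmonic mean of the two (the abstract does not spell out the formula). The function name `f1_score` is just a local helper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.8726, 0.5882)
print(f"{f1:.2%}")  # prints 70.27%
```

The result agrees with the reported F1 of 70.27%.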
Keywords/Search Tags: knowledge extraction, knowledge consolidation, Chinese natural language Web documents, ontology, Semantic Web, semantic-based intelligent search engine