A Research On Chinese Information Extraction Based On Construction Of Domain Ontology

Posted on: 2017-10-01
Degree: Master
Type: Thesis
Country: China
Candidate: S S Huang
Full Text: PDF
GTID: 2348330485972938
Subject: Information Science
Abstract/Summary:
The rapid growth and digitization of information promote knowledge sharing and make information easier to acquire, but they also make it harder for people to find the knowledge they need. In this context, information extraction, a technology dedicated to recognizing specific target information in various forms of information, has developed extensively. Chinese information extraction faces additional obstacles because of the distinctive characteristics of the Chinese language and its polysemy.

Biodiversity is the foundation of biology and ecology, and species, as the natural unit closest to the organisms themselves, can be regarded as the basis of biological diversity. Because of their sheer number, plant species are an important part of the biological species field. However, the large body of plant species descriptions mostly exists as text, which makes knowledge about plant species difficult to identify and use and may even hinder the future development of the field.

To make the information extraction scheme as widely applicable as possible, this study designs a scheme that uses an ontology as its supporting base and layered analysis as its analytic pattern: after analyzing the text structure, it extracts information from description text with rules. Together these form a complete information extraction framework. The scheme also provides a framework for building a domain ontology, which constructs a new ontology by reusing an existing ontology under a top-level ontology and so makes full use of existing domain knowledge. Plant species were chosen as the application domain in order to promote the development of knowledge in that field.
The information extraction scheme can be divided into the following four tasks.
(1) Construction of Domain Ontology: construct a Chinese plant species diversity ontology by reusing an existing ontology (PO) under a top-level ontology (BFO), and design and carry out the whole process of constructing a new ontology in this way.
(2) Generation of Text Set: analyze the structure of Web pages based on the DOM tree to obtain the original text blocks, then filter them for the target descriptions by calculating text similarity, yielding the text set to be extracted.
(3) Formation of Domain Dictionary and Tagging Set: form the domain dictionary by analyzing the domain ontology through Jena functions, then combine the domain dictionary with the text features and the needs of feature extraction to construct the tagging set.
(4) Tagging and Extraction: annotate the content through word segmentation; on this basis, analyze the content layer by layer according to the text structure, determine its semantic structure, and finally represent the knowledge in structured form.

In practice, the study chose Chinese plant species as the application domain. To support the information extraction task, a Chinese plant species diversity ontology was constructed by ontology reuse, and structured information was then extracted from plant species description text with rules based on that ontology. Across four groups of experimental data, the average accuracy rate reached 0.89, the average recall rate reached 0.88, and the average F-measure reached 0.88.
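
As a rough consistency check on the reported figures, the sketch below computes the balanced F-measure from the stated averages. It assumes that "accuracy rate" denotes precision and that the standard F1 definition applies; the abstract states neither, and it does not say whether the average F-measure was computed per group and then averaged. The function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Averages reported in the abstract (assumed: accuracy rate = precision).
f1 = f_measure(0.89, 0.88)
print(round(f1, 3))
```

The result lies between the two inputs and is close to the reported 0.88; a small difference is expected if the study averaged per-group F-measures rather than combining the averaged precision and recall.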
Contrast experiments also verified the applicability of the information extraction framework to different species. The study's main achievements are: (1) an innovation in information extraction method; (2) the construction of the Chinese Plant Species Diversity Ontology; (3) a better overall practical effect than similar domestic research.
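
The similarity-based filtering in task (2) can be illustrated with a minimal sketch. The abstract does not name the similarity measure, so cosine similarity over character bigrams is used here purely as an assumption (it needs no word segmentation, which is convenient for Chinese text). The function names, the reference description, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + 2]) for i in range(len(chars) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_blocks(blocks, reference, threshold=0.2):
    """Keep text blocks whose similarity to a reference description
    meets the (hypothetical) threshold."""
    ref = char_bigrams(reference)
    return [b for b in blocks
            if cosine_similarity(char_bigrams(b), ref) >= threshold]


# Hypothetical inputs: a reference description and two page text blocks.
reference = "乔木，叶片卵形，花白色，果实球形"
blocks = ["落叶乔木，叶片宽卵形，花白色", "版权所有 联系我们 网站地图"]
kept = filter_blocks(blocks, reference)
```

In this toy input the first block shares many bigrams with the reference description and is kept, while the navigation-style second block shares none and is dropped. A real pipeline would first obtain the blocks from the DOM tree, as the abstract describes.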
Keywords/Search Tags: Information Extraction, Domain Ontology, plant species diversity description, text processing