Information Extraction Of Chinese Biodiversity Document Based On Machine Learning

Posted on: 2012-03-16  Degree: Master  Type: Thesis
Country: China  Candidate: F Li  Full Text: PDF
GTID: 2178330335465095  Subject: Information Science
Abstract/Summary:
The core of information extraction is to identify and extract the data in documents that users are interested in, and then to present it in a more structured form that makes the data easier to query and use. In recent years, many researchers both in China and abroad have studied information extraction in various domains and achieved some success.

Increasingly serious environmental problems place higher demands on ecological and biological research, so this study chose biodiversity as its research domain. Species description is the starting point of biology and ecology, and the relevant documents should be a primary object of information organization and use. However, taxonomic descriptions are usually written in natural language and lack consistency, which makes them difficult to use effectively. Several research institutions and researchers have tried to convert such text into new digital formats (XML or RDF) in order to strengthen the foundation of biological and ecological research. Cui et al. designed and developed a system called MARTT that achieves good markup results. The leading words algorithm included in that system outperforms two general-purpose learning algorithms (support vector machines and Naive Bayes) in precision and recall.

Based on an in-depth study of the principles of MARTT and its self-built machine learning algorithm, a semantic annotation system was implemented for Chinese biodiversity documents, using the taxonomic species descriptions in "Flora of China" as the data set. This paper has five main parts:

(1) The acquisition of the data set and its XML tagging. This paper designed an XML tagging structure for plant taxonomic descriptions and converted the PDF-formatted collection to XML by tagging it.

(2) The selection of Chinese word segmentation software. This paper chose the most suitable word segmentation software by comparing the results of different Chinese word segmentation tools.

(3) The construction of the markup algorithm. This paper designed and implemented machine learning algorithms that semantically annotate Chinese plant taxonomy documents.

(4) The building of an SVM platform for comparison. This paper used the support vector machine algorithm, through the LIBSVM software package, to classify the test documents.

(5) The assessment of annotation results. This paper divided the data collection into a training set and a test set, learned tagging rules from the training set, and used them to mark up the test set. The annotation results were assessed by tagging accuracy.

The assessment shows that the system largely completes the annotation of the main structure of the documents, but the annotation of a few elements needs to be improved. Overall, its tagging accuracy is better than that of the SVM text categorization system.

Applying learning-based semantic annotation to plant taxonomy documents is meaningful in several ways. Firstly, the selection of "Flora of China" as the source of the data set has strong practical significance and potential value. Secondly, semantic annotation is the foundation and core of semantic-based information organization and use. Once semantic annotation is complete, structured XML retrieval, federated search, and other innovative uses of the information can be built on it. Finally, this study can support research in biology and ecology to some degree, and it also has practical value for related research in other fields.

This paper gives a preliminary introduction to the important parts of the semantic annotation system and proposes solutions to the problems of implementing the whole system. However, enlarging the data set, simplifying the tagging work, and optimizing the tagging structure all need further discussion.
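The evaluation procedure described in the abstract (learn tagging rules from a training set, mark up a test set, score the result by tagging accuracy) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm and not MARTT's implementation: it assumes a toy rule learner that maps the first segmented word of a clause to its most frequent tag, and the function names, tag names, and example clauses are all hypothetical.

```python
from collections import Counter, defaultdict

def train_leading_word_rules(examples):
    """Learn a 'leading word -> most frequent tag' table (toy illustration).

    examples: list of (tokens, tag), where tokens is a clause that has
    already been split by a Chinese word segmentation tool.
    """
    counts = defaultdict(Counter)
    tag_totals = Counter()
    for tokens, tag in examples:
        counts[tokens[0]][tag] += 1
        tag_totals[tag] += 1
    rules = {lead: c.most_common(1)[0][0] for lead, c in counts.items()}
    # Fall back to the most frequent training tag for unseen leading words.
    default_tag = tag_totals.most_common(1)[0][0]
    return rules, default_tag

def predict(rules, default_tag, tokens):
    """Tag a segmented clause by looking up its leading word."""
    return rules.get(tokens[0], default_tag)

def tagging_accuracy(rules, default_tag, test_examples):
    """Fraction of test clauses whose predicted tag equals the gold tag."""
    correct = sum(
        predict(rules, default_tag, tokens) == tag
        for tokens, tag in test_examples
    )
    return correct / len(test_examples)

# Hypothetical, hand-made data: pre-segmented clauses with invented tags.
train_set = [
    (["叶", "对生"], "leaf"),      # leaf, opposite
    (["叶", "互生"], "leaf"),      # leaf, alternate
    (["叶", "卵形"], "leaf"),      # leaf, ovate
    (["花", "白色"], "flower"),    # flower, white
    (["花", "黄色"], "flower"),    # flower, yellow
    (["果", "球形"], "fruit"),     # fruit, globose
]
test_set = [
    (["叶", "披针形"], "leaf"),    # leaf, lanceolate
    (["花", "红色"], "flower"),    # flower, red
    (["果", "卵形"], "fruit"),     # fruit, ovate
    (["种子", "黑色"], "seed"),    # seed, black (leading word unseen in training)
]

rules, default_tag = train_leading_word_rules(train_set)
accuracy = tagging_accuracy(rules, default_tag, test_set)
print(f"tagging accuracy: {accuracy:.2f}")
```

In this toy data the clause that starts with an unseen leading word falls back to the default tag and is counted as an error, which mirrors in miniature why the annotation of rarely seen elements can lag behind that of the main structure.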
Keywords/Search Tags: machine learning, biological diversity, information extraction, Chinese documents