
Research On Information Extraction Technology Of Tibetan Culture

Posted on: 2017-03-29  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Feng
GTID: 2278330485455837  Subject: Software engineering
Abstract/Summary:
Information extraction identifies specific factual information in text and stores it as structured data in a database for querying and further use. Its main tasks are entity recognition, relation extraction and event extraction. Besides general information extraction theory and technology, researchers also study information extraction for specific domains. This thesis studies information extraction for the Tibetan culture domain and covers three topics: entity extraction, relation extraction and event extraction.

Entity extraction for the Tibetan culture domain has two parts. First, building on TextRank, an unsupervised algorithm, a hybrid TextRank that uses both node weights and edge weights is proposed. In an experiment on a 780 KB Tibetan culture domain corpus, the precision of the top 100 words reaches 81%. Second, person names are very important for relation extraction and event extraction, but existing Chinese name recognition systems do not meet the needs of recognizing translated Tibetan names. This thesis therefore proposes a method for recognizing translated Tibetan names based on Tibetan culture domain knowledge. Experiments on 1.9M of text show that the F1-measure increases from 40.08% to 87.92%.

According to the characteristics of the Tibetan culture domain, the job change, birthplace and graduation relations are extracted by pattern matching. A semi-supervised machine learning algorithm, Bootstrapping, and a thesaurus extension method are used to acquire relation verbs and patterns. Experiments on 1.9M of text show that the F1-measures of the three relations are 81.37%, 80.56% and 81.32%.
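The hybrid TextRank idea mentioned above (ranking words with both node weights and edge weights) can be illustrated with a minimal sketch. This is not the thesis's actual algorithm, whose weighting scheme the abstract does not give. As assumptions for illustration only, edge weights are co-occurrence counts within a sliding window, and node weights are a hypothetical prior score per word (for example, from a domain lexicon) used as the restart distribution. The function name `hybrid_textrank` and all parameters are invented here.

```python
from collections import defaultdict


def hybrid_textrank(tokens, node_weight, window=3, damping=0.85, iterations=50):
    """Rank words with a TextRank variant using edge and node weights.

    tokens: list of words, already segmented and filtered.
    node_weight: dict mapping a word to a prior weight (hypothetical input);
                 words missing from the dict get weight 1.0.
    window: span in tokens, counting the word itself, used for co-occurrence.
    """
    # Edge weights: how often two different words co-occur inside the window.
    edges = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                edges[w][v] += 1.0
                edges[v][w] += 1.0

    vocab = sorted(set(tokens))
    # Node weights, normalised, act as the restart (teleport) distribution.
    total = sum(node_weight.get(w, 1.0) for w in vocab)
    prior = {w: node_weight.get(w, 1.0) / total for w in vocab}
    out_sum = {w: sum(edges[w].values()) for w in vocab}

    score = dict(prior)
    for _ in range(iterations):
        new_score = {}
        for w in vocab:
            # Each neighbour passes on its score in proportion to edge weight.
            incoming = sum(
                score[v] * edges[v][w] / out_sum[v]
                for v in edges[w]
                if out_sum[v] > 0
            )
            new_score[w] = (1 - damping) * prior[w] + damping * incoming
        score = new_score
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    words = ["monastery", "lama", "monastery", "festival",
             "lama", "monastery", "butter"]
    for word, value in hybrid_textrank(words, {"lama": 5.0}):
        print(word, round(value, 4))
```

With uniform node weights this reduces to ordinary edge-weighted TextRank, so the node weights are the only place where domain knowledge enters in this sketch.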
In addition, on the basis of Wikipedia, some relations in the Tibetan religion domain, such as organizational affiliation, death and religious sect, are drawn using the Gephi software.

According to the characteristics of events in the Tibetan culture domain, meeting events and Tibetan festival events are extracted by pattern matching. Experiments on a 647 KB meeting event corpus and a 320 KB Tibetan festival event corpus give F1-measures of 85.11% and 84.03%, respectively.
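The combination of pattern matching with Bootstrapping described in the abstract can be sketched roughly as follows. This is a simplified illustration, not the thesis's implementation: the example sentences are invented English stand-ins for the real corpus, patterns are reduced to the literal text between two single-word arguments, and the pattern scoring and thesaurus extension that a real system would need to limit semantic drift are omitted. All function names are hypothetical.

```python
import re


def extract(sentences, patterns):
    """Apply each 'X <middle> Y' pattern and collect the (X, Y) pairs."""
    pairs = set()
    for middle in patterns:
        regex = re.compile(r"(\w+) " + re.escape(middle) + r" (\w+)")
        for sentence in sentences:
            pairs.update(regex.findall(sentence))
    return pairs


def learn_patterns(sentences, pairs):
    """Take the text between known (X, Y) pairs as new candidate patterns."""
    patterns = set()
    for x, y in pairs:
        regex = re.compile(re.escape(x) + r" (.+?) " + re.escape(y) + r"\b")
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def bootstrap(sentences, seed_patterns, rounds=2):
    """Alternate between extracting pairs and learning new patterns."""
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        pairs |= extract(sentences, patterns)
        patterns |= learn_patterns(sentences, pairs)
    return patterns, pairs


if __name__ == "__main__":
    corpus = [
        "Tashi was born in Lhasa",
        "Tashi is a native of Lhasa",
        "Dorje is a native of Shigatse",
    ]
    learned, found = bootstrap(corpus, {"was born in"})
    print(sorted(learned))
    print(sorted(found))
```

Starting from one seed pattern for the birthplace relation, the first round finds a known pair, the pair reveals a second phrasing, and the second round uses that phrasing to find a pair the seed alone would have missed.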
Keywords/Search Tags: Tibetan culture domain information extraction, translated Tibetan name recognition, hybrid TextRank, Bootstrapping, pattern matching