Research On Information Extraction Technology Of Tibetan Culture

Posted on:2017-03-29

Degree:Master

Type:Thesis

Country:China

Candidate:Y H Feng

Full Text:PDF

GTID:2278330485455837

Subject:Software engineering

Abstract/Summary:

Information extraction identifies specific factual information in text and stores it as structured data in a database for querying and further use. Its main tasks are entity recognition, relation extraction, and event extraction. Beyond general information extraction theory and technology, researchers also study extraction in specific domains. This thesis studies information extraction for the Tibetan culture domain, covering three areas: entity extraction, relation extraction, and event extraction.

Entity extraction in the Tibetan culture domain has two parts. First, building on TextRank, an unsupervised learning algorithm, a hybrid TextRank based on node weights and edge weights is proposed. In an experiment on a 780 KB Tibetan culture domain corpus, the precision of the top 100 words reaches 81%. Second, person names are very important for relation extraction and event extraction, but existing Chinese name recognition systems do not meet the needs of recognizing translated Tibetan names. For this reason, the thesis proposes a method for recognizing translated Tibetan names based on Tibetan culture domain knowledge. On 1.9M of text, the F1-measure increases from 40.08% to 87.92%.

Based on the characteristics of the Tibetan culture domain, job change, birthplace, and graduation relations are extracted by pattern matching. Bootstrapping, a semi-supervised machine learning algorithm, and a thesaurus extension method are used to acquire relation verbs and patterns. On 1.9M of text, the F1-measures for the three relations are 81.37%, 80.56%, and 81.32%, respectively. In addition, using Wikipedia as a source, some relations in the Tibetan religion domain, such as organizational affiliation, death, and religious sect, are drawn with the Gephi software.

Based on the characteristics of events in the Tibetan culture domain, meeting events and Tibetan festival events are extracted by pattern matching. Experiments on a 647 KB meeting event corpus and a 320 KB Tibetan festival event corpus yield F1-measures of 85.11% and 84.03%, respectively.
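
The abstract does not say how the hybrid TextRank combines node weights with edge weights, so the following is only a minimal sketch under stated assumptions, not the thesis's method. It assumes edge weights are co-occurrence counts within a sliding window and node weights act as a prior in the random-jump term. The function name `hybrid_textrank` and all parameters are hypothetical.

```python
from collections import defaultdict


def hybrid_textrank(tokens, node_weight=None, window=2, damping=0.85, iterations=50):
    """Rank words with a weighted TextRank (illustrative sketch only).

    Assumptions, not taken from the thesis:
      * edge weight = number of co-occurrences within `window` tokens
      * node weight = positive prior used in the random-jump term
    """
    node_weight = node_weight or {}

    # Build an undirected co-occurrence graph with counts as edge weights.
    edges = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                edges[word][other] += 1.0
                edges[other][word] += 1.0

    nodes = sorted(set(tokens))
    if not nodes:
        return {}

    # Normalize node weights into a prior distribution (default weight 1.0).
    prior = {n: node_weight.get(n, 1.0) for n in nodes}
    total = sum(prior.values())
    prior = {n: p / total for n, p in prior.items()}

    # Total edge weight leaving each node, used to normalize transitions.
    out_sum = {n: sum(edges[n].values()) for n in nodes}

    score = dict(prior)
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            # Each neighbor passes on its score in proportion to edge weight.
            incoming = sum(score[m] * w / out_sum[m] for m, w in edges[n].items())
            updated[n] = (1 - damping) * prior[n] + damping * incoming
        score = updated
    return score


if __name__ == "__main__":
    # Toy English tokens standing in for a segmented domain corpus.
    words = ["monastery", "lama", "monastery", "festival", "lama", "temple"]
    ranked = sorted(hybrid_textrank(words).items(), key=lambda kv: -kv[1])
    for word, value in ranked:
        print(f"{word}\t{value:.4f}")
```

Taking the highest-scoring words from the returned dictionary gives a candidate term list, which could then be scored for top-k precision in the way the abstract reports for the top 100 words.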

Keywords/Search Tags:

Tibetan culture domain information extraction, translation of Tibetan name recognition, hybrid TextRank, Bootstrapping, pattern matching

Related items

1	Research On Some Key Technologies Of Tibetan Machine Translation Based On Tree To String
2	Tibet's Present Status Of National Culture Class Television Programs
3	Study On Tibetan Information Retrieval&Search Results Clustering And System Implementation
4	Design And Implementation Of Sanskrit Tibetan Input System On Android Platform
5	The Research Of Phrase Extraction Technology For Tibetan And Chinese Statistical Machine Translation
6	Southwest Jiaotong University Research And Realization Of Tibetan Encoding Recognition And Conversion
7	Research On Tibetan Entity Relationship Extraction Based On Remote Supervision And Attention Mechanism
8	Research On Tibetan-Chinese Machine Translation Under The Condition Of Sparse Resources
9	Research On Real-time Monitoring Technology For Tibetan Text Based On HTTP Protocol
10	Research On Automatic Disambiguation Method Of Tibetan Word Meaning Based On Chinese And Tibetan Parallel Corpus