
Research On Rule-based Extraction Of Mongolian Character Attributes

Posted on: 2019-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: M J Hu
Full Text: PDF
GTID: 2428330563457215
Subject: Software engineering
Abstract/Summary:
In the Internet era of information explosion, the web holds massive amounts of information and data, most of it in the form of semi-structured or free text. Users demand ever higher retrieval efficiency and ever more accurate results, so Information Extraction (IE) technology has developed rapidly to help users reach the target content quickly. Although Information Extraction has produced many practical results for Chinese and English, research on Information Extraction for Mongolian, a minority language, is still at an early stage.

The information extraction addressed in this thesis is specifically entity relation extraction. It aims to extract target information, such as character attribute values, from large-scale text data, to save the results in a structured form, and to make them available for later user queries. This research is also groundwork for building network applications such as knowledge bases or Mongolian character search engines. The thesis uses unstructured Mongolian texts crawled from Mongolian news websites to study the rule-based extraction of character attributes for characters (people) who attract wide attention. The key research work of this thesis is as follows:

(1) We designed crawler tools and crawled texts from several Mongolian news websites, based on the structure of their web pages and the characteristics of their URLs. We then performed multiple pre-processing tasks, including Named Entity Recognition with a model that combines BLSTM and CRF. This series of pre-processing steps produced the web text corpora used in the subsequent analysis.

(2) We used a manually created trigger word table and rule base to extract the character attribute values contained in the corpus and saved them as "character-attribute-attribute value" triples. The accuracy of the extraction results is high enough for practical use, which shows that the rule-based extraction method proposed in this thesis is feasible and effective.

(3) We designed and implemented a Mongolian character attribute extraction system, which provides two functional modules: character attribute extraction and querying of character-related attribute information.
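
The trigger-word and rule-base step in item (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the trigger table, the rule shape (trigger word followed by the value up to the next punctuation mark), the function names, and the English example sentence are all assumptions made for illustration. The real system works on Mongolian text and takes its character names from the Named Entity Recognition step.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """One "character-attribute-attribute value" record."""
    character: str
    attribute: str
    value: str


# Hypothetical trigger-word table: attribute name -> trigger phrases.
# The thesis builds its table manually for Mongolian; these English
# entries are placeholders.
TRIGGERS = {
    "birthplace": ["was born in", "is a native of"],
    "occupation": ["works as", "serves as"],
}


def build_rules(triggers):
    """Compile one regex per trigger phrase.

    Assumed rule shape: trigger, optional article, then the value,
    which runs up to the next comma, period, or semicolon.
    """
    rules = []
    for attribute, phrases in triggers.items():
        for phrase in phrases:
            pattern = re.compile(
                re.escape(phrase) + r"\s+(?:an?\s+|the\s+)?([^,.;]+)"
            )
            rules.append((attribute, pattern))
    return rules


def extract(sentence, characters, rules):
    """Return a triple for each rule that fires after a known character name."""
    triples = []
    for name in characters:
        pos = sentence.find(name)
        if pos < 0:
            continue  # character not mentioned in this sentence
        tail = sentence[pos + len(name):]
        for attribute, pattern in rules:
            match = pattern.search(tail)
            if match:
                triples.append(Triple(name, attribute, match.group(1).strip()))
    return triples


rules = build_rules(TRIGGERS)
sentence = "Bat-Erdene was born in Ulaanbaatar, and he works as a teacher."
for triple in extract(sentence, ["Bat-Erdene"], rules):
    print(triple)
```

Storing each result as a flat triple keeps the output easy to load into a database table or a knowledge base, which matches the query use case described in item (3).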
Keywords/Search Tags:Mongolian information extraction, character attributes, Web Crawler, Named Entity Recognition, trigger words, rule-based