Research On Part Of Speech Tagging System Of Pre-Qin Classics Oriented To Entity Extraction

Posted on: 2020-01-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yuan
Full Text: PDF
GTID: 2518306314995919
Subject: Master of Library and Information
Abstract/Summary:
Natural language processing is an important field of artificial intelligence. It not only helps people extract the information they need from enormous volumes of linguistic data, but also enables machines to understand grammar and semantics and respond appropriately. With the steady progress of modern Chinese language processing over the past decade, researchers have gradually become able to extract useful entity information from modern Chinese text and to perform grammatical analysis. Ancient Chinese, the traditional form of the language, preserves a large body of historical material and likewise calls for intelligent processing.

Part of speech is one of the most important attributes of a word and an important link between vocabulary and syntax; it provides a great deal of information about a word and its context. A sound part-of-speech (POS) tagging system is the prerequisite for POS tagging. At present, four POS tagging systems are regarded as authoritative: the basic Pre-Qin Chinese POS tag set from Nanjing Normal University, the Chinese text POS tag set from Peking University, the Chinese POS tag set from the Institute of Computing Technology, and the modern Chinese POS tagging standard published by the Ministry of Education. Among them, only the Nanjing Normal University set was designed for ancient Chinese; the other three serve modern Chinese text, and their accuracy in the experiments was low.

This study is based on the characteristics of the Sinological Index Series. Taking the Nanjing Normal University Pre-Qin Chinese POS tag set as the core, supplemented by the POS tag sets of Peking University, the Institute of Computing Technology of the Chinese Academy of Sciences, and the Ministry of Education, it constructs three tag sets of different sizes. The construction strictly follows the rules and characteristics of ancient Chinese, applies the principle of consistent syntactic function, and systematically takes into account the principle of semantic independence and integrity. Named entity extraction experiments with conditional random fields are then run with each new tag set and with the Nanjing Normal University basic set, in order to determine which POS tagging system is better suited to the study of ancient Chinese.

The corpus comes from the Sinological Index Series, the index of ancient Chinese classics published by the Harvard-Yenching Institute, and includes "Chunqiujingzhuan", "Guoyu", "Lunyu", "Mengzi", "Mozi", "Xinzi", and "Zhuangzi". After the corpus was entered into the computer and POS-tagged, conditional random field features were selected and feature templates were constructed. Based on the texts, three features were selected: word form, word length, and left and right boundary words. Named entity recognition experiments were carried out on "Zhuzi" and "Chunqiu" using different combinations of these features.

The experimental results show that all three proposed POS tag sets outperform the Nanjing Normal University basic set in entity extraction across the different experiments. The highest harmonic mean values on the two parts of the corpus reached 80.34% and 83.33%, respectively. Entity extraction performs best with the second tag set (POS tagging 2) combined with the features of part of speech, word length, and left and right boundary words; both the accuracy rate and the recall rate improve over the Nanjing Normal University basic set. POS tagging 2 is therefore named the Nanjing Agricultural University Ancient Chinese Part of Speech tagging (NACP). NACP is better suited to the study of ancient Chinese and can be used for subsequent semantic knowledge mining. Finally, the incorrectly labeled portions of the corpus are extracted and analyzed. The errors are mainly due to personal names and place names being recognized incorrectly and to uncommon personal names and place names not being recognized. The paper also proposes corresponding improvement measures.
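
The feature combination mentioned in the abstract (part of speech, word length, left and right boundary words) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the `<BOS>`/`<EOS>` padding symbols, the sample sentence, and the tags `nr` and `v` are hypothetical placeholders rather than NACP tags, and boundary words are assumed here to mean the immediately adjacent words. The sketch also assumes that the "harmonic mean" refers to the usual harmonic mean of the accuracy (precision) rate and the recall rate; the input values in the usage lines are invented for illustration and are not results from the thesis.

```python
from typing import Dict, List, Tuple

# A token is a (word, POS tag) pair. The tags used below are hypothetical.
Token = Tuple[str, str]


def token_features(sentence: List[Token], i: int) -> Dict[str, str]:
    """Build one feature dictionary for the token at position i."""
    word, pos = sentence[i]
    # Assumption: the boundary words are the immediate left and right neighbours.
    left = sentence[i - 1][0] if i > 0 else "<BOS>"
    right = sentence[i + 1][0] if i < len(sentence) - 1 else "<EOS>"
    return {
        "word": word,              # the word itself
        "pos": pos,                # part-of-speech tag
        "length": str(len(word)),  # word length in characters
        "left_word": left,         # left boundary word
        "right_word": right,       # right boundary word
    }


def sentence_features(sentence: List[Token]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token in a sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


def harmonic_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical tagged sentence with placeholder tags.
sample = [("孟子", "nr"), ("見", "v"), ("梁惠王", "nr")]
features = sentence_features(sample)

# Invented precision and recall values, for illustration only.
score = harmonic_mean(0.8, 0.5)
```

Feature dictionaries of this shape are a common input format for conditional random field toolkits; the feature templates actually used in the thesis may differ.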
Keywords/Search Tags:Digital humanities, Ancient Chinese character information processing, Parts of speech tagging, Named entity extraction