Study On Uyghur Named Entity Recognition And Related Problems

Posted on: 2019-01-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H M T M M T Mai
Full Text: PDF
GTID: 1528305651965769
Subject: Computer application technology
Abstract/Summary:
Named entity recognition (NER) is a classic problem in natural language processing (NLP). It identifies specific entities in text, including person names, place names, organization names, and other proper nouns. Because Uyghur has distinctive lexical and linguistic features, techniques designed for English and Chinese cannot be applied to it directly. At present, there is no publicly available Uyghur corpus tagged with named entities. This work constructs such a corpus by manual annotation. After a detailed analysis of the grammatical and semantic features of Uyghur named entities, and given the strong performance of conditional random fields (CRFs) on sequence labelling tasks, we first study Uyghur NER with a CRF model. In designing the feature templates, we use word, syllable, POS tag, and distributed vector representation features and analyze their influence on NER. Second, we study Uyghur NER further with deep learning, using character embeddings and syllable embeddings to improve system performance. Finally, we apply the NER results in a word-vector-based method for automatically extracting cross-language named entity translation pairs. The main contributions are as follows:

1. Uyghur named entity tagged corpus construction: We use existing bilingual resources and Chinese NER results to construct a Uyghur named entity corpus (UNEC), comprising a person name tagged corpus, a location name tagged corpus, an organization name tagged corpus, and an integrated corpus covering person, place, and organization names. This work fills the gap left by the lack of named-entity-tagged Uyghur corpora and provides open data resources for Uyghur NLP research.

2. Uyghur part-of-speech (POS) tagging: We use a bidirectional long short-term memory network with a CRF layer (BI-LSTM-CRF) for Uyghur POS tagging and propose a method that combines character embeddings, word embeddings, syllable features, and suffix features to further improve tagging performance. The resulting POS tagging system is fast and effective, and it outperforms all known methods included in the comparative experiment.

3. Uyghur NER based on CRFs and unsupervised feature extraction: We propose a method for extracting syllable features and similar-word features, which improves the efficiency of Uyghur NER. The proposed syllable feature can almost replace stem and affix features. The similar-word feature, which is extracted from large-scale unlabeled corpora to capture the semantic and syntactic information of words, achieves almost the same recognition efficiency as lexical features and is even superior to morphological and dictionary features in some recognition tasks. The proposed feature extraction method greatly reduces the cost of creating engineered features and improves the performance of Uyghur NER.

4. Syllable embedding for neural Uyghur NER: Because Uyghur contains many transliterated named entities and its syllable structure is relatively distinctive, we propose a syllable embedding for the BI-LSTM-CRF model and carry out a comprehensive study of neural-network-based Uyghur NER, verifying the effectiveness of the syllable-based word representation. Furthermore, we study the impact of different word representations on Uyghur NER with deep learning methods and mitigate the problems of data sparseness, unknown-word tagging, and manual feature construction.

5. Cross-language named entity translation pair extraction based on bilingual word vectors and NER: Building on the Uyghur NER results, we propose a word-vector-based method for extracting multilingual named entity equivalent pairs. After running NER separately on each side of the bilingual aligned sentences, we merge the bilingual sentences to train bilingual word vectors and then extract equivalent entity translation pairs using different strategies.

6. Web service platform: Based on the research results of this work, we build a web service platform for Uyghur natural language processing. Its main services include Uyghur POS tagging (with a selectable processing depth of 15, 25, or 64 tags), named entity recognition, tokenization, syllabification, and sentence boundary detection.
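The translation pair extraction step in item 5 can be illustrated with a minimal sketch. It assumes that entities from both languages already have vectors in one shared space, and it uses nearest-neighbour matching by cosine similarity with a threshold. That strategy is only one plausible choice: the abstract does not say which strategies the dissertation uses. The function names, the threshold value, the placeholder entity labels, and the toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_pairs(src_entities, tgt_entities, vectors, threshold=0.8):
    """Pair each source entity with its most similar target entity.

    Assumed strategy (not taken from the dissertation): keep the
    nearest target entity if its similarity reaches `threshold`.
    Entities without a vector are skipped.
    """
    pairs = []
    for src in src_entities:
        if src not in vectors:
            continue
        best, best_score = None, threshold
        for tgt in tgt_entities:
            if tgt not in vectors:
                continue
            score = cosine(vectors[src], vectors[tgt])
            if score >= best_score:
                best, best_score = tgt, score
        if best is not None:
            pairs.append((src, best, best_score))
    return pairs


# Toy vectors standing in for trained bilingual word vectors.
# The labels are placeholders, not real corpus entries.
toy_vectors = {
    "ug_city_1": [0.9, 0.1, 0.0],
    "ug_person_1": [0.0, 0.2, 0.9],
    "zh_city_1": [0.8, 0.2, 0.1],
    "zh_person_1": [0.1, 0.1, 0.95],
}

pairs = extract_pairs(
    ["ug_city_1", "ug_person_1"],
    ["zh_city_1", "zh_person_1"],
    toy_vectors,
)
for src, tgt, score in pairs:
    print(f"{src} <-> {tgt} (cosine {score:.2f})")
```

In practice the candidate sets would come from the NER output on each side of an aligned sentence pair, which keeps the search small and limits spurious matches.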
Keywords/Search Tags:Uyghur Language, Named Entity Recognition, Neural Network, Extraction of Named Entity Translation Equivalents, POS Tagging, Unsupervised Feature Extraction, Syllable-Embedding, Character-Embedding