Research On Named Entity Recognition And Disambiguation Based On Network Semantic Resource

Posted on: 2017-11-24 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: J Zhou
GTID: 1368330596459983 | Subject: Information and Communication Engineering
Abstract/Summary:
With the arrival of the big data era, transforming information into knowledge has become an important trend in information management. As common tools for storing and managing knowledge, knowledge bases greatly improve the intelligent processing ability of computers. The automatic construction of entity knowledge bases, which uses the abundant resources on the Internet as the source for knowledge acquisition, has therefore received much attention. Named entity recognition and disambiguation are key technologies in this application: they identify the entity mentions that appear in network resources and find the real-world entities those mentions refer to. However, applications built on network resources place new requirements on named entity recognition and disambiguation: (1) more types of named entities are adopted, no longer limited to the traditional common types, and the shortage of standard corpora needs to be solved; (2) for vast volumes of network data in various forms, the performance and efficiency of the methods need to be improved further. In recent years, the emergence and development of open semantic resources on the network has provided much richer semantic information for named entity recognition and disambiguation. By making extensive use of this information, the shortcomings of traditional methods applied to entity knowledge base construction can be solved or alleviated effectively. This thesis studies named entity recognition and disambiguation based on network semantic resources, covering entity classification, named entity recognition, person name disambiguation, and entity linking disambiguation. The main contributions are as follows:

(1) A method for fine-grained entity classification is studied. Current research on entity classification focuses on English. Because language characteristics and existing resources differ greatly across languages, many features used for English cannot be applied directly to other languages. To solve this problem, a method is proposed for classifying Chinese Wikipedia articles into fine-grained named entity types. The method considers the characteristics of Chinese named entities and the multi-faceted information in Chinese Wikipedia, and constructs four feature sets: article content features, structured features, category features, and article title features. Different feature selection methods are then designed for each feature set, and the feature sets are fused into one vector space using different strategies. Finally, the Chinese Wikipedia articles are classified with an SVM. Experimental results show that this method effectively improves entity classification performance.

(2) A method for open-domain named entity recognition is studied. Entity types are extended over time while the corresponding standard corpora are insufficient to train recognition models, and the data to be processed usually span several different domains. To solve this problem, a method of open-domain named entity recognition based on automatic corpus generation is presented. By making full use of the content and structure information in Chinese Wikipedia, a large-scale Chinese named entity recognition (NER) corpus containing nearly 2.3 million sentences is built. A tagged-corpus selection approach then selects tagged sentences according to the domain of the test data. Finally, CRFs are used to train the model that recognizes named entities in text. Experimental results show that the generated NER corpus is of good quality and meets the demand for tagged corpora in open-domain named entity recognition. Moreover, selecting tagged corpora addresses the problem of domain transfer and also improves recognition performance.

(3) A method of global clustering for person name disambiguation is studied. Documents contain some important features that are excellent indicators of the real identity behind a name mention, but existing methods cannot identify these features effectively when disambiguating person names. To solve this problem, a method of Chinese person name disambiguation based on two-stage clustering is put forward. First, three kinds of core evidence (direct social relations, indirect social relations, and common description prefixes) are extracted to recognize document pairs that refer to the same person entity, yielding an initial clustering of person names with high precision. Then, taking the initial clustering as the new input, the statistical properties of multiple documents are used to evaluate each feature, and a double-vector representation of clusters is constructed. On this basis the final clustering of person names is generated, and clustering recall is improved effectively. Experiments on the CLP2010 Chinese person name disambiguation dataset show that this method performs well in person name clustering disambiguation.

(4) A method of incremental clustering for person name disambiguation is studied. Most existing methods focus on global clustering for person name disambiguation, but they are usually inefficient on large-scale data and cannot support incremental clustering. To solve this problem, an incremental clustering method for person name disambiguation based on key evidence and E2LSH is presented. First, the global clustering method is used to cluster the initial document set, which reduces the number of documents and ensures clustering performance. Then, key evidence and the E2LSH algorithm are used to generate a candidate document set, which significantly reduces the number of documents whose similarity must be computed and improves efficiency. Finally, the group to which each new document belongs is identified, realizing incremental clustering for person name disambiguation. Experimental results show that this method improves clustering efficiency for person name disambiguation while achieving good clustering performance.

(5) A method for entity linking disambiguation is explored. Different kinds of evidence play different roles in entity disambiguation. To use each kind of evidence separately, a weakly supervised entity linking method based on an evidence model is proposed. First, an entity representation is designed from three kinds of evidence (context, social relations, and entity name), giving a structured representation of target entities. Then, a quantitative evaluation of the evidence is designed to measure its disambiguation ability; through this evaluation, the role of important evidence in entity disambiguation is strengthened. Finally, the overall relevance between mentions and candidate entities is computed, and entity linking disambiguation is realized. Experimental results show that this method achieves strong performance, effectively reduces the dependency on training data, and adapts well.

(6) An entity knowledge base is constructed and applied. Using the entity recognition and disambiguation techniques above, a data model for the entity knowledge base is designed, and a Chinese entity knowledge base is constructed automatically. In terms of data structure, a storage and management structure is proposed based on a three-layer data model, and different types of knowledge are extracted according to the characteristics of online resources. Furthermore, precise search for target entities and recommendation of related entities are realized.
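The candidate-generation idea in contribution (4) can be illustrated with a small sketch. It is not the thesis implementation: it assumes each document has already been turned into a fixed-length numeric vector, it uses the standard p-stable hashing scheme behind E2LSH, h(v) = floor((a·v + b) / w), and it leaves out the key-evidence filter entirely. The names `E2LSHIndex`, `assign_incrementally`, `bucket_width`, and `max_dist` are hypothetical and chosen only for this illustration.

```python
import math
import random
from collections import defaultdict


class E2LSHIndex:
    """Toy p-stable (Gaussian) LSH index for Euclidean distance.

    Illustrative only: parameter names and defaults are assumptions.
    """

    def __init__(self, dim, num_tables=8, hashes_per_table=4,
                 bucket_width=4.0, seed=0):
        rng = random.Random(seed)
        self.w = bucket_width
        self.tables = []
        for _ in range(num_tables):
            # Each hash function is a pair (a, b): a ~ N(0, 1)^dim, b ~ U[0, w).
            funcs = [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)],
                 rng.uniform(0.0, bucket_width))
                for _ in range(hashes_per_table)
            ]
            self.tables.append((funcs, defaultdict(set)))

    def _key(self, funcs, vec):
        # Concatenate h(v) = floor((a . v + b) / w) over the table's functions.
        return tuple(
            math.floor((sum(x * y for x, y in zip(a, vec)) + b) / self.w)
            for a, b in funcs
        )

    def add(self, doc_id, vec):
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, vec)].add(doc_id)

    def candidates(self, vec):
        # Union of the buckets that vec falls into, one bucket per table.
        found = set()
        for funcs, buckets in self.tables:
            found |= buckets.get(self._key(funcs, vec), set())
        return found


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def assign_incrementally(index, vectors, clusters, doc_id, vec, max_dist=1.0):
    """Put a new document in the cluster of its nearest LSH candidate,
    or open a new cluster when no candidate is within max_dist."""
    best = min(index.candidates(vec),
               key=lambda d: euclidean(vectors[d], vec),
               default=None)
    if best is not None and euclidean(vectors[best], vec) <= max_dist:
        clusters[doc_id] = clusters[best]
    else:
        clusters[doc_id] = max(clusters.values(), default=-1) + 1
    vectors[doc_id] = vec
    index.add(doc_id, vec)
    return clusters[doc_id]
```

Only documents that share a bucket with the new document are compared exactly, which is where the efficiency gain comes from. Widening `bucket_width` or adding tables raises recall at the cost of more comparisons.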
Keywords/Search Tags: Named Entity Recognition, Named Entity Disambiguation, Clustering Disambiguation of Person Name, Entity Linking Disambiguation, Entity Classification, Named Entity Recognition Corpus, Semantic Resource