A Research On Term Entity Linking In Scientific And Technical Report Based On Multi-knowledge Base

Posted on: 2018-10-10; Degree: Master; Type: Thesis
Country: China; Candidate: G Q Chen
GTID: 2348330518475828; Subject: Information Science
Abstract/Summary:
Scientific and technical reports are an important class of document resource, and mining and analyzing them in depth is of considerable value. However, current research on scientific and technical reports is still largely confined to basic concepts, the definition of attributes, and the construction of scientific and technical reports. There is little research on in-depth mining and analysis of report content. Scientific and technical reports contain a large number of professional term entities, which are usually the research subjects of the reports and reflect the current state and future trends of China's science and technology development. Mining the content of these reports and identifying their professional term entities is therefore of great significance for promoting scientific and technological innovation and the popularization of science and technology. As a key technology of natural language processing, entity recognition can automatically identify entities with specific meanings in text, such as person names, locations, and organization names, and extending its application makes it possible to identify professional term entities automatically. This thesis takes the scientific and technical report as its research object. First, I use new word discovery to find potential new term words in the reports that are not yet registered in the lexicon. Second, I use web crawlers, databases, and other information technologies to build a professional term knowledge base that supports term entity recognition and linking. Finally, I use the Stanford NER entity recognition framework, which is based on conditional random fields, to automatically recognize technical term entities in the reports and link them to the knowledge base to eliminate ambiguity.
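The linking step that closes this pipeline can be illustrated with a minimal sketch. The abstract does not say how a recognized term is compared against knowledge base entries, so the sketch assumes a simple stand-in: cosine similarity between bag-of-words vectors of the mention's context and each candidate's description. The function names, the knowledge base layout, and the sample entries are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link_mention(mention, context_tokens, knowledge_bases):
    """Pick the best-matching entry for a mention across several knowledge bases.

    knowledge_bases maps a knowledge base name to {term: [entry, ...]}, where each
    entry has an "id" and a tokenized "description" (hypothetical layout).
    Returns ((kb_name, entry_id), score), or (None, 0.0) if nothing matches.
    """
    context = Counter(context_tokens)
    best, best_score = None, 0.0
    for kb_name, entries in knowledge_bases.items():
        for entry in entries.get(mention, []):
            score = cosine(context, Counter(entry["description"]))
            if score > best_score:
                best, best_score = (kb_name, entry["id"]), score
    return best, best_score


# Hypothetical data: one ambiguous term listed in two knowledge bases.
kbs = {
    "kb_computing": {
        "kernel": [{"id": "kernel_os",
                    "description": ["operating", "system", "core", "process"]}],
    },
    "kb_agriculture": {
        "kernel": [{"id": "kernel_seed",
                    "description": ["seed", "grain", "plant"]}],
    },
}

best, score = link_mention(
    "kernel", ["operating", "system", "process", "scheduling"], kbs)
print(best, score)  # ('kb_computing', 'kernel_os') 0.75
```

A mention with no candidate in any knowledge base, or whose context shares no words with any description, returns `(None, 0.0)`, which a caller could treat as an unlinkable term.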
The main research work is as follows:
(1) In view of the existing problems of Chinese word segmentation and the characteristics of term entities in scientific and technical reports, I propose a new word discovery method based on part-of-speech combination. Existing word segmentation tools are first used for segmentation and part-of-speech tagging, word strings are then extracted according to rules, and new words are finally determined from the support of each word string together with internal and external features such as word length and mutual information. New technical term words are found effectively, which to a certain extent improves the accuracy of Chinese word segmentation and lays the foundation for term entity recognition.
(2) Build a professional term knowledge base. Entity recognition requires a large corpus as support: entity features are extracted from the training corpus in order to identify entities automatically. Because open terminology resources for scientific and technical reports are lacking, this thesis designs and constructs the term knowledge base using information technologies such as web crawlers and databases, together with the data source provided by China terminology.
(3) Discuss the mainstream methods of entity recognition and choose the mature open source Stanford NER entity recognition framework, which is based on the CRF model. A term entity model is trained to automatically recognize entities in scientific and technical reports, and the multiple knowledge bases are then combined with the semantics of the term entities to eliminate ambiguity.
(4) Select scientific and technical reports published by the National Science and Technology Report Service System as the experimental data, and design and develop a prototype entity linking system based on multiple knowledge bases. The system integrates data preprocessing, new word discovery, entity recognition, and entity linking, realizes the automatic recognition and disambiguation of term entities, and verifies the correctness and validity of the method.
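The new word discovery step in item (1) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the part-of-speech patterns, the thresholds, the tag names ("n", "v", "a", "u"), and the restriction to two-token candidates are all hypothetical, and pointwise mutual information is used as one plausible reading of "mutual information".

```python
import math
from collections import Counter

# Hypothetical part-of-speech combinations allowed to form a candidate term.
DEFAULT_PATTERNS = {("n", "n"), ("a", "n"), ("v", "n")}


def discover_new_words(tagged, patterns=DEFAULT_PATTERNS,
                       min_support=2, max_len=8):
    """Score adjacent token pairs as candidate new words.

    tagged is a list of (word, pos) pairs from an existing segmenter.
    A pair is kept if its POS combination is allowed, it occurs at least
    min_support times, and the joined string is at most max_len characters.
    Returns {candidate: pointwise mutual information in bits}.
    """
    unigrams = Counter(word for word, _ in tagged)
    bigrams = Counter(zip(tagged, tagged[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    scores = {}
    for ((left, left_pos), (right, right_pos)), count in bigrams.items():
        candidate = left + right
        if (left_pos, right_pos) not in patterns:
            continue  # rule-based filter on the POS combination
        if count < min_support or len(candidate) > max_len:
            continue  # support and word-length filters
        p_pair = count / total_bi
        p_left = unigrams[left] / total_uni
        p_right = unigrams[right] / total_uni
        scores[candidate] = math.log2(p_pair / (p_left * p_right))
    return scores


# Hypothetical segmenter output for a short passage.
tagged = [("知识", "n"), ("图谱", "n"), ("的", "u"), ("构建", "v"),
          ("依赖", "v"), ("知识", "n"), ("图谱", "n"), ("技术", "n")]

for word, pmi in discover_new_words(tagged).items():
    print(word, round(pmi, 2))
```

In this sample only "知识图谱" passes all three filters; "图谱技术" matches the noun-noun pattern but occurs once, below the assumed support threshold. A fuller version would also handle candidates longer than two tokens and apply a minimum mutual information cutoff.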
Keywords/Search Tags:scientific and technical report, technical term, entity link, knowledge base construction, semantic similarity