Research On Technology And Terminology Recognition Oriented Specific Science Domains

Posted on: 2021-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: L L Feng
Full Text: PDF
GTID: 2428330605474773
Subject: Computer Science and Technology
Abstract/Summary:
In recent years, a large amount of public information with great application value has become available on the Internet, and patents, technical reports, and news in specific science domains (especially national defense science) contain a great deal of valuable scientific and technological information. Extracting this information quickly and effectively benefits the development of national defense science. Entities (such as technologies and terminology) in specific science domains are the foundation of information extraction and can support downstream tasks such as science and technology knowledge graph construction and relation extraction. Although named entity recognition has achieved success and is widely applied in various domains (e.g., the biomedical domain), technology and terminology recognition oriented to specific science domains, which currently has no corpus resources, differs significantly from recognition in the general and biomedical domains. Therefore, this thesis focuses on technology and terminology recognition oriented to specific science domains, and the main research contents include the following three aspects:

(1) To address the lack of corpus resources in specific science domains, this thesis constructs a technology and terminology corpus oriented to specific science domains. First, we analyze the characteristics of defense science text and design annotation guidelines for technology and terminology in specific science domains, based on large-scale internet content about a list of emerging defense science technologies defined in Wikipedia. Then, based on the annotation guidelines, we carry out large-scale corpus annotation and construct a technology and terminology corpus that covers three genres: news, literature (such as papers and patents), and Wikipedia. Next, we conduct quantitative statistics and quality analysis on this corpus, which contains 479 articles with 24,487 sentences and 33,756 technologies and terminologies. The corpus also achieves good consistency in the annotation check. Finally, a comparison of our corpus with other public corpora shows that it can be applied to research on technology and terminology recognition oriented to specific science domains.

(2) To address the problem that traditional word features can hardly represent the features of technologies and terminology, this thesis proposes a technology and terminology recognition method based on subwords and linguistic features. First, we explore the application of subword features in the traditional Bi-LSTM+CRF sequence labeling model. In addition, we incorporate linguistic features suited to technology and terminology recognition into the model to boost performance. Experimental results on the annotated dataset show that our approach achieves an F1 score of 71.8%, an improvement of 3.04% over the benchmark system, indicating the effectiveness of our approach in recognizing technology and terminology oriented to specific science domains.

(3) To further enhance feature fusion between subwords and words, this thesis proposes a technology and terminology recognition method based on a subwords graph network. First, we propose three word-subwords interactive graphs to flexibly capture the connections between words and their subwords. The Word-subwords Inclusion graph (I-graph) helps words capture the semantic information of each subword. The Word-subwords Triangle graph (T-graph) helps words capture the overall semantic information of their subwords. The Word-subwords Context graph (C-graph) obtains the contextual information of words and helps words capture the semantic information of the nearest contextual subwords. Then, we use a graph attention network to model the three word-subwords interactive graphs. Finally, experimental results on the annotated dataset show that the F1 scores of the three word-subwords interactive graphs, which enhance the semantic expression of words by utilizing subwords, improve by 1.57%, 1.82%, and 0.53% over the benchmark system, respectively.

This thesis constructs a technology and terminology corpus oriented to specific science domains and proposes effective methods of technology and terminology recognition. Meanwhile, this thesis explores the application of graph structures in technology and terminology recognition. This work lays a foundation for further research on information extraction in specific science domains.
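The word-subwords graphs in aspect (3) are described only in general terms, so the following is a minimal illustrative sketch, not the thesis's implementation. It assumes a toy fixed-length character splitter in place of a real subword tokenizer. It also assumes that an I-graph edge links a word to each of its own subwords, and that a C-graph edge links a word to the adjacent subword of each neighboring word. The names `split_subwords` and `build_graphs` are hypothetical. The T-graph is omitted because the abstract does not give enough detail to reconstruct it, and the graph attention network is not modeled here.

```python
from typing import List, Set, Tuple

Node = Tuple  # ("w", word_index) or ("s", word_index, subword_index)
Edge = Tuple[Node, Node]


def split_subwords(word: str, size: int = 3) -> List[str]:
    """Toy splitter (assumption): fixed-size character chunks, standing in for a learned subword tokenizer."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def build_graphs(words: List[str], size: int = 3):
    """Return the subwords per word plus assumed I-graph and C-graph edge sets."""
    subwords = [split_subwords(w, size) for w in words]
    inclusion_edges: Set[Edge] = set()
    context_edges: Set[Edge] = set()
    for i, pieces in enumerate(subwords):
        # I-graph (assumed form): the word is linked to every one of its own subwords.
        for j in range(len(pieces)):
            inclusion_edges.add((("w", i), ("s", i, j)))
        # C-graph (assumed form): the word is linked to the nearest subword
        # of the previous word and of the next word.
        if i > 0:
            context_edges.add((("w", i), ("s", i - 1, len(subwords[i - 1]) - 1)))
        if i + 1 < len(words):
            context_edges.add((("w", i), ("s", i + 1, 0)))
    return subwords, inclusion_edges, context_edges


subwords, inclusion_edges, context_edges = build_graphs(["phased", "array", "radar"])
```

Edge sets of this kind could then be passed as adjacency information to a graph attention layer, with word and subword embeddings as node features.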
Keywords/Search Tags: Technology and Terminology Recognition, Specific Science Domains, Corpus, Subwords