
Software Entity Recognition Method Based On Deep Learning

Posted on: 2022-09-05
Degree: Master
Type: Thesis
Country: China
Candidate: C Sun
GTID: 2518306488460304
Subject: Computer software and theory
Abstract/Summary:
As software development has become networked and widespread, social programming sites and software knowledge communities have grown rapidly. These websites contain a great deal of valuable information about software development. For example, users can search them for discussions of specific software entities (such as libraries, tools, and APIs) to find solutions to problems that arise during software development and tool usage. As a result, these sites have accumulated a large amount of entity information about software engineering. Mining and studying this information can support entity-centered applications in software engineering, such as question answering systems, machine translation, and text classification, and can also provide a foundation for constructing knowledge graphs in the field. However, current methods for analyzing software engineering text treat software entities in the same way as other content, which hinders the development of entity-centered applications. The main purpose of this thesis is therefore to study methods for identifying and classifying software entities in software engineering texts.

Named Entity Recognition (NER) is one of the most basic tasks in natural language processing. Most traditional NER research focuses on recognizing and classifying proper names, such as person and place names, and meaningful quantitative phrases, such as times and dates. No well-performing entity recognition model has been applied in the software engineering field, and the word vectors used by traditional entity recognition algorithms are static: they cannot represent different uses of the same word according to its context. For software engineering texts, existing entity recognition methods are limited to dictionary lookup based on code or text parsing and to rule-based methods. Because software engineering text often contains inconsistent capitalization and spelling errors, dictionary- and rule-based methods cannot recognize entities well. This thesis therefore constructs a software entity recognition method based on deep learning. The main work is as follows:

(1) The pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) is introduced to address the problem that word vectors produced by traditional word embedding models are static and cannot represent polysemous words, because they ignore context. First, software engineering text is extracted from the official Stack Overflow data dump and preprocessed. Then the BERT model is further pre-trained on this text to obtain a pre-trained language model better suited to the software engineering domain. Finally, when the entity recognition model is built, the feature representation of the input data is obtained from this pre-trained model. Experimental results show that recognition performance improves significantly after the BERT pre-trained language model is introduced.

(2) A graph convolutional network (GCN) is used to fuse syntactic information into the entity recognition model. This strengthens the model's ability to learn features and addresses the problem that word features alone cannot distinguish and classify software entities well, because software engineering text is often written in a nonstandard way. First, the syntactic dependency structure of the labeled data is obtained with a dependency parser. The syntactic information and word embeddings are then used as the model input, and the syntactic information is integrated into the model through an added graph convolutional network to improve recognition. Experimental results show that recognition accuracy improves by about 2% after the graph convolutional network is added.

(3) The existing small annotated data set is expanded to address the scarcity of annotated data in the software engineering field. First, question-and-answer data from the field is collected and an entity classification dictionary for the field is constructed. The data is then annotated by matching it against the dictionary. In parallel, an entity recognition model trained on the small annotated data set is used to predict annotations for the same data. Finally, the annotation results of the two methods are merged and checked manually. Compared with fully manual annotation, this approach greatly reduces the workload and improves labeling efficiency.
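
The graph convolution step in (2) can be illustrated with a small sketch. The abstract does not give the layer formula, the parser, or the dimensions, so everything here is an assumption: a single layer of the commonly used form ReLU(D^-1 (A + I) H W), applied to a toy dependency tree over four tokens. The helper names, the example sentence, the head indices, and the feature values are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dependency_adjacency(heads: List[int]) -> Matrix:
    """Build a symmetric adjacency matrix with self-loops.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    """
    n = len(heads)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child][head] = 1.0
            adj[head][child] = 1.0
    return adj


def gcn_layer(adj: Matrix, features: Matrix, weight: Matrix) -> Matrix:
    """One graph convolution: ReLU(D^-1 A H W), where A includes self-loops."""
    # Row-normalize so each token averages over itself and its neighbours.
    normalized = [[v / sum(row) for v in row] for row in adj]
    hidden = matmul(matmul(normalized, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical sentence: "install the numpy package".
# Assumed heads: "install" is the root, "the" and "numpy" attach to "package",
# and "package" attaches to "install".
heads = [-1, 3, 3, 0]
# Stand-in 2-dimensional word features (in the thesis these come from BERT).
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
# Identity weight keeps the arithmetic easy to follow; real weights are learned.
weight = [[1.0, 0.0], [0.0, 1.0]]

output = gcn_layer(dependency_adjacency(heads), features, weight)
print(output)
```

Each output row mixes a token's own features with those of its syntactic neighbours, which is how dependency structure can reach a token-level classifier even when the surface form is misspelled or oddly capitalized.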
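
The semi-automatic annotation in (3) can be sketched as follows. The abstract does not say how the two annotation sources are combined, so this sketch assumes a simple rule: tokens on which the dictionary and the model agree are accepted, and any disagreement is flagged for manual checking. The dictionary entries, the label names, and the stand-in model predictions are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical entity dictionary: lower-cased surface form -> entity type.
ENTITY_DICT: Dict[str, str] = {
    "numpy": "LIBRARY",
    "git": "TOOL",
    "python": "LANGUAGE",
}


def dictionary_labels(tokens: List[str]) -> List[str]:
    """Label tokens by case-insensitive dictionary lookup ("O" = no entity)."""
    return [ENTITY_DICT.get(tok.lower(), "O") for tok in tokens]


def merge_annotations(
    dict_labels: List[str], model_labels: List[str]
) -> List[Tuple[str, bool]]:
    """Combine both sources into (label, needs_review) pairs, one per token.

    Assumed rule: agreement is accepted as-is; on disagreement the dictionary
    label is kept if it names an entity, otherwise the model label, and the
    token is flagged for manual checking.
    """
    merged = []
    for d, m in zip(dict_labels, model_labels):
        if d == m:
            merged.append((d, False))
        else:
            merged.append((d if d != "O" else m, True))
    return merged


tokens = ["Install", "NumPy", "with", "pip"]
# Stand-in for predictions from a model trained on the small annotated set.
model_labels = ["O", "LIBRARY", "O", "TOOL"]

merged = merge_annotations(dictionary_labels(tokens), model_labels)
for token, (label, needs_review) in zip(tokens, merged):
    print(token, label, "REVIEW" if needs_review else "")
```

Under this rule only "pip" is flagged, since the hypothetical dictionary does not contain it while the stand-in model labels it as a tool. A human annotator then reviews the flagged tokens instead of the whole sentence.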
Keywords/Search Tags: Software engineering, Named Entity Recognition (NER), BERT, Graph Convolutional Network (GCN)