Research And Application Of New Words Detection For Programming Domain

Posted on: 2022-08-27
Degree: Master
Type: Thesis
Country: China
Candidate: M Liu
GTID: 2518306494981239
Subject: Software engineering
Abstract/Summary:
Programming domain texts are abundant and contain many domain-specific words. Jieba word segmentation performs well in the general domain, but because some programming domain words are missing from its segmentation dictionary, its accuracy on programming domain text is low. If a new words detection algorithm can extract these domain words from domain texts, the domain vocabulary can be expanded, which in turn improves Chinese word segmentation in the programming domain. In addition, organizing the otherwise scattered domain words as a knowledge graph can help programming learners study more efficiently and systematically. In new words detection research, the commonly used supervised methods require a large amount of labeled data, while unsupervised methods often have low precision. This paper therefore focuses on new words detection algorithms for the programming domain.

Firstly, a corpus of programming problem-solving reports is constructed. This paper uses web crawler technology to collect problem-solving reports from blogs, communities and other websites, and preprocesses the data for the new words detection task, addressing the lack of an open, standardized text data set in the programming domain. To promote research on related tasks in the programming domain, this paper will make the problem-solving reports public.

Secondly, several commonly used new words detection algorithms are discussed. To address the problem that statistics-based methods and word embedding based methods return too many garbage word strings, a method combining statistics and word embedding is proposed, which improves the precision of new words detection. The experimental results show that this method works well for one kind of words, those that rarely appear in other domains but often appear in the programming domain, and works poorly for another kind, those that also appear in other domains but have a special meaning in the programming domain. For the second kind of domain words, the existing phrase quality estimation method (Class Phrase) can detect them effectively, which compensates for the shortcoming of the combined statistics and word embedding method. However, when label quality is poor, it is difficult for the Class Phrase method to train an effective model. To improve label quality, this paper proposes using distant supervision to generate labels for classification model training according to the domain vocabulary. The experimental results show that this method achieves good results.

Then, this paper uses crawler technology to collect information about the domain vocabulary, such as word definitions and title numbers, organizes this information as a knowledge graph, expands the knowledge graph with the domain new words found by the new words detection algorithm, and stores the result in the Neo4j graph database.

Finally, a new words detection and query system is implemented, with different functions for different user identities. The main functions include uploading user-defined files for new words detection, and knowledge graph expansion and query.
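The distant supervision step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the actual pipeline, so everything here is an assumption made for illustration: the function names `candidate_ngrams` and `distant_labels`, the `min_count` threshold, the whitespace-tokenized English sample, and the rule that a candidate is labeled positive only when it appears verbatim in the domain vocabulary.

```python
from collections import Counter


def candidate_ngrams(tokens, max_len=3):
    """Count every contiguous n-gram of 2..max_len tokens (hypothetical helper)."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def distant_labels(candidates, domain_vocab, min_count=2):
    """Label frequent candidates by lookup in an existing domain vocabulary.

    Assumption: a candidate found in the vocabulary is a positive (1),
    any other frequent candidate is a noisy negative (0). Candidates seen
    fewer than min_count times are dropped.
    """
    labeled = {}
    for gram, count in candidates.items():
        if count < min_count:
            continue
        phrase = " ".join(gram)
        labeled[phrase] = 1 if phrase in domain_vocab else 0
    return labeled


# Toy corpus and vocabulary, invented for this example only.
tokens = ("binary search on a segment tree then "
          "binary search on a segment tree").split()
domain_vocab = {"binary search", "segment tree"}

labels = distant_labels(candidate_ngrams(tokens), domain_vocab)
positives = sorted(p for p, y in labels.items() if y == 1)
print(positives)
```

Labels produced this way would then serve as training data for a phrase classifier. The negatives are noisy by construction, since a real domain word that is missing from the vocabulary is still labeled 0.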
Keywords/Search Tags: new words detection, programming, distant supervision, classification model, knowledge graph