Research On Side Information-based Detection Technology For Code Snippets

Posted on: 2017-05-24
Degree: Master
Type: Thesis
Country: China
Candidate: L C Li
GTID: 2308330485471009
Subject: Computer technology
Abstract/Summary:
With the popularity of Community Question-Answering (CQA) sites, blogs, and similar platforms, programmers can freely exchange programming techniques, so more and more source code snippets that solve various problems appear on the web. Unlike plain text, a code snippet carries structural information that can help search engines retrieve source code written in a specific programming language. However, this structural information cannot be analyzed without knowing the programming language of the snippet. Through statistics, we empirically found that more than half of the source code snippets on the web are not labeled with programming language tags, which makes programming language detection for code snippets an urgent problem. Code snippets are often incomplete and therefore may not provide enough information on their own to identify the programming language. The accuracy of programming language detection can be improved by combining the side information of code snippets, such as tags and descriptions. To make full use of this side information, we propose a side information-based programming language detection framework for code snippets and implement a system prototype. Our main contributions can be summarized as follows:

1. We propose a novel side information-based programming language detection framework for code snippets. The framework first recommends tags for code snippets based on the descriptions around them, then trains a detection model using the completed tags, and finally identifies the programming language with the trained model. Through the rational use of side information, the framework better predicts the programming language of code snippets and addresses the problem of low identification accuracy.

2. We propose a novel text keyword-enhanced multi-label learning method for tag recommendation. Because many tags already appear in the content, this method combines multi-label learning with keyword extraction to improve tag recommendation accuracy. We further propose a fast version that employs a locality-sensitive hashing strategy to reduce the time complexity of the algorithm.

3. We propose a tag information-based programming language detection method for code snippets. This method formulates programming language identification for code snippets as a text classification problem, and combines tag extraction with Bayesian classification to identify the programming language of code snippets, which improves identification accuracy.

4. We design and implement a system prototype based on the above methods, which preliminarily verifies their effectiveness and feasibility.
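
The flow described in contributions 1 and 3 (derive tags from the surrounding description, then classify the snippet together with its tags) might be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation. The names `recommend_tags`, `features`, and `NaiveBayesDetector`, the `tag:` feature prefix, and the toy training data are all hypothetical. A simple keyword match stands in for the multi-label tag recommendation method. The classifier is a plain multinomial Naive Bayes with Laplace smoothing, chosen as one common form of Bayesian text classification.

```python
import math
import re
from collections import Counter, defaultdict

# Identifiers, integer literals, and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def recommend_tags(description, known_tags):
    """Stand-in for tag recommendation: keep known tags mentioned in the text."""
    words = set(re.findall(r"[a-z0-9+#]+", description.lower()))
    return [t for t in known_tags if t in words]


def features(code, tags=()):
    """Tokenize the snippet and append each tag as a separate 'tag:' feature."""
    return TOKEN_RE.findall(code) + ["tag:" + t.lower() for t in tags]


class NaiveBayesDetector:
    """Multinomial Naive Bayes over code tokens plus tag features (assumed model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.doc_counts = Counter()               # training samples per language
        self.token_counts = defaultdict(Counter)  # feature counts per language
        self.vocab = set()

    def fit(self, samples):
        for code, tags, lang in samples:
            feats = features(code, tags)
            self.doc_counts[lang] += 1
            self.token_counts[lang].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, code, tags=()):
        feats = features(code, tags)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang, n_docs in self.doc_counts.items():
            counts = self.token_counts[lang]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for f in feats:
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data (hypothetical): (snippet, tags, language).
TRAIN = [
    ("def add(a, b):\n    return a + b", ["python"], "Python"),
    ("import os\nprint(os.getcwd())", ["python"], "Python"),
    ("public static void main(String[] args) { System.out.println(1); }",
     ["java"], "Java"),
    ("int x = list.size();", ["java", "arraylist"], "Java"),
]
KNOWN_TAGS = ["python", "java", "arraylist"]

detector = NaiveBayesDetector().fit(TRAIN)

snippet = "x = y + 1"  # too short to identify from code alone
tags = recommend_tags("How do I increment a counter in Java?", KNOWN_TAGS)
print(tags, detector.predict(snippet, tags))
```

On this toy data, the same incomplete snippet is labeled Java when the description yields the tag `java` and Python when the tag is `python`. This mirrors the abstract's argument that side information helps when the code alone is ambiguous.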
Keywords/Search Tags:Programming Language Detection, Code Snippet, Side Information, Tag Recommendation