Research On The Query Expansion Methods For Code Search

Posted on:2019-05-17

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J Huang

Full Text:PDF

GTID:1368330545999888

Subject:Computer software and theory

Abstract/Summary:

As code repositories (e.g., Codeplex, Github, Sourceforge) have become available, code search has become a common activity during software development. Its performance depends strongly on word matching between a query and the query results. This dissertation therefore focuses on query expansion (QE) methods, which first retrieve a fixed number of top-ranked expansion documents for a given query, extract useful expansion terms from those documents, reformulate the original query with the expansion terms, and then retrieve the final set of query results.

However, all existing QE methods neglect the nature of source code (e.g., structural syntax and program semantics). They treat source code and expansion sources as plain text, directly copying methods designed for natural language. Most commonly, they judge the usefulness of an expansion term solely from word-level statistics, for example using TF-IDF to measure the importance of expansion terms or co-occurrence to measure the dependence between an expansion term and a query. This leads to three pressing problems: "single query expansion", "over expansion" and "coarse semantic understanding". First, existing methods consider only one textual expansion source at a time. Second, they expand a query with many irrelevant expansion terms, which degrades retrieval performance, so the query results cannot directly meet the user's search needs and have to be modified. Third, they cannot distinguish the semantic meanings of different orderings of code terms. To overcome these three problems, this dissertation exploits three novel expansion sources and proposes three novel QE methods, as follows.

To solve the problem of single query expansion, this dissertation exploits a new comprehensive textual expansion source, Github Knowledge (GK), which covers both Crowd Knowledge (CK) and APIs. GK is extracted from the "Pull requests" of code repositories on Github and contains the descriptions of a request and its commits, participants' comments and the API information of changed files. Based on GK, the dissertation proposes a novel query expansion method (QEGK). It also proposes a QE method based on SVM Ranking (QESR) to integrate QE methods. The empirical evaluation shows that QESR outperforms the state-of-the-art QE methods CodeHow and QECK by 8-15% in terms of precision when the first query result is inspected.

To solve the problem of over expansion, this dissertation considers the nature of source code and exploits a novel non-textual expansion source directly related to code, called evolving contexts. An evolving context contains code changes (new and deleted code terms) and their dependent terms from the code evolution. On this basis, the dissertation proposes a QE method based on evolving contexts (QEEC). This method collects a large number of evolving contexts from code evolutions and applies the EM algorithm, a machine learning technique, to train an inference model. Using the model, the method can predict the subsequent code changes a user will make after obtaining the retrieved code snippets. From those changes, it chooses the new code terms as relevant expansion terms and the deleted code terms as irrelevant ones, and expands the query accordingly, i.e., it adds the relevant terms to the query and excludes the irrelevant terms from it. The empirical evaluation shows that QEEC outperforms the state-of-the-art query expansion methods CodeHow and QECK by 11-18% and improves the precision of the code search algorithms IR, Portfolio and VF by up to 37-52% when the first result is inspected.

To solve the problem of coarse semantic understanding, this dissertation builds on evolving contexts to exploit a novel non-textual expansion source, called the semantics of change sequences, and proposes a QE method based on the semantics of change sequences (QESC). This method collects a large number of change sequences and employs DBN, a deep learning model, to learn their semantics. Based on semantic similarity, the method can predict the subsequent code changes a user will make after obtaining the retrieved code snippets, and it reformulates the query with the new and deleted code terms from those changes. The experimental results show that QESC outperforms the state-of-the-art QE methods by 14-21% in terms of precision when the first query result is inspected.
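
The generic loop described in the abstract (retrieve, extract terms, reformulate, retrieve again) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The word-overlap scoring, the TF-IDF term weighting (the purely statistical baseline that the abstract criticises), the function names and the toy corpus are all assumptions chosen for illustration. The hypothetical `excluded` argument only hints at how irrelevant terms might be penalised, in the spirit of adding relevant terms and excluding irrelevant ones.

```python
import math
import re
from collections import Counter

# Toy corpus (assumed for illustration only).
DOCS = [
    "read file buffered reader",
    "read file buffered reader close",
    "sort list comparator",
    "open socket stream",
    "parse json string",
]


def tokenize(text):
    # Lowercase, then split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def score(query_terms, doc_terms, excluded=frozenset()):
    # Word-match score: distinct query terms found in the document,
    # minus distinct excluded terms found in it.
    terms = set(doc_terms)
    return len(terms & set(query_terms)) - len(terms & set(excluded))


def retrieve(query_terms, corpus, k, excluded=frozenset()):
    # Indices of the k best-scoring documents (ties keep corpus order).
    ranked = sorted(range(len(corpus)),
                    key=lambda i: -score(query_terms, corpus[i], excluded))
    return ranked[:k]


def expansion_terms(query_terms, corpus, top_docs, n_terms):
    # Weight candidate terms by TF-IDF: tf is counted in the expansion
    # documents, idf over the whole corpus. Query terms are skipped.
    n_docs = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    tf = Counter(t for i in top_docs for t in corpus[i])
    weights = {t: tf[t] * math.log(n_docs / df[t])
               for t in tf if t not in query_terms}
    # Highest weight first; alphabetical order breaks ties.
    return sorted(weights, key=lambda t: (-weights[t], t))[:n_terms]


def expand_and_search(query, documents, k_docs=2, n_terms=2, k_results=3):
    corpus = [tokenize(d) for d in documents]
    q = tokenize(query)
    top = retrieve(q, corpus, k_docs)                 # initial retrieval
    extra = expansion_terms(q, corpus, top, n_terms)  # term extraction
    expanded = q + extra                              # reformulation
    return expanded, retrieve(expanded, corpus, k_results)  # final retrieval
```

With this toy corpus, `expand_and_search("read file", DOCS)` expands the query with the terms `buffered` and `reader` before the final retrieval. Passing `excluded={"close"}` to `retrieve` lowers the score of any document containing `close`.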

Keywords/Search Tags:

code search, free-form text, query expansion, code changes, deep learning

Related items

1	Code Search With Free-form Text As Input
2	A source code search engine for keyword based structural relationship search
3	Research On Text-oriented Code Search
4	Research On Code Function Description And Code Search Method Based On Deep Learning
5	Research On Code Segment Search Method For Open Source Ecology
6	A Code Description Semantics Vector Based Java Code Search
7	Research On The Method Of Generating Code Fragments In Response To Free-form Queries
8	Research On Code Search Technology Based On Features Of Code And Comment
9	Research On Natural Language Code Search: Benchmark, Empirical Study, and New Method
10	Research On Channel Decoding Algorithm Based On Deep Learning