Research On The Query Expansion Methods For Code Search

Posted on: 2019-05-17
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Huang
GTID: 1368330545999888
Subject: Computer software and theory
Abstract/Summary:
As code repositories (e.g., CodePlex, GitHub, SourceForge) have become available, code search has become a common activity during software development. Its performance depends strongly on word matching between a query and the query results. This dissertation therefore focuses on query expansion (QE), which initially retrieves a fixed number of top-ranked expansion documents for a given query, extracts useful expansion terms from those documents, reformulates the original query with the expansion terms, and retrieves the final set of query results.

However, all existing QE methods neglect the nature of source code (e.g., structural syntax and program semantics). They treat source code and expansion sources as plain text and simply copy methods designed for natural language. The most common practice is to judge the usefulness of expansion terms from word-level statistics alone, for example using TF-IDF to measure the importance of an expansion term or co-occurrence to measure the dependence between an expansion term and a query. This leads to three urgent problems: "single query expansion", "over expansion" and "coarse semantic understanding". First, existing methods consider only one textual expansion source at a time. Second, they expand a query with many irrelevant expansion terms, which degrades retrieval performance, so the query results do not directly meet the user's search needs and have to be modified. Third, they cannot distinguish the semantic meanings of code terms that appear in different orders. To overcome these three problems, this dissertation exploits three novel expansion sources and proposes three novel QE methods, as follows.

To solve the problem of single query expansion, this dissertation exploits a new comprehensive textual expansion source, GitHub Knowledge (GK), which covers both Crowd Knowledge (CK) and APIs. GK is extracted from the "Pull requests" of code repositories on GitHub and contains the descriptions of a request and its commits, the participants' comments, and the API information of the changed files. On this basis, the dissertation proposes a novel query expansion method based on GK (QEGK). It also proposes a QE method based on SVM Ranking (QESR) to integrate multiple QE methods. Our empirical evaluation shows that QESR outperforms the state-of-the-art QE methods CodeHow and QECK by 8-15% in terms of precision when the first query result is inspected.

To solve the problem of over expansion, this dissertation considers the nature of source code and exploits a novel non-textual expansion source, called evolving contexts, that is directly related to code. An evolving context contains code changes (new and deleted code terms) and their dependent terms taken from the code evolution. On this basis, the dissertation proposes a QE method based on evolving contexts (QEEC). The method collects a large number of evolving contexts from code evolutions and applies the EM algorithm, a machine learning technique, to train an inference model. Using this model, the method predicts the subsequent code changes a user will make after obtaining the retrieved code snippets. From those changes, it chooses the new code terms as relevant expansion terms and the deleted code terms as irrelevant ones, and expands the query accordingly: relevant terms are added to the query and irrelevant terms are excluded from it. Our empirical evaluation shows that QEEC outperforms the state-of-the-art query expansion methods CodeHow and QECK by 11-18% and improves the precision of the code search algorithms IR, Portfolio and VF by up to 37-52% when the first result is inspected.

To solve the problem of coarse semantic understanding, this dissertation builds on evolving contexts to exploit a novel non-textual expansion source, called the semantics of change sequences, and proposes a QE method based on the semantics of change sequences (QESC). The method collects a large number of change sequences and employs DBN, a deep learning model, to learn their semantics. Based on semantic similarity, the method predicts the subsequent code changes a user will make after obtaining the retrieved code snippets, and reformulates the query with the new and deleted code terms from those changes. Our experimental results show that QESC outperforms the state-of-the-art QE methods by 14-21% in terms of precision when the first query result is inspected.
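
The generic expansion loop described at the start of the abstract (retrieve top-ranked documents, extract expansion terms, reformulate, retrieve again) can be illustrated with a minimal sketch. This is not QEGK, QEEC or QESC: it is plain pseudo-relevance feedback with TF-IDF term scoring, and the irrelevant terms are passed in by hand instead of being predicted by a trained model. All function names and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + c)) + 1.0 for term, c in df.items()}


def score(query_terms, doc, idf):
    """Sum of TF-IDF weights of the query terms that occur in the document."""
    tf = Counter(tokenize(doc))
    return sum(tf[t] * idf.get(t, 0.0) for t in query_terms)


def retrieve(query_terms, docs, idf, k):
    """Indices of the k best-scoring documents (only positive scores)."""
    scored = [(score(query_terms, d, idf), i) for i, d in enumerate(docs)]
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [i for _, i in ranked[:k]]


def expand_query(query, docs, k=2, n_terms=2, irrelevant=()):
    """Expand a query from its top-k results, then retrieve again.

    `irrelevant` is a hand-supplied stand-in for terms that a learned
    model would mark as irrelevant; they are never added and are
    removed from the query if present.
    """
    idf = build_idf(docs)
    original = tokenize(query)
    top = retrieve(original, docs, idf, k)

    # Weight each candidate term by its TF-IDF mass in the top-k documents.
    weights = Counter()
    for i in top:
        for term, tf in Counter(tokenize(docs[i])).items():
            if term not in original:
                weights[term] += tf * idf[term]

    drop = set(irrelevant)
    added = [t for t, _ in weights.most_common() if t not in drop][:n_terms]
    expanded = [t for t in original if t not in drop] + added
    return expanded, retrieve(expanded, docs, idf, k)


# Toy corpus (invented for this sketch).
DOCS = [
    "read file lines into list using BufferedReader",
    "open file stream and read bytes with FileInputStream",
    "sort list of integers with Collections sort",
    "parse json string into object",
]

expanded, results = expand_query("read file", DOCS, irrelevant=("lines",))
print(expanded, results)
```

Scoring candidates purely by TF-IDF is exactly the word-statistics practice the abstract criticizes, so the sketch shows the baseline that the three proposed methods aim to improve on, not the methods themselves.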
Keywords/Search Tags: code search, free-form text, query expansion, code changes, deep learning