Search-Based Programming

Posted on: 2016-05-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D Qiu
Full Text: PDF
GTID: 1108330488457715
Subject: Computer software and theory
Abstract/Summary:
Programming is fundamental to software engineering and is inseparable from the support of programming languages. Inspired by the Sapir-Whorf hypothesis, which holds that the language people speak influences their thought, many computer scientists believe that "programming language also influences programmers' thoughts". In recent decades, a large number of programming languages have been created, each attempting to influence programmers through its design philosophy. Unlike natural languages, however, programming languages are designed by humans. Most of their designs have been artistic, driven by aesthetic concerns and the intuitions of language architects, and they may not capture what programmers really think. As a result, most programming languages are not widely used by programmers. The reason is that such a one-way influence, in which the design of a programming language shapes programmers' behavior, is liable to induce cognitive bias. We hold that programmers should be strongly influenced by what programming languages do, and that the design of programming languages should be strongly influenced by what programmers want to do. We hope to build this bidirectional influence by letting programmers' feedback shape the design of programming languages.

Programmers' thoughts are reflected in their behavior, i.e. how they employ a programming language during a programming task. Because plenty of software resources, especially open-source repositories, are publicly available to researchers, it is possible to understand empirically how programming languages are used in practice. In this dissertation, we study the usage of programming languages quantitatively and qualitatively, and apply the "thoughts" extracted from the usage statistics to influence the programming languages in turn, i.e. to optimize their designs via a data-driven approach. Specifically, in the scenario of employing a single programming language, we analyze its usage at the lexical, syntactic and API (Application Programming Interface) levels; in the scenario of employing multiple languages, we investigate the phenomenon of their co-evolution. The contributions of this dissertation are as follows:

(1) We discuss the lexical distinguishability of programming languages and discover that "a unit of code is lexically distinguishable". We define and formalize the MINSET problem for rigorously testing the Wheat-Chaff hypothesis. We prove that MINSET is NP-hard and provide a greedy algorithm to solve it. We validate our central hypothesis, that source code contains much chaff, against a large (100M LOC), diverse corpus of real-world Java programs, and find that a unit of code can be uniquely identified by a small distinguishing subset. On this basis, we explore several possible applications, including code search, code summarization and keyword-based programming.

(2) We study how programmers employ syntactic rules in practice. Through a comprehensive empirical study of 5,000 Java open-source projects, we address the following research questions: how Java syntactic rules are used in practice, how their usage changes over time, and how strongly rule usage depends on context. We discover that the usage of syntax rules is Zipfian; that their usage exhibits nontrivial contextual dependency; that some rules are gradually discarded by programmers; and that newly added rules do affect the use of existing related rules. On this basis, we explore several possible applications, including language syntax design and restriction, identification of syntactic sugar, and code recommendation and completion based on syntactic contextual dependency.

(3) We study how programmers employ language APIs in practice. We define a collection of systematic metrics of API usage, including frequency, popularity and coverage. Through a comprehensive empirical study of 5,000 Java open-source projects, we analyze the usage of the core API and find some results beyond our expectation: a large number of API entities (including packages, classes, methods and fields) are not fully used in practice. We also analyze the usage of third-party APIs and find that their usage follows a power law. In addition, we investigate three important issues: the usage of deprecated APIs, the utilization of compact profiles (i.e. subsets of the core API), and the adoption of third-party libraries with multiple versions. On this basis, we explore several possible applications, including the optimization and restriction of the API, the construction of leaner compact profiles, the recommendation of libraries and their concrete versions, and API education.

(4) We study the co-evolution of multiple programming languages in use. We select the database application, a typical software system implemented in at least two languages, and analyze in depth, on ten real database applications, how database schemas evolve and how schema changes affect the application code. On this basis, we present some guidelines for developing an automated database application evolution tool to aid the co-evolution of schema and code.

By understanding the usage of programming languages and mining their usage characteristics, the work in this dissertation can positively affect the design of programming languages, motivate new techniques in related applications such as code search and completion, and inspire new programming models, moving toward the ultimate goal of enhancing the capacity of programming.
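
Contribution (1) mentions a greedy algorithm for MINSET, but the abstract does not give its formal definition. The following Python sketch is illustrative only and rests on a stated assumption: a unit of code is modeled as a set of lexemes, and the goal is a small subset of the target's lexemes that no other unit in the corpus fully contains. The function name `greedy_distinguishing_subset` and the toy data are hypothetical, and the dissertation's actual formulation and algorithm may differ.

```python
from typing import Iterable, List, Set


def greedy_distinguishing_subset(target: Set[str],
                                 others: Iterable[Set[str]]) -> List[str]:
    """Greedily pick lexemes of `target` until no unit in `others`
    contains all picked lexemes (assumed reading of MINSET)."""
    remaining = [frozenset(unit) for unit in others]
    # A unit that contains every lexeme of the target can never be excluded.
    for unit in remaining:
        if target <= unit:
            raise ValueError("target is not lexically distinguishable")

    chosen: List[str] = []
    candidates = set(target)
    while remaining:
        # Pick the lexeme missing from the most remaining units;
        # sorting first makes tie-breaking deterministic (alphabetical).
        best = max(sorted(candidates),
                   key=lambda lex: sum(lex not in unit for unit in remaining))
        chosen.append(best)
        candidates.discard(best)
        # Only units that still contain every chosen lexeme remain confusable.
        remaining = [unit for unit in remaining if best in unit]
    return chosen


# Toy corpus (hypothetical): each unit of code is a set of lexemes.
target = {"open", "read", "close", "socket"}
others = [
    {"open", "read", "close", "file"},
    {"open", "socket", "bind"},
    {"read", "write"},
]
subset = greedy_distinguishing_subset(target, others)
print(subset)
```

Under this reading the problem is a hitting-set variant, which is consistent with the NP-hardness result stated above; the greedy choice gives a small subset but not necessarily a minimum one.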
Keywords/Search Tags: Lexical Distinguishability, Syntax Usage, API Usage, Co-Evolution