
Inferring Semantic Information from Natural-Language Software Artifacts

Posted on: 2016-08-17
Degree: Ph.D.
Type: Dissertation
University: North Carolina State University
Candidate: Pandita, Rahul
Full Text: PDF
GTID: 1478390017981313
Subject: Computer Science
Abstract/Summary:
Specifications play an important role in software engineering for ensuring software quality. Not only do specifications guide the development process by outlining what and how to reuse, but they also help in the verification process by allowing testers to check expected outcomes. For instance, static and dynamic program-analysis tools use formal specifications such as code contracts or assertions to detect violations of these specifications as defects. Likewise, API migration tools use predefined mapping specifications to perform automated migration. While highly desirable, such formal specifications are often missing from existing software. Instead, these specifications are described in natural language in various software artifacts. In particular, Application Programming Interface (API) documents, which are targeted towards developers, are an invaluable source of information regarding code-level specifications. However, most existing developer-productivity tools and frameworks are not designed to process the natural-language descriptions in software artifacts.

The goal of this work is to improve developer, tester, and end-user productivity by accurately identifying specification sentences in the natural-language text of software artifacts and representing them formally through the adaptation of existing text-analysis techniques.

Since natural-language software artifacts are often verbose, manually writing formal specifications from them can be resource-intensive and error-prone. To address this issue, this dissertation presents a text-mining and natural language processing (NLP) framework that automates the inference of semantic information from the natural language in software artifacts, bridging the disconnect between the inputs required by software engineering tools and frameworks and the specifications described in natural-language software artifacts. Specifically, this dissertation reports on the process and effectiveness of applying text mining and NLP techniques to the following developer problems:

1. Specifications of method arguments and return values describe how to use a particular method within a class in terms of expectations on the method arguments (preconditions) and expected return values (postconditions). We propose a novel approach to infer these specifications from the natural-language text of API documents (a minimal sentence-classification sketch appears after this list). Our evaluation results show that our approach achieves an average of 92% precision and 93% recall in identifying sentences that describe such specifications from more than 2500 sentences of API documents. Furthermore, our results show that our approach has an average accuracy of 83% in inferring specifications from over 1600 specification sentences.

2. Temporal specifications of an API are the allowed sequences of method invocations in the API. We propose ICON, an approach based on machine learning and NLP for identifying and inferring formal temporal constraints to assist tool-based verification. Our evaluation results indicate that ICON is effective in identifying temporal-constraint sentences (from over 4000 human-annotated API sentences) with average precision, recall, and F-score of 79.0%, 60.0%, and 65.0%, respectively. Furthermore, our evaluation also demonstrates that ICON achieves an accuracy of 70% in inferring and formalizing 77 temporal constraints from these temporal-constraint sentences.

3. API mappings facilitate tool-based language migration. We propose TMAP, a text-mining-based approach to discover likely API mappings using the similarity of the textual descriptions in the source and target API documents (see the similarity sketch after this list). We evaluated TMAP by comparing the discovered mappings with those of state-of-the-art approaches based on source-code analysis, Rosetta and StaMiner. Our results indicate that TMAP on average found relevant mappings for 57% more methods than previous approaches. Furthermore, our results also indicate that TMAP found, on average, exact mappings for 6.5 more methods per class than previous approaches.

4. Keeping malware out of mobile application markets is an ongoing challenge. To assist third-party testers in assessing the risk of mobile applications, we present WHYPER, a framework that uses NLP techniques to identify sentences in an application description that describe the need for a given permission. WHYPER achieves an average precision of 82.8% and an average recall of 81.5% for three permissions (address book, calendar, and record audio) that protect frequently used security- and privacy-sensitive resources. These results demonstrate great promise in using NLP techniques to aid the risk assessment of mobile applications.
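To make contribution 1 above concrete, the following is a minimal, hypothetical Python sketch of flagging candidate precondition sentences in API documentation with simple lexical patterns. The pattern list and function name are illustrative assumptions only; the dissertation's actual approach applies richer NLP analysis rather than keyword matching.

    import re

    # Hypothetical lexical cues for precondition-style sentences; the
    # dissertation's approach uses richer NLP analysis, not this pattern list.
    PRECONDITION_PATTERNS = [
        r"\bmust not be null\b",
        r"\bmust be\b",
        r"\bshould (?:not )?be\b",
        r"\bcannot be\b",
        r"\bthrows? \w*exception if\b",
    ]

    def is_candidate_precondition(sentence: str) -> bool:
        """Flag a sentence as a likely precondition description."""
        lowered = sentence.lower()
        return any(re.search(p, lowered) for p in PRECONDITION_PATTERNS)

    # Example usage on Javadoc-style sentences.
    sentences = [
        "The name argument must not be null.",
        "Returns the number of elements in this list.",
        "Throws IllegalArgumentException if index is negative.",
    ]
    for s in sentences:
        print(is_candidate_precondition(s), "-", s)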
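The core idea behind TMAP in contribution 3, ranking candidate target methods by how similar their documentation text is to the source method's documentation, can be illustrated with the generic term-vector and cosine-similarity sketch below. The documentation snippets and method names are hypothetical placeholders, and this is a sketch of the general idea, not the dissertation's implementation.

    import math
    from collections import Counter

    def tokenize(text: str) -> list[str]:
        # Naive whitespace tokenizer; real API documents need fuller preprocessing.
        return [t for t in text.lower().split() if t.isalpha()]

    def cosine_similarity(a: Counter, b: Counter) -> float:
        common = set(a) & set(b)
        dot = sum(a[t] * b[t] for t in common)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Hypothetical documentation snippets for a source method and two
    # candidate target methods; TMAP works over full API documents.
    source_doc = "Returns the character at the specified index"
    target_docs = {
        "String.get_Chars": "Gets the character at a specified position",
        "String.Substring": "Retrieves a substring from this instance",
    }

    src_vec = Counter(tokenize(source_doc))
    ranked = sorted(
        ((cosine_similarity(src_vec, Counter(tokenize(doc))), name)
         for name, doc in target_docs.items()),
        reverse=True,
    )
    for score, name in ranked:
        print(f"{score:.2f}  {name}")

Higher-scoring target methods are treated as more likely mapping candidates for the source method.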
Keywords/Search Tags: Software, NLP techniques, Specifications, API, Inferring, Results, Information, Sentences