Concept exploration and discovery from business documents for software engineering projects using dual mode filtering

Posted on: 2015-02-11
Degree: D.Eng
Type: Thesis
University: Ecole de Technologie Superieure (Canada)
Candidate: Menard, Pierre Andre
GTID: 2478390020952232
Subject: Computer Engineering
Abstract/Summary:
This thesis presents a framework for the discovery, extraction and relevance-oriented ordering of conceptual knowledge based on its potential for reuse within a software project. The goal is to support software engineering experts in the first knowledge acquisition phase of a development project by extracting relevant concepts from the textual documents of the client's organization. This time-consuming task is usually done manually, which makes it prone to fatigue, errors, and omissions. The business documents are considered unstructured and are less formal and straightforward than software requirements specifications created by an expert. In addition, our research is done on documents written in French, for which text analysis tools are less accessible or advanced than those for English. As a result, the presented system integrates accessible tools in a processing pipeline with the goal of increasing the quality of the extracted list of concepts.

Our first contribution is the definition of a high-level process used to extract domain concepts, which can help software experts discover knowledge rapidly. To avoid undesirable noise from high-level linguistic tools, the process is mainly composed of positive and negative base filters, which are less error-prone and more robust. The extracted candidates are then reordered using a weight propagation algorithm based on structural hints from the source documents. When tested on French text corpora from public organizations, our process performs 2.7 times better than a statistical baseline for relevant concept discovery. We introduce a new metric to assess the discovery speed of relevant concepts. We also present a method to help obtain a gold standard definition of software engineering oriented concepts for knowledge extraction tasks.

Our second contribution is a statistical method to extract large and complex multiword expressions found in business documents. These concepts, which can sometimes be exemplified as named entities or standard expressions, are essential to the full comprehension of business corpora but are seldom extracted by existing methods because of their form, the sparseness of their occurrences and the fact that they are usually excluded by the candidate generation step. Current extraction methods usually do not target these types of expressions and perform poorly on their length range. This article describes a hybrid method based on the local maxima technique, with added linguistic knowledge to help the frequency count and the filtering. It uses loose candidate generation rules aimed at long and complex expressions, which are then filtered using n-gram semilattices constructed from the root lemmas of multiword expressions. Relevant expressions are chosen using a statistical approach based on the global growth factor of n-gram frequency. A modified statistical approach was used as a baseline and applied to two annotated corpora to compare the performance of the proposed method. The results indicated an increase of the average F1 performance by 23.4% on the larger corpus and by 22.2% on the smaller one when compared to the baseline approach.

Our final contribution helped to further develop the acronym extraction module, which provides an additional layer of filtering for the concept extraction. This work targets the extraction of implicit acronyms in business documents, a task that has been neglected in the literature in favor of acronym extraction for biomedical documents. Although there are overlapping challenges, the semi-structured and non-predictive nature of business documents hinders the effectiveness of the extraction methods used on biomedical documents, which fail to deliver the expected performance. Explicit and implicit acronym presentation cases are identified using textual and syntactical hints. Among the 7 features extracted from each candidate instance, we introduce "similarity" features, which compare a candidate's characteristics with average length-related values calculated from a generic acronym repository. Commonly used rules for evaluating the candidate (matching first letters, ordered instances, etc.) are scored and aggregated in a single composite feature, which permits a flexible classification. One hundred and thirty-eight French business documents from 14 public organizations were used for the training and evaluation corpora, yielding a recall of 90.9% at a precision level of 89.1% for a search space size of 3 sentences.
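
The idea of scoring common acronym rules and aggregating them into one composite feature can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual rules, scores or weights, so everything here is an assumption made for illustration: the function names (first_letter_score, ordered_score, composite_score), the choice of these two rules, and the equal weights are all hypothetical.

```python
def _initials(expansion):
    """Lower-cased first letter of each whitespace-separated word."""
    return [word[0].lower() for word in expansion.split() if word]


def _letters(acronym):
    """Lower-cased alphabetic characters of the acronym."""
    return [c.lower() for c in acronym if c.isalpha()]


def first_letter_score(acronym, expansion):
    """Fraction of acronym letters that are the initial of some word."""
    letters, initials = _letters(acronym), _initials(expansion)
    if not letters:
        return 0.0
    return sum(1 for c in letters if c in initials) / len(letters)


def ordered_score(acronym, expansion):
    """Fraction of acronym letters matched to word initials in left-to-right order."""
    letters, initials = _letters(acronym), _initials(expansion)
    if not letters:
        return 0.0
    position, matched = 0, 0
    for c in letters:
        # Search only to the right of the previous match to enforce ordering.
        for i in range(position, len(initials)):
            if initials[i] == c:
                matched += 1
                position = i + 1
                break
    return matched / len(letters)


def composite_score(acronym, expansion, weights=(0.5, 0.5)):
    """Weighted sum of the rule scores (weights are illustrative assumptions)."""
    scores = (
        first_letter_score(acronym, expansion),
        ordered_score(acronym, expansion),
    )
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    print(composite_score("ETS", "Ecole de Technologie Superieure"))
    print(composite_score("ETS", "Superieure Technologie Ecole"))
```

Because each rule contributes a graded score rather than a hard accept or reject, a candidate expansion that breaks one rule (for example, word order) still receives a non-zero composite value, which is one plausible reading of the "flexible classification" mentioned above.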
Keywords/Search Tags: Business documents, Discovery, Software engineering, Using, Extraction, Concept, Corpora, Expressions