
Developing a Cybersecurity Text Corpus and its Application for Augmenting Semantic Text Similarity

Posted on: 2015-06-29
Degree: M.S.
Type: Thesis
University: University of Maryland, Baltimore County
Candidate: Chavan, Manish Padmakar
Full Text: PDF
GTID: 2478390017492757
Subject: Computer Science
Abstract/Summary:
The growing use of cyber-services makes cybersecurity increasingly important. The Internet is a primary source of information about software flaws, vulnerabilities, cyber-attacks, and exploits. This information is available through vulnerability databases, news articles, security bulletins, and blogs. A variety of applications and security systems, such as Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS), can use this information to strengthen their infrastructure. However, the lack of a ready, high-quality text corpus of security information drawn from these sources makes it difficult for such applications to use it. To overcome this problem, our work focuses on building a multi-genre corpus of security text using information retrieved from multiple Internet-based sources: the National Vulnerability Database, Wikipedia articles, security blogs, security bulletins, and scholarly papers. The system builds a text classifier from the initial high-quality data, which is then used to classify new data from these sources and add it to the corpus.
This corpus can be used by a variety of applications, such as IDS or IPS, in a variety of ways, such as asserting facts into a knowledge base or extracting named entities. Our work explores one such application: generating a semantic text similarity model for cybersecurity text. We use the multi-genre cybersecurity text corpus to create a word co-occurrence model. This model can capture the synonymy between different security terms. For example, the words 'virus' and 'malware', which appear in the same contexts, are scored for their degree of similarity. The word co-occurrence model is then extended to generate a semantic text similarity model. The text similarity model measures the semantic similarity between different security texts, such as paper titles, vulnerability descriptions, and blog paragraphs.
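The idea behind a word co-occurrence model can be illustrated with a minimal Python sketch. The abstract does not state the window size, weighting scheme, or similarity measure used in the thesis, so the window of two tokens, the raw counts, and the cosine measure below are all assumptions, and the three-sentence corpus is invented for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy corpus (invented); a real corpus would span many documents and genres.
corpus = [
    "the virus infected the host system",
    "the malware infected the host system",
    "the patch fixed the buffer overflow",
]
vectors = cooccurrence_vectors(corpus)
sim_related = cosine(vectors["virus"], vectors["malware"])
sim_unrelated = cosine(vectors["virus"], vectors["patch"])
```

Here 'virus' and 'malware' share identical contexts, so they score higher than 'virus' and 'patch'. On a corpus this small, function words such as 'the' dominate the counts; a practical version would likely filter stop words and reweight the counts.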
The system also develops a combined text similarity model from the cybersecurity similarity model and a generic text similarity model. This combined model can be used in document mining for tasks such as matching security text and clustering documents that describe similar vulnerabilities.
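One way to picture the combined model is as a weighted blend of two text similarity scores. The abstract does not say how the thesis combines the two models, so the linear weighting, the `alpha` parameter, the best-match alignment in `text_similarity`, and the toy word scores below are all hypothetical.

```python
def text_similarity(text_a, text_b, word_sim):
    """Average, over both directions, of each word's best match in the other text."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0

    def directed(src, dst):
        return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)

    return 0.5 * (directed(tokens_a, tokens_b) + directed(tokens_b, tokens_a))

def make_word_sim(table):
    """Build a symmetric word-similarity function from a toy lookup table."""
    def word_sim(w, v):
        if w == v:
            return 1.0
        return table.get((w, v), table.get((v, w), 0.0))
    return word_sim

def combined_similarity(text_a, text_b, domain_sim, generic_sim, alpha=0.5):
    """Blend domain and generic scores; alpha is the (assumed) domain weight."""
    domain = text_similarity(text_a, text_b, domain_sim)
    generic = text_similarity(text_a, text_b, generic_sim)
    return alpha * domain + (1 - alpha) * generic

# Invented word scores standing in for the two trained models.
domain_sim = make_word_sim({("virus", "malware"): 0.9})
generic_sim = make_word_sim({("virus", "malware"): 0.4,
                             ("infects", "compromises"): 0.6})

score = combined_similarity("virus infects host", "malware compromises host",
                            domain_sim, generic_sim, alpha=0.7)
```

In this sketch the domain model recognizes 'virus' and 'malware' as close, while the generic model contributes the link between 'infects' and 'compromises', so the blended score reflects evidence neither model has alone.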
Keywords/Search Tags: Security, Text, Model, Information