Scalable machine learning using applications in bioinformatics and cybercrime
Posted on: 2016-10-23 | Degree: Ph.D | Type: Thesis | University: Southern Methodist University | Candidate: Drew, Jake M | GTID: 2478390017476450 | Subject: Computer Science

Abstract/Summary:
This thesis contributes multiple scalable machine learning applications in the fields of bioinformatics and cybercrime. A highly parallel framework for machine learning, called the Collaborative Analytics Framework, is also presented. The framework leverages shared memory to process large datasets efficiently. Applications to gene sequence classification in bioinformatics are implemented. In the gene sequence classification problem, unlabeled gene sequences are matched to sequences labeled with known taxonomies. Existing alignment-based methods are inefficient in practice and must use shorter word lengths to balance performance. Prior alignment-free methods do not scale efficiently as the number of trained sequences grows. A new alignment-free method, called Strand, is introduced. Strand achieves accuracy as good as or better than that of existing alignment-free methods, with improved speed and a smaller in-memory training database footprint. It does so by exploiting a form of lossy compression called minhashing as part of an in-memory MapReduce-style framework. Strand is also applied to shotgun classification challenges for the purpose of abundance estimation. Scalable machine learning is then applied to multiple cybercrime datasets. First, a method is presented to cluster criminal websites that are loose copies of one another. This general method is then applied to two specific cases, detecting thousands of copied Ponzi scheme and escrow fraud websites. Second, a binary classifier is developed to examine search results for luxury goods and identify websites selling knock-offs.
Finally, Strand is used to detect various classes of malware by treating each malware file's binary content as a gene sequence, successfully detecting large volumes of malware files with a high level of accuracy and processing efficiency.

Keywords/Search Tags: Scalable machine learning, Applications, Bioinformatics, Gene sequence, STRAND, Framework
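
The minhashing idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the word length `k`, the number of hash functions, the use of seeded BLAKE2b hashes, and all function names are assumptions chosen for illustration. The sketch splits a sequence into overlapping words, keeps only the minimum hash value per hash function as a compact signature, and estimates the Jaccard similarity of two sequences from the fraction of matching signature positions.

```python
import hashlib


def kmers(sequence, k):
    """Return the set of overlapping length-k words in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def seeded_hash(word, seed):
    """Deterministic 64-bit hash of a word; the seed selects the hash function."""
    digest = hashlib.blake2b(
        word.encode("ascii"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(sequence, k=8, num_hashes=64):
    """Compress a sequence into num_hashes minimum hash values (lossy)."""
    words = kmers(sequence, k)
    if not words:
        raise ValueError("sequence is shorter than the word length k")
    return [min(seeded_hash(w, seed) for w in words) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical sequences, used only to exercise the functions.
    labeled = "ACGTACGTTGCAACGTTAGCCGATAGCTAGGCTA"
    query = "ACGTACGTTGCAACGTTAGCCGATAGCTAGGATA"
    sig_labeled = minhash_signature(labeled, k=6)
    sig_query = minhash_signature(query, k=6)
    print(estimate_jaccard(sig_labeled, sig_query))
```

Because each signature has a fixed length regardless of sequence length, a training database of signatures stays small in memory, which is the general property the abstract attributes to minhashing. The same functions work on any string, so a byte sequence decoded to text could be treated the same way as a gene sequence.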