Benchmarking scripting languages, Microsoft .NET, and databases with a focus on text mining performance

Posted on: 2008-11-14
Degree: D.C.S
Type: Dissertation
University: Colorado Technical University
Candidate: Chadwick, Stephen C
Full Text: PDF
GTID: 1448390005973675
Subject: Computer Science
Abstract/Summary:
In an increasingly connected world, the ability to quickly extract and process data and turn it into useful information is becoming progressively more important. Text mining is focused on the extraction of information from unstructured data sources. Extraction speed is critical, but the time it takes to implement a viable solution is just as important. The focus of this research is to investigate these two areas of text mining.

This research demonstrates that interpreted languages are a feasible alternative to compiled languages for text mining. Seven modern languages were selected and five experiments were designed to investigate common text mining operations. The languages consisted of four interpreted languages (Ruby, Perl, VBS, and Python), one hybrid .NET language (IronPython), and one compiled language (C# running on both the .NET 1.1 and 2.0 platforms), all running on the Microsoft Windows operating system.

The goal was to establish that interpreted languages could accomplish the same text mining tasks as compiled languages while taking no more than twice as long in wall-clock time and requiring at least 25% fewer lines of code. This goal was met. Perl, Python, and, in some cases, Ruby proved to be acceptable alternatives on execution speed. In some experiments, such as string concatenation, the interpreted solutions actually executed faster than the compiled ones. The reduction in lines of code relative to the compiled solutions greatly exceeded the 25% target: Perl, Python, and Ruby showed reductions of 48, 58, and 54%, respectively.

An investigation into modern databases and their text mining performance was also part of this research. Specifically, Microsoft SQL Server 2000 (MSSQL2000), Microsoft SQL Server 2005 (MSSQL2005), and Oracle 10g were investigated. Oracle 10g's new regular expression syntax was also explored. In general, Oracle 10g currently offers the fastest overall performance for text mining queries. Stored procedures were also compared with issuing the SQL text directly. Stored procedures were found to be faster, although the difference was not significant in practical terms.

Finally, this research outlines a benchmarking methodology that is applicable to both newly emerging languages and database platforms, with a focus on text mining.
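The two acceptance thresholds stated in the abstract (no more than twice the wall-clock time, at least a 25% reduction in lines of code) can be illustrated with a minimal Python sketch. This is not the dissertation's benchmarking harness: the function names `wall_clock`, `meets_criteria`, and `concat_task`, the best-of-N timing strategy, and the sample numbers in the usage note are all assumptions made for illustration.

```python
import time


def wall_clock(task, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs of task().

    Taking the best of several runs is an assumption of this sketch, not
    necessarily the method used in the dissertation.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        best = min(best, time.perf_counter() - start)
    return best


def meets_criteria(interp_seconds, compiled_seconds, interp_loc, compiled_loc):
    """Apply the two thresholds from the abstract.

    Returns (passed, loc_reduction), where loc_reduction is the fraction of
    lines of code saved relative to the compiled solution.
    """
    time_ok = interp_seconds <= 2.0 * compiled_seconds
    loc_reduction = 1.0 - interp_loc / compiled_loc
    return time_ok and loc_reduction >= 0.25, loc_reduction


def concat_task(n=10000):
    """Hypothetical stand-in for a string concatenation experiment."""
    parts = []
    for i in range(n):
        parts.append(str(i))
    return "".join(parts)
```

As a usage example, `wall_clock(concat_task)` times the stand-in task, and `meets_criteria(1.8, 1.0, 52, 100)` checks a made-up interpreted solution that runs 1.8 times as long as the compiled one with 52 lines of code against 100. The numbers passed in are illustrative only and are not results from the research.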
Keywords/Search Tags:Text mining, Languages, Data, Focus, Microsoft, Net