
Authorship attribution on the Enron Email Corpus

Posted on: 2014-09-16
Degree: M.S.
Type: Thesis
University: Duquesne University
Candidate: Li, Xuan
Full Text: PDF
GTID: 2458390005994137
Subject: Statistics
Abstract/Summary:
In this paper I present authorship attribution on the Enron Email Corpus (Cohen, 2009). The emails were reformatted and divided into four test sets by length: Tiny (≤ 99 characters), Small (100 to 500 characters), Medium (501 to 999 characters), and Large (≥ 1000 characters). The tests were performed with the Java Graphical Authorship Attribution Program (JGAAP) from our Evaluating Variations in Language Laboratory (EVL Lab), using three analysis methods: WEKA RandomForest, WEKA SMO, and Centroid with Cosine Distance. The Large test set gave the best authorship classification, followed by the Medium, the Small, and then the Tiny test sets. WEKA SMO gave better authorship classification than WEKA RandomForest.
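The length-based split and the centroid method named above can be illustrated with a minimal Python sketch. This is not the JGAAP implementation and not the thesis's pipeline. The function names, the use of character bigram frequencies as features, and the toy training texts are all assumptions made for illustration. Only the four length thresholds come from the abstract.

```python
import math
from collections import Counter, defaultdict


def length_bucket(text):
    """Assign an email body to a test set using the abstract's thresholds."""
    n = len(text)  # assumption: length is counted in Python characters
    if n <= 99:
        return "Tiny"
    if n <= 500:
        return "Small"
    if n <= 999:
        return "Medium"
    return "Large"


def char_bigram_freqs(text):
    """Relative frequencies of character bigrams (an assumed feature set)."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def centroid(vectors):
    """Average a non-empty list of sparse feature vectors into one centroid."""
    summed = defaultdict(float)
    for vec in vectors:
        for gram, weight in vec.items():
            summed[gram] += weight
    return {gram: s / len(vectors) for gram, s in summed.items()}


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is empty."""
    dot = sum(w * b.get(gram, 0.0) for gram, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def attribute(text, centroids):
    """Return the author whose centroid is nearest in cosine distance."""
    vec = char_bigram_freqs(text)
    return min(centroids, key=lambda author: cosine_distance(vec, centroids[author]))


# Toy usage with made-up texts (hypothetical authors "A" and "B").
training = {
    "A": ["aaaa aaaa", "aaa aaa"],
    "B": ["zzzz zzzz", "zzz zzz"],
}
centroids = {
    author: centroid([char_bigram_freqs(t) for t in texts])
    for author, texts in training.items()
}
predicted = attribute("aaa", centroids)
```

In this sketch each author is represented by the average of that author's training vectors, and an unknown email is assigned to the author with the smallest cosine distance. Very short texts produce few bigrams, which is one plausible reason the Tiny test set would be hardest to classify, though the abstract does not state the cause.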
Keywords/Search Tags: Authorship, Email, WEKA