Authorship attribution on the Enron Email Corpus

Posted on: 2014-09-16

Degree: M.S.

Type: Thesis

University: Duquesne University

Candidate: Li, Xuan

GTID: 2458390005994137

Subject: Statistics

Abstract/Summary:

In this thesis I present authorship attribution experiments on an email corpus, the Enron Email Corpus (Cohen, 2009). After reformatting, the emails were divided into four test sets by length: Tiny (≤ 99 characters), Small (100 to 500 characters), Medium (501 to 999 characters), and Large (≥ 1000 characters). The tests were run with the Java Graphical Authorship Attribution Program (JGAAP) from our Evaluating Variations in Language Laboratory (EVL Lab). Three analysis methods were used: WEKA RandomForest, WEKA SMO, and Centroid with Cosine Distance. The Large test set gave the best authorship classification, followed by the Medium, Small, and Tiny test sets, in that order. WEKA SMO gave better authorship classification than WEKA RandomForest.
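
The length-based split and the centroid method named above can be illustrated with a short Python sketch. This is not JGAAP's implementation. The character thresholds come from the abstract, but the feature set (relative word frequencies), the function names, and the training-data layout are assumptions made only for illustration.

```python
import math
from collections import Counter


def length_bucket(text):
    """Assign an email body to a test set by character count.

    Thresholds follow the abstract: Tiny <= 99, Small 100-500,
    Medium 501-999, Large >= 1000.
    """
    n = len(text)
    if n <= 99:
        return "Tiny"
    if n <= 500:
        return "Small"
    if n <= 999:
        return "Medium"
    return "Large"


def features(text):
    """Assumed feature set: relative frequency of each lowercased word."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def centroid(vectors):
    """Average several sparse feature vectors into one centroid."""
    summed = {}
    for vec in vectors:
        for key, value in vec.items():
            summed[key] = summed.get(key, 0.0) + value
    return {key: value / len(vectors) for key, value in summed.items()}


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def attribute(text, training):
    """Pick the author whose centroid is nearest to the text.

    `training` is a hypothetical layout: {author: [email bodies]}.
    """
    query = features(text)
    centroids = {
        author: centroid([features(t) for t in texts])
        for author, texts in training.items()
    }
    return min(centroids, key=lambda author: cosine_distance(query, centroids[author]))
```

In this sketch each author is reduced to a single averaged vector, so attribution costs one distance computation per candidate author. Shorter emails yield sparser feature vectors, which fits the reported ordering of the test sets, although the sketch itself does not reproduce those results.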

Keywords/Search Tags:

Authorship, Email, WEKA

Related items

1	Research On Chinese Email Authorship Identification System
2	Authorship Attribution in the Enron Email Corpus
3	Secure Email Server System
4	Research On Chinese Spam Filtering Technology Based On Content Mining
5	Research On Authentication Of Online Authorship Or Article
6	S-Email:a Distributed End-To-End Secure E-Mail System
7	Email in style. Improving corporate email communications with employees at remote locations: A quantitative study
8	Research Of Database Access Log Based On Weka
9	Chat Mining For Authorship Verification
10	What is an author in the 'Sikuquanshu'? Evidential research and authorship in late Qianlong era China (1771--1795)