
Machine learning method for authorship attribution

Posted on: 2016-05-10
Degree: Ph.D.
Type: Dissertation
University: Michigan State University
Candidate: Hu, Xianfeng
Full Text: PDF
GTID: 1478390017479251
Subject: Mathematics
Abstract/Summary:
Broadly speaking, the authorship identification (or authorship attribution) problem is to determine the authorship of a given sample, such as a text or a painting. Our main work is to develop an effective and mathematically sound approach for analyzing books of disputed authorship.

Inspired by various authorship attribution problems in the history of literature and by the application of machine learning to literary stylometry, we develop a rigorous new method for the mathematical analysis of authorship that tests for a so-called chrono-divide in writing style. Our method incorporates some of the latest advances in the study of authorship attribution, particularly techniques from support vector machines. By introducing the relative frequencies of words and phrases as feature-ranking metrics, our method proves to be highly effective and robust.

Applying our method to the classical Chinese novel Dream of the Red Chamber yields convincing, if not irrefutable, evidence that the first 80 chapters and the last 40 chapters of the book were written by two different authors.

Applying our method to the English novel Micro, we confirm the existence of a chrono-divide and identify its location, which lets us differentiate the contributions of Michael Crichton and Richard Preston, the authors of the novel.

We have also applied our method to the other three Great Classical Novels of Chinese literature. As expected, no chrono-divides were found in these novels, which provides further evidence of the robustness of our method.

We also propose a new approach to authorship identification for the open class problem, in which the candidate group is nonexistent or very large. It scales reliably from a new method we have developed for the closed class problem, in which the pool of candidate authors is small. The closed class method uses support vector machines to analyze the relative frequencies of common words drawn from a function-word dictionary and from the most frequently used words. It scales to the open class problem through a novel author randomization technique, in which the author in question is compared repeatedly to randomly selected authors. The author randomization technique proves to be highly robust and effective. Using our approaches we have found answers to three well-known authorship controversies: (1) Did Robert Galbraith write The Cuckoo's Calling? (2) Did Harper Lee write To Kill a Mockingbird, or did her friend Truman Capote write it? (3) Did Bill Ayers write Obama's autobiography Dreams from My Father?
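
The two ideas named in the abstract, relative word frequencies as features and repeated comparison against randomly selected authors, can be illustrated with a minimal sketch. It is not the dissertation's implementation. The word list `FUNCTION_WORDS`, the function names, and the `rounds` and `seed` parameters are all hypothetical. To keep the example dependency-free, a nearest-centroid rule stands in for the support vector machine that the abstract says is used.

```python
import random
from collections import Counter

# Hypothetical, tiny function-word list; the abstract does not give
# the dictionary actually used in the dissertation.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]


def relative_frequencies(text, vocabulary=FUNCTION_WORDS):
    """Return, for each vocabulary word, its count divided by the token total."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def randomized_attribution(disputed, candidate_samples, pool, rounds=50, seed=0):
    """Fraction of rounds in which the disputed text is closer to the
    candidate author than to a randomly drawn author from the pool.

    `pool` maps author name -> list of sample texts. Nearest-centroid
    comparison is a stand-in for a two-class SVM (an assumption of this
    sketch, not the source's classifier).
    """
    rng = random.Random(seed)
    target = relative_frequencies(disputed)
    candidate = centroid([relative_frequencies(t) for t in candidate_samples])
    names = sorted(pool)
    wins = 0
    for _ in range(rounds):
        other_name = rng.choice(names)  # draw a random comparison author
        other = centroid([relative_frequencies(t) for t in pool[other_name]])
        if squared_distance(target, candidate) < squared_distance(target, other):
            wins += 1
    return wins / rounds
```

A score near 1 means the disputed text sided with the candidate author in almost every random pairing; a score near 0 means it rarely did. How such a score should be thresholded is not stated in the abstract and is left open here.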
Keywords/Search Tags: Authorship, Method, Problem, Write