Stylometric fingerprints and privacy behavior in textual data

Posted on: 2016-03-29
Degree: Ph.D
Type: Thesis
University: Drexel University
Candidate: Caliskan-Islam, Aylin
Full Text: PDF
GTID: 2478390017972512
Subject: Computer Science
Abstract/Summary:
Machine learning and natural language processing can be used to characterize and quantify aspects of human behavior expressed in language. Linguistic features exhibited in any kind of text can be used to study individuals' behavior as well as to identify an author among thousands of authors. The study of human behavior can be automated by combining machine learning techniques with well-engineered features that represent the behavior of interest. Human behavior analysis can be used to enhance security by detecting malware programmers, malicious users, or abusive holders of multiple accounts in online networks. At the same time, such automated analysis is a serious threat to privacy, especially for people who would like to remain anonymous. Nevertheless, privacy-enhancing technologies can be built by first understanding privacy-infringing methods in depth and then creating countermeasures.

Authorship attribution through stylometry, the study of writing style, achieves accuracy on translated or unconventional text that is as high as the state-of-the-art accuracy on English prose. Stylometry can also be applied to the more structured domain of programming languages through a robust and principled method introduced in this thesis. Code stylometry is able to de-anonymize thousands of programmers with high accuracy while providing insight into software engineering. Programmer de-anonymization can aid forensic analysis, the resolution of plagiarism cases, and copyright investigations. On the other hand, de-anonymizing programmers constitutes a privacy threat to anonymous contributors to open source repositories. Bridging the gap between natural language processing and machine learning is a powerful step towards designing feature sets that represent aspects of human behavior. Features obtained through natural language processing methods can be used to study the privacy behavior of users in large social networks. Aggregate privacy analysis shows that people with similar privacy behavior appear in clusters. This knowledge can be used to design privacy nudges and effective privacy-preserving technologies. Machine learning can be applied to any kind of textual data to automate the extraction of human behavior at large scale.
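
To make the idea of a stylometric fingerprint concrete, the following is a minimal sketch of authorship attribution. It is not the method of the thesis: it assumes a deliberately simple baseline in which each author is represented by character trigram counts and an unknown text is assigned to the candidate with the highest cosine similarity. The function names (`char_ngrams`, `cosine`, `attribute`) and the sample texts are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


def attribute(unknown_text, known_texts, n=3):
    """Return (best_author, scores) for an unknown text.

    known_texts maps each candidate author to a sample of their writing.
    This nearest-profile rule is an illustrative baseline only.
    """
    unknown = char_ngrams(unknown_text, n)
    scores = {
        author: cosine(unknown, char_ngrams(sample, n))
        for author, sample in known_texts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Hypothetical writing samples; real studies use far more text per author.
    samples = {
        "author_a": "I reckon the results are, on the whole, rather encouraging.",
        "author_b": "Results look good!!! gonna rerun it tmrw and see what happens",
    }
    best, scores = attribute("On the whole, I reckon this is rather good.", samples)
    print(best, scores)
```

A realistic system would draw on much richer feature sets (lexical, syntactic, and layout features) and a trained classifier; the sketch only shows how writing style can be reduced to a comparable numeric profile.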
Keywords/Search Tags: Behavior, Privacy, Machine learning, Natural language processing