Authorship Attribution In Social Media Texts

Posted on: 2022-12-16  Degree: Master  Type: Thesis
Country: China  Candidate: J B Liu  Full Text: PDF
GTID: 2518306764480124  Subject: Journalism and Media
Abstract/Summary:
Authorship attribution is an important branch of natural language processing that distinguishes texts written by different authors by measuring textual features. With the remarkable growth of information technology, the number of practical applications of authorship attribution has grown in several fields, such as criminal law, civil law, and computer security. Each author has particular habits that affect the form and content of their written work, and these characteristics can often be quantified and measured using machine learning algorithms. Early research mainly designed features by hand, but hand-designed features often capture only part of an author's writing style. Because deep learning methods can automatically extract rich text features, more and more studies in recent years have used them to solve the authorship attribution problem, but most of this research has verified its algorithms only on long-text datasets. Among the methods studied on short-text datasets, some use only character n-gram features to extract text content features, and a single feature type cannot fully represent writing style; others use character n-gram features as content features and constituency tree features as syntax features, but on short texts the paths in the constituency tree are shorter, which undoubtedly reduces the richness of the syntax features. Most current methods either pay no attention to the syntax features of the text or extract only shallow syntax features.

In response to these problems, this thesis proposes an authorship attribution model based on the syntactic dependency tree and the syntactic constituency tree. First, the structural features extracted from the dependency tree address the short constituency tree paths of short texts, while the constituency tree addresses the lack of syntax features in the word embedding vectors of the dependency tree; the two trees thus compensate for each other's weaknesses. Second, we propose a novel tree-structured feature to enrich the syntax features: we number each node in the tree, so that the tree can be recovered from each node's number and its parent's number, following the parent notation of trees. Finally, the content features extracted from character 2-grams, together with the syntax features, are used as writing style features, and experiments on multiple datasets verify the effectiveness of our model.

In response to the problem that current methods do not explore attention in depth for the authorship attribution task, we propose a model that combines multiple attention mechanisms. We use self-attention, hierarchical attention, and graph attention, each focusing on how different features of the text influence the writing style features. Attention emphasizes important features and down-weights unimportant ones. We validate the superiority of our method on multiple datasets.

Finally, based on the current state of research on authorship attribution, we further explore the challenges and future development trends of the field.
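
The two feature ideas named in the abstract, character 2-grams and a tree that can be recovered from node numbers and parent node numbers, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how nodes are numbered, so pre-order numbering, the nested `(label, [children])` tree format, the use of `-1` as the root's parent, and the function names `char_ngrams`, `to_parent_notation`, and `from_parent_notation` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (n=2 gives character 2-grams)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_parent_notation(tree):
    """Number nodes in pre-order (an assumed scheme) and return (labels, parents).

    `tree` is a nested tuple (label, [children]). parents[i] is the number of
    node i's parent; the root's parent is -1 (an assumed convention).
    """
    labels, parents = [], []

    def visit(node, parent_id):
        label, children = node
        node_id = len(labels)  # next free node number
        labels.append(label)
        parents.append(parent_id)
        for child in children:
            visit(child, node_id)

    visit(tree, -1)
    return labels, parents


def from_parent_notation(labels, parents):
    """Rebuild the nested tree from node numbers and parent node numbers."""
    children = defaultdict(list)
    root = None
    for node_id, parent_id in enumerate(parents):
        if parent_id == -1:
            root = node_id
        else:
            children[parent_id].append(node_id)

    def build(node_id):
        return (labels[node_id], [build(c) for c in children[node_id]])

    return build(root)


# A toy constituency-style tree for the sentence "I like tea".
toy_tree = ("S", [
    ("NP", [("I", [])]),
    ("VP", [("like", []), ("NP", [("tea", [])])]),
])

labels, parents = to_parent_notation(toy_tree)
restored = from_parent_notation(labels, parents)
bigrams = char_ngrams("banana")
```

Here `parents` is a flat integer sequence, which is the form a sequence model could consume as a structural feature, and `from_parent_notation(labels, parents)` returns a tree equal to `toy_tree`, showing that the numbering loses no structural information under these assumptions.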
Keywords/Search Tags:Natural Language Processing, Authorship Attribution, Syntax Tree, Attention Mechanism