Research On Authorship Identification For Chinese Texts

Posted on: 2013-03-27
Degree: Master
Type: Thesis
Country: China
Candidate: J Wan
GTID: 2248330395985129
Subject: Computer Science and Technology
Abstract/Summary:
Authorship identification is an active research field with applications in many areas, such as verifying the authorship of literary works, copyright protection, and forensic investigation of malicious e-mails in information security. In recent years, plagiarism of literary works and academic papers has become increasingly common. Compared with watermarking, authorship identification addresses this problem much more effectively. A crucial issue in authorship identification is finding a set of features that represents the writing style of a particular author. Owing to the special characteristics of Chinese, relatively few features have proved effective for identifying the authors of Chinese texts. Another problem is that identification performance degrades considerably when the number of candidate authors is large (more than 20).

In this paper, we investigate methods of Chinese authorship identification. To address the problems above, we propose two methods: Chinese authorship identification based on Dependency Grammar, and two-layer classification for Chinese authorship identification based on word senses. In the former method, we use Chinese natural language processing techniques to extract dependencies as syntactic features, and then extract three further kinds of features (function words, also called empty words; punctuation; and part of speech), which together with the dependencies form a large feature set. Because a large feature set usually contains noise that reduces identification accuracy, we use Principal Component Analysis (PCA) to optimize the feature set. The latter method employs a two-layer classification model in which a group layer is added. In the group layer, we propose a representation of Chinese authors based on word senses to obtain an author vector for each author, and a clustering algorithm then induces clusters, or groups, from these author vectors. Once the potential group has been identified, author identification is applied locally within that group. Identifying an author within a group that contains a limited number of authors (generally fewer than 20) is more accurate and more practical than classifying over the full set of authors.

We conduct experiments to demonstrate the effectiveness of the proposed methods. Used together with a Support Vector Machine, the experimental results show that dependencies are an effective feature, that PCA provides much better identification performance, and that the two-layer classification method based on word senses achieves higher accuracy than conventional identification methods applied directly to a large number of authors.
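
The two-layer scheme described above can be illustrated with a minimal sketch. It is not the thesis's implementation, and several pieces are assumptions made for illustration: the abstract does not name the clustering algorithm, so plain k-means is used here; the author vector is taken as the mean of an author's document vectors rather than a word-sense representation; and a nearest-centroid rule stands in for the Support Vector Machine in the second layer. The names `train`, `predict`, and `DOCS`, and the 2-D toy feature vectors, are hypothetical.

```python
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centers):
    """Index of the center closest to vec."""
    return min(range(len(centers)), key=lambda i: dist2(vec, centers[i]))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
    return [nearest(p, centers) for p in points], centers


def train(docs, k):
    """Build the group layer: one vector per author, clustered into k groups."""
    by_author = defaultdict(list)
    for author, vec in docs:
        by_author[author].append(vec)
    authors = sorted(by_author)
    # Assumption: an author vector is the mean of that author's document vectors.
    author_vecs = [centroid(by_author[a]) for a in authors]
    labels, group_centers = kmeans(author_vecs, k)
    groups = defaultdict(list)
    for author, vec, group in zip(authors, author_vecs, labels):
        groups[group].append((author, vec))
    return group_centers, dict(groups)


def predict(model, vec):
    """Layer 1 picks a group; layer 2 picks an author inside that group only."""
    group_centers, groups = model
    group = min(groups, key=lambda g: dist2(vec, group_centers[g]))
    # Stand-in for the SVM: nearest author vector within the chosen group.
    author, _ = min(groups[group], key=lambda av: dist2(vec, av[1]))
    return group, author


# Hypothetical 2-D style vectors for four authors, two documents each.
DOCS = [
    ("A", [0.0, 0.0]), ("A", [0.2, 0.0]),
    ("B", [1.0, 1.0]), ("B", [1.2, 1.0]),
    ("C", [10.0, 10.0]), ("C", [10.2, 10.0]),
    ("D", [11.0, 11.0]), ("D", [11.2, 11.0]),
]

model = train(DOCS, k=2)
print(predict(model, [0.9, 1.1]))
print(predict(model, [10.0, 9.9]))
```

The point of the design is that the second-layer classifier only ever has to separate the authors in one group, which keeps the number of candidate classes small even when the full author set is large.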
Keywords/Search Tags: Authorship Identification, Dependency Grammar, Word Sense Tags, Support Vector Machine, Principal Component Analysis, Cluster Algorithm