Research And Implementation On The Identification Of Authorship For Chinese Texts

Posted on: 2008-12-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Zhang
Full Text: PDF
GTID: 2178360242976300
Subject: Software engineering
Abstract/Summary:
The advance of computer technology has driven research in identification technologies, many of which have been applied in public security practice. Authorship identification for Chinese texts can effectively assist the police in determining who wrote a given text.

This paper proposes SM-CTAI, a multi-layer hybrid authorship identification model for Chinese texts based on the sequential minimal optimization (SMO) algorithm. In this model, texts are represented in three layers: characters, words, and sentences. The identification system built on the model consists of two components: training and identification. After texts are preprocessed, they are segmented into words and tagged with parts of speech. From the results of this processing, features are extracted in each of the three layers by calculation and formalization, so that each text is represented as a vector in the three-layer hybrid vector space. Once the training texts have been converted into vectors, the identification model is built. The texts to be tested are converted into vectors in the same way and identified by that model. The experiments show that, compared with methods based on single-layer features, this method achieves higher recall and precision.

This paper contributes new ideas in three respects. First, it proposes extracting high-dimensional character-layer features for text representation. Second, it uses hybrid combinations of character-, word-, and sentence-layer features, which extract more information from texts than single-layer features do. Third, the method can be applied to assist authorship identification in public security practice.
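The three-layer representation described in the abstract can be illustrated with a minimal sketch. The abstract does not list the thesis's actual features, so everything below is an assumption made for illustration: the function names, the choice of relative character frequencies, part-of-speech ratios and sentence-length statistics, the sentence delimiters, and the input format (text that has already been segmented and tagged as `(word, pos)` pairs). The sketch stops at the hybrid vector; training a classifier with SMO is not shown.

```python
from collections import Counter

# Assumed sentence delimiters; the thesis may use a different set.
SENTENCE_END = set("。！？")

def char_features(text, char_vocab):
    """Character layer: relative frequency of each character in char_vocab."""
    chars = [c for c in text if not c.isspace()]
    counts = Counter(chars)
    total = len(chars) or 1
    return [counts[c] / total for c in char_vocab]

def word_features(tagged_words, pos_tags):
    """Word layer: average word length plus the share of each POS tag.

    tagged_words is a list of (word, pos) pairs produced by some external
    segmenter and tagger (hypothetical input format).
    """
    total = len(tagged_words) or 1
    pos_counts = Counter(pos for _, pos in tagged_words)
    avg_len = sum(len(word) for word, _ in tagged_words) / total
    return [avg_len] + [pos_counts[tag] / total for tag in pos_tags]

def sentence_features(text):
    """Sentence layer: sentence count and average sentence length in characters."""
    sentences, current = [], []
    for c in text:
        if c in SENTENCE_END:
            if current:
                sentences.append("".join(current))
                current = []
        elif not c.isspace():
            current.append(c)
    if current:
        sentences.append("".join(current))
    n = len(sentences) or 1
    return [float(len(sentences)), sum(len(s) for s in sentences) / n]

def hybrid_vector(text, tagged_words, char_vocab, pos_tags):
    """Concatenate the three layers into one feature vector."""
    return (char_features(text, char_vocab)
            + word_features(tagged_words, pos_tags)
            + sentence_features(text))

# Usage with toy data (not from the thesis).
text = "我爱你。你爱我！"
tagged = [("我", "r"), ("爱", "v"), ("你", "r"),
          ("你", "r"), ("爱", "v"), ("我", "r")]
vec = hybrid_vector(text, tagged, char_vocab=["我", "爱", "。"], pos_tags=["r", "v"])
```

Vectors built this way for the training texts would be passed, together with author labels, to a support vector machine trained with SMO; the texts to be tested would be converted with the same vocabulary and tag list so that their dimensions line up.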
Keywords/Search Tags: text authorship identification, multi-layer hybrid, text representation, SMO algorithm