
Authorship Verification With Latent Dirichlet Allocation

Posted on: 2014-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: X J Meng
GTID: 2268330392969072
Subject: Computer Science and Technology
Abstract/Summary:
With the development of the Internet, the way people communicate has changed greatly. Instant messaging, as a modern means of communication, is becoming more and more popular and has become a primary channel in both work and daily life. However, it brings security holes along with convenience. When we chat with someone through an instant messaging tool, we rarely question that person's real identity. Many criminals therefore steal other people's IDs and passwords and then chat with us while posing as our friends. As a result, we may leak personal information or suffer financial loss.

The main research question of this paper is how to prevent this security problem so that people can communicate safely. Many instant messaging tools exist today, such as MSN, AOL, and QQ. These tools do provide some safety checks based on sensitive words such as "bank", "account", "buy", and "sell". In many cases, however, the criminals not only defraud victims of money but also seek personal information for use in illegal transactions. The essence of the solution is therefore to verify the identity of the other party correctly. Chat is conducted mainly through text; pictures, emoticons, and similar content are also sent, but text is the main form. The object of study in this paper is therefore text, namely chat logs. We verify identity by judging the differences among people in their manner of speaking and tone.

The main contributions of this paper are as follows. First, considering the particular nature of instant messages, we extract only modal particles, punctuation, auxiliary words, and other words that carry little meaning, and ignore nouns and adjectives. Second, feature extraction is no longer based on word frequency; topic features are used instead. Third, among the extracted topics, we delete those that have little effect on the final classification result and retain only those that strongly affect it. Fourth, because the topic model considers only features of the text content, we add structural features to the topic model and use the combined features to verify identity.

The experimental results show that, first, the topic model is suitable for verifying identity; second, sifting the topics improves the accuracy; and third, the text length, the number of topics, and the feature extraction method all affect the final result.
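The first and fourth contributions above can be illustrated with a rough Python sketch. It is not the thesis's implementation: the FUNCTION_WORDS list, the tokenizer, the four structural measurements, and all function names are assumptions made for this example, and the topic vector is assumed to come from a separately trained Latent Dirichlet Allocation model that is not shown here.

```python
import re

# Hypothetical list of low-content words; the thesis does not publish its list.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "but", "so",
                  "oh", "ah", "um", "well", "do", "is"}

# A word (optionally with an apostrophe part) or a single punctuation mark.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?|[^\w\s]")


def stylistic_tokens(message):
    """Keep function words and punctuation; drop content words."""
    tokens = TOKEN_RE.findall(message.lower())
    return [t for t in tokens if t in FUNCTION_WORDS or not t[0].isalpha()]


def structural_features(message):
    """Simple layout statistics of one message (assumed feature set)."""
    words = message.split()
    n_chars = len(message)
    n_punct = sum(1 for c in message if not c.isalnum() and not c.isspace())
    return {
        "n_chars": n_chars,
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "punct_ratio": n_punct / n_chars if n_chars else 0.0,
    }


def combined_features(topic_vector, message):
    """Concatenate a topic distribution with the structural features."""
    s = structural_features(message)
    # Sorted keys give a stable feature order across messages.
    return list(topic_vector) + [s[k] for k in sorted(s)]


if __name__ == "__main__":
    print(stylistic_tokens("Well, I do not know... ok!"))
    print(combined_features([0.7, 0.3], "hi there!"))
```

In this sketch the filtered tokens would be the input to the topic model, while the combined vector would be what a classifier sees; plain concatenation is only one possible way to mix the two feature types.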
Keywords/Search Tags: instant communication, topic model, authorship verification, feature selection