
Research And Implementation Of Incarnated Accounts Identification Technology In MicroBlog Application

Posted on: 2015-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: J L Bu
Full Text: PDF
GTID: 2348330509960759
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Chinese Micro Blog platforms, the views expressed on them subtly influence people's value judgments. People are the subjects of communication, and because Micro Blog platforms are open, one person can hold multiple virtual identities. Incarnated accounts are a special kind of account encountered during the supervision of public opinion: a single individual holds two or more accounts at different times on the same Micro Blog platform. This thesis studies how to identify such accounts. After surveying the information sources about Micro Blog users used in other research, we chose the account name, the posting time, and the post content as our information sources because of their high credibility. We then proposed a model that identifies incarnated accounts based on timing and similarity. The main contributions of this thesis are as follows:

(1) Because of limits on memory, knowledge, and time, and out of habit, a person tends to give his or her different accounts similar names. We studied the naming patterns of incarnated accounts and found that the target account name and its former account name(s) are highly similar. On this basis, we designed a candidate set generation algorithm: an account is added to the candidate set if its name contains at least one Chinese character that appears in the target account name. The algorithm meets the requirements of the model.

(2) Reflecting the specific nature of incarnated accounts, we proposed a Micro Blog timing tree algorithm. Former accounts generally published posts during a period before the active account did. We ordered the posting times chronologically, and the accounts formed a tree with the target account as the root and the suspected former accounts as the nodes.

(3) Building on a study of text similarity algorithms, we improved the cosine similarity algorithm for two applications. The first is very short text such as account names, whose similarity cannot be established from shared words alone but also depends on text structure. We therefore combined the edit distance algorithm with cosine similarity and obtained better results. The second is post similarity. We assumed that posts share a topic if they contain the same named entities, so we partitioned the text vector space into a named-entity vector space and a space for the remaining features, and increased the weights of the named entities. The algorithm can be extended to topic-based text similarity calculation.

(4) Based on the above algorithms, we proposed a model tailored to the characteristics of incarnated accounts that uses timing and similarity, and verified its effectiveness on Sina Micro Blog. The framework consists of two major modules: Identity Search and Identity Matching. The first module uses the candidate set generation algorithm to produce the candidate set while avoiding missing the real account. The second module matches the candidates using the timing and similarity algorithms to eliminate, as accurately as possible, the accounts that are not the real ones.

We also implemented the model in software and verified its effectiveness on Sina Micro Blog data. The model is readily portable across platforms because the source information it uses involves neither private information nor information that is difficult to obtain. Finally, based on an analysis of the results, we presented a feasible approach to improving the model's performance and suggested a direction for further study.
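
The candidate set rule in (1) and the two cosine similarity refinements in (3) can be illustrated with a minimal Python sketch. The abstract does not give formulas, so several details here are assumptions rather than the thesis's method: all function names are hypothetical, "Chinese character" is taken to mean the CJK Unified Ideographs block, the name score is a simple linear blend of character-level cosine similarity and normalized edit distance controlled by an assumed `alpha`, and named entities are boosted by an assumed constant `entity_weight`. Tokenization and named entity recognition are taken as given inputs.

```python
import math
from collections import Counter


def is_chinese_char(ch):
    """True for characters in the CJK Unified Ideographs block (assumed range)."""
    return "\u4e00" <= ch <= "\u9fff"


def generate_candidates(target_name, account_names):
    """Keep names that share at least one Chinese character with the target."""
    target_chars = {ch for ch in target_name if is_chinese_char(ch)}
    return [
        name for name in account_names
        if name != target_name and target_chars & set(name)
    ]


def edit_distance(a, b):
    """Levenshtein distance, computed one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * vec_b.get(key, 0.0) for key, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def name_similarity(a, b, alpha=0.5):
    """Blend order-free cosine with order-aware edit distance (assumed blend)."""
    if not a and not b:
        return 1.0
    order_free = cosine(Counter(a), Counter(b))
    order_aware = 1.0 - edit_distance(a, b) / max(len(a), len(b))
    return alpha * order_free + (1 - alpha) * order_aware


def entity_weighted_similarity(tokens_a, tokens_b, entities, entity_weight=2.0):
    """Cosine over term counts, with named entities scaled by entity_weight."""
    def vectorize(tokens):
        return {
            tok: count * (entity_weight if tok in entities else 1.0)
            for tok, count in Counter(tokens).items()
        }
    return cosine(vectorize(tokens_a), vectorize(tokens_b))
```

In this sketch, two names made of the same characters in a different order score 1.0 under plain character-level cosine but lower under `name_similarity`, which is one way to capture the "text structure" the abstract refers to. Likewise, two posts that share only a named entity score higher under `entity_weighted_similarity` than under unweighted cosine.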
Keywords/Search Tags:MicroBlog, matching and verification, Incarnated Accounts, timing, similarity