
A Research Of Timing Events Based On Personal Micro-blog

Posted on: 2015-10-02  Degree: Master  Type: Thesis
Country: China  Candidate: Z M Nie  Full Text: PDF
GTID: 2298330422990187  Subject: Computer application technology
Abstract/Summary:
Micro-blogging, an emerging social media service, has become an important platform for sharing information and exchanging emotions, and it now reaches into and influences every aspect of people's lives. Micro-blog data carry a person's history and emotions, because most personal micro-blog content records the author's life experiences, professional interests, and discussion of hot topics. Because micro-blogs are real-time, convenient, and at times nearly instantaneous to post, the personal micro-blog has gradually replaced the diary as a time-ordered, partial record of a person's life. Over time, however, the volume of posts grows very large, and anyone who wants to understand a blogger can only browse the history post by post, which wastes time. Micro-blog classification is proposed to answer the question of how to understand a blogger's activity quickly and accurately. In classification, accuracy depends on how precisely the similarity between micro-blogs is measured, so improving the accuracy of micro-blog similarity is the research focus of this thesis.

Personal micro-blogs are large in volume, individually short, and arbitrary in content, so traditional classification methods and information extraction algorithms have limitations when processing them. Because a single short micro-blog has few valid features and colloquial content, the author expands the text's feature words with similar words to minimize the chance of missing features, and then proposes a comprehensive similarity algorithm that integrates an improved Jaccard similarity with cosine similarity.

First, the micro-blog data is filtered to remove text that carries no information, along with unrelated links, images, and the like. Then ICTCLAS, the Chinese lexical analysis system from the Chinese Academy of Sciences, is used to segment the text into words, tag parts of speech, and filter out stop words and expression words. Second, to improve the accuracy of micro-blog similarity, an improved TF-IDF algorithm is used to extract micro-blog feature words and an LDA topic model is used to construct a similar-word template. The CHI feature selection evaluation function measures the importance of each feature word to each category, and TF-IDF values are computed to extract feature words after making the feature words conform to a uniform distribution in the text. Third, based on the extracted feature words and the similar-word template, Jaccard similarity and cosine similarity are combined to compute a comprehensive similarity between personal micro-blogs. This overcomes the inadequacy of traditional approaches that rely only on word co-occurrence, and it measures the similarity of two micro-blogs more deeply and comprehensively by drawing on similar-word features, individual feature values, and other aspects. Finally, the K-Means algorithm is used to classify the timing events in an individual's micro-blogs, so that micro-blogs on the same topic fall into the same set.

The experimental results show that the proposed comprehensive similarity algorithm has higher precision than traditional similarity measures. To some extent, it improves the accuracy of timing-event classification for personal micro-blogs.
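The idea of combining Jaccard similarity and cosine similarity can be illustrated with a minimal Python sketch. The abstract does not specify the improved Jaccard formula or how the two scores are integrated, so this sketch makes several assumptions: it uses plain Jaccard over token sets, cosine over raw term-frequency vectors, and a weighted sum with a hypothetical mixing weight `alpha`. The function names are illustrative only, and the input is assumed to be already segmented and filtered.

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Plain Jaccard similarity over the sets of tokens (word co-occurrence only)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def cosine(tokens_a, tokens_b):
    """Cosine similarity over raw term-frequency vectors."""
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both posts contribute to the dot product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(tokens_a, tokens_b, alpha=0.5):
    """Weighted sum of the two scores.

    alpha is a hypothetical mixing weight chosen for illustration;
    the thesis abstract does not state how the scores are integrated.
    """
    return alpha * jaccard(tokens_a, tokens_b) + (1 - alpha) * cosine(tokens_a, tokens_b)


if __name__ == "__main__":
    post_a = ["train", "ticket", "home", "holiday"]
    post_b = ["holiday", "home", "family", "dinner"]
    print(combined_similarity(post_a, post_b))
```

In a pipeline like the one described, the token lists would be the extracted feature words, possibly expanded with entries from the similar-word template, and the resulting pairwise scores would feed the clustering step.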
Keywords/Search Tags: Feature words, Similar word template, Similarity, Event classification