
Research Of Twitter Retrieval Based On Semantic Similarity Computing And Twitter Storm Platform

Posted on: 2015-10-06
Degree: Master
Type: Thesis
Country: China
Candidate: H F Xiao
Full Text: PDF
GTID: 2298330452950784
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet industry, micro-blogging products are gaining popularity both at home and abroad. By providing users with centralized and open social networking services, they have gradually developed into a new type of media with increasingly high influence. Given the large scale and real-time nature of micro-blogging data, providing users with the information they are interested in from massive, dynamically updated micro-blogging data has become particularly important.

The micro-blog retrieval and sorting method discussed in this paper is based on short-text feature expansion and similarity calculation. The paper is structured as follows. First, each micro-blog (here, a tweet) is expanded (made longer) to enrich its semantic features, which helps ensure the relatedness between the query text and the retrieved results. Second, we compute the similarity between micro-blogs using the WordNet dictionary, aiming for relatively high precision and recall. Third, the similarity value computed in the previous step is used as the sorting criterion in a simulated real-time micro-blog retrieval environment, which performs micro-blog retrieval and sorting and provides a list of related micro-blogs for each micro-blog retrieved.

To enrich the semantic features of micro-blogs, we take the nouns in a micro-blog as representative keywords that express its topics, and expand these nouns with associated words and phrases to enlarge the micro-blog. Specifically, Wikipedia is chosen as the source of semantic features for expansion. For each noun in a micro-blog, we use it as a query in Wikipedia, locate the "category" entry on the search result page, and take the words listed under it (the categories to which the noun is assigned) as additional explanatory words appended to the original micro-blog. Experiments are conducted to show that this expansion improves the quality of the similarity calculation to a certain degree.

To obtain higher accuracy and precision, this paper takes full advantage of the special structure of WordNet, an online English word database, when computing the semantic similarity between micro-blogs. Specifically, we use the path-length-based method proposed in [37], which takes into consideration both the node path length and the least common subsumer in WordNet. We also conduct experiments comparing our method with the traditional cosine similarity method based on the vector space model, to verify that the former improves precision and recall in finding related micro-blogs to some extent.

To simulate a real-time micro-blog retrieval system, we studied the architecture and application of the open-source real-time data processing platform Twitter Storm carefully, and simulated real-time, distributed processing in local mode. Specifically, we defined our own micro-blog retrieval topology that can be embedded into the Twitter Storm platform and implemented the function of each component in the topology, including preprocessing of the original tweet dataset, information transmission between components, parallel computation of tweet similarity across many components, maintenance of the similarity table, sorting of retrieved results by similarity value, and providing related micro-blogs for each micro-blog in the search results.
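The similarity and sorting steps can be illustrated with a minimal sketch. The abstract does not give the formula from [37], so the code below substitutes the well-known Wu-Palmer measure, which also combines path length with the depth of the least common subsumer; this is an assumption, not the thesis's method. The tiny `PARENT` taxonomy stands in for WordNet, and the noun-list aggregation in `text_similarity` and all function names are hypothetical.

```python
# Toy is-a taxonomy (child -> parent); a hypothetical stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}


def ancestors(word):
    """Return the chain from word up to the root, word first."""
    chain = []
    while word is not None:
        chain.append(word)
        word = PARENT[word]
    return chain


def depth(word):
    """Depth counted in nodes, so the root has depth 1."""
    return len(ancestors(word))


def least_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):  # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    return None


def path_length(a, b):
    """Number of edges on the path from a to b through their subsumer."""
    lcs = least_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))


def word_similarity(a, b):
    """Wu-Palmer score (assumed, not from [37]): 2d / (L + 2d)."""
    if a not in PARENT or b not in PARENT:
        return 0.0  # words outside the taxonomy contribute nothing
    d = depth(least_common_subsumer(a, b))
    return 2.0 * d / (path_length(a, b) + 2.0 * d)


def text_similarity(nouns_a, nouns_b):
    """Average best match of each noun in A against B (an assumed aggregation)."""
    if not nouns_a or not nouns_b:
        return 0.0
    best = [max(word_similarity(a, b) for b in nouns_b) for a in nouns_a]
    return sum(best) / len(best)


def rank(query_nouns, tweets):
    """Return tweet ids sorted by descending similarity to the query."""
    scored = [(text_similarity(query_nouns, nouns), tid)
              for tid, nouns in tweets.items()]
    return [tid for _, tid in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    tweets = {"t1": ["car"], "t2": ["cat"], "t3": ["dog", "car"]}
    print(rank(["dog"], tweets))
```

In this sketch the noun lists would already include any words added by the Wikipedia category expansion, so the expansion affects ranking only through the extra nouns available for matching.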
Keywords/Search Tags: Twitter, Weibo, Semantic expansion, Similarity computing, WordNet, Twitter Storm