Twitter is one of the world’s leading social media platforms, and its users continuously generate huge volumes of user-generated content (UGC). Exploring the distribution and correlations in Twitter data helps make personalized recommendation more accurate. In recent years, numerous studies have explored the interests of Twitter users. However, most of them can identify only a single dimension from tweets rather than the whole dimension hierarchy. Meanwhile, in the field of business intelligence, online analytical processing (OLAP), a technology for analyzing and processing massive data, allows users to explore data interactively from different dimensions and offers an intuitive form well suited to exploring Twitter data. To apply OLAP to Twitter data, we present an efficient method for cleaning Twitter data based on multi-feature fusion and an LDA-based method for extracting the user interest dimension hierarchy. The former scores each tweet based on its text features, social characteristics, and topic features. We then eliminate noise tweets whose scores fall below a threshold, laying the foundation for the subsequent user interest extraction. To account for the differences between tweets and traditional documents, the latter modifies LDA by redefining the generation process of tweets and adding a sub-interest level and the semantic features of the words in tweets, and thereby constructs an interest-oriented topic model, MS-LDA, which combines users’ tweets and social information. We then construct an interest dimension hierarchy suitable for OLAP from the sub-interests and interests mined by MS-LDA in order to explore Twitter users’ interests. Finally, we conduct extensive experiments on a large real-world data set to verify that our method can effectively extract an interest dimension hierarchy suitable for OLAP operations such as roll-up and drill-down. Moreover, compared with other similar methods, our method recognizes the interest dimension with higher precision and coverage.
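
The cleaning step described above (score each tweet from several feature groups, then discard tweets below a threshold) can be illustrated with a minimal sketch. The abstract does not specify how the features are fused, so the weighted linear combination, the weights, the threshold value, and all names below (`Tweet`, `fused_score`, `clean`) are assumptions made for illustration only, not the method evaluated in this work.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Tweet:
    text: str
    # Hypothetical per-group feature scores, each assumed to lie in [0, 1].
    text_score: float    # quality of the tweet text itself
    social_score: float  # social characteristics (e.g. of the author)
    topic_score: float   # relevance to a topic


# Assumed fusion weights and cut-off; the source does not give these values.
WEIGHTS: Tuple[float, float, float] = (0.4, 0.3, 0.3)
THRESHOLD: float = 0.5


def fused_score(tweet: Tweet, weights: Tuple[float, float, float] = WEIGHTS) -> float:
    """Fuse the three feature groups with a weighted sum (an assumed scheme)."""
    w_text, w_social, w_topic = weights
    return (w_text * tweet.text_score
            + w_social * tweet.social_score
            + w_topic * tweet.topic_score)


def clean(tweets: List[Tweet], threshold: float = THRESHOLD) -> List[Tweet]:
    """Keep tweets scoring at or above the threshold; drop the rest as noise."""
    return [t for t in tweets if fused_score(t) >= threshold]


if __name__ == "__main__":
    sample = [
        Tweet("New results on topic models for short text", 0.9, 0.8, 0.7),
        Tweet("lol", 0.1, 0.2, 0.1),
    ]
    for kept in clean(sample):
        print(kept.text)
```

In this sketch the threshold controls the trade-off between removing noise and retaining tweets for the later interest extraction; a real system would need to calibrate both the weights and the threshold on labeled data.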