
Temporal Keyword Search Over Social Media Data

Posted on: 2017-01-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Xia
Full Text: PDF
GTID: 1108330485969051
Subject: Computer application technology
Abstract/Summary:
Social media services have become one of the most popular kinds of Internet service. People use them to record their daily experiences and to share or comment on others’ statuses. As data accumulate continuously, these long-spanning data are useful for studying users’ collective behavior and for obtaining a comprehensive understanding of people or events. As an easy-to-use tool, keyword search is also used to retrieve information from massive social media data. To track the development of events, people may repeatedly submit the same keyword query to obtain the recent status of those events. To get the full detail of the objects under analysis, analysts need to collect data from different periods. However, existing keyword search technologies emphasize real-time keyword search. The publish time attached to social objects is used merely to measure their recency.

In this thesis, the social media stream model is proposed to model original data, sharing, and comments. It defines a reference time series for each social object. Given the model, the temporal keyword search uses keywords as the content predicate. The sum of the values in the time series that fall within the query time window is computed and passed to the corresponding ranking function, and the query returns the top-k ranked items. The query promotes the time dimension to a constraint in order to serve event-tracking and exploratory-analysis scenarios. Index structures and query algorithms are proposed from two aspects: leveraging the characteristics of social media data in the off-line indexing scenario, and improving the efficiency of index updating in the on-line indexing scenario. Finally, the rise and fall of Sina Weibo is analyzed to unveil the change in information propagation using time series analysis techniques. In addition, an on-line analytic platform is built over the real-time social media stream. Both demonstrate applications of the proposed query.
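The query semantics described above can be illustrated with a brute-force sketch. This is not the thesis's indexed algorithm. The class and function names, the conjunctive (all-keywords) matching, the integer timestamps, and the identity ranking function are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
import heapq


@dataclass
class SocialObject:
    obj_id: str
    text: str
    # Reference time series (assumed shape): timestamp -> number of
    # references (shares or comments) received at that time.
    references: dict = field(default_factory=dict)


def window_score(obj, start, end):
    """Sum the reference counts that fall inside [start, end]."""
    return sum(count for t, count in obj.references.items() if start <= t <= end)


def temporal_keyword_search(objects, keywords, start, end, k, rank=lambda s: s):
    """Return the top-k (rank value, object id) pairs for the query.

    Keywords act as the content predicate (assumed: every keyword must
    appear in the text); the time window restricts which references count.
    """
    terms = [w.lower() for w in keywords]
    scored = []
    for obj in objects:
        words = set(obj.text.lower().split())
        if all(term in words for term in terms):
            scored.append((rank(window_score(obj, start, end)), obj.obj_id))
    return heapq.nlargest(k, scored)


# Hypothetical data: the same keyword ranks differently in different windows.
objects = [
    SocialObject("a", "earthquake relief update", {1: 5, 2: 30, 9: 1}),
    SocialObject("b", "earthquake photos", {8: 12, 9: 20}),
    SocialObject("c", "football final", {2: 100}),
]
early = temporal_keyword_search(objects, ["earthquake"], start=1, end=3, k=2)
late = temporal_keyword_search(objects, ["earthquake"], start=8, end=10, k=2)
```

Here object "a" leads in the early window and object "b" in the late one, which is the event-tracking behavior the query is meant to support. A real system would replace the linear scan with the index structures the thesis proposes.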
The main contributions are listed as follows:

· A two-tier index structure and a piecewise maximum approximation sketch are presented to exploit the characteristics of social media data. The distributions of reference tree size and lifespan in social media data follow a power law. In addition, social objects are typically heavily referenced during short periods but stay cold for the rest of their lifespan. Based on these two characteristics, the two-tier index structure and the piecewise maximum approximation sketch are proposed. The two-tier index structure uses different structures to manage popular items and trivial items separately. Both structures are capable of filtering items using the temporal constraint. The selected items are returned in reverse order of their final reference tree size. The complexity analysis based on the power law distribution gives an upper bound on the number of items accessed by the proposed algorithm. An analysis of a real-life dataset shows that the upper bound is sublinear with respect to k for most queries. Furthermore, the piecewise maximum approximation sketch is presented to give a tighter estimate of an item’s score in the query window. It avoids score computation for long-living items that are cold in the query window.

· A log-structured merge octree is designed to evaluate real-time temporal keyword queries. Another important characteristic of social media data is the high velocity at which user data is generated. The situation is exacerbated during emergent events such as disasters or popular sports matches. In the on-line indexing scenario, consuming those data efficiently and presenting them in the search results in a timely manner not only improves the experience of ordinary users but also provides evidence for timely decision making. In this thesis, the segments of social objects are mapped to points in three-dimensional space.
The log-structured merge octree uses the octree to retain both the temporal locality and the importance locality of social items. The encoding scheme of the octree nodes supports pruning the data space along the time dimension. It also supports the access order of social items required by the temporal threshold algorithm. Through combination with the log-structured merge tree, the index benefits from the high speed of memory access and the improved efficiency of sequential disk access.

· Applications based on the proposed temporal keyword query are given. The behavior of a fixed group of Sina Weibo users over a long time span is analyzed to unveil the change in information diffusion behind the rise and fall of Sina Weibo. Based on the popularity model of a single microblog, the log-normal distribution is used to fit the parameters of the popularity model of a batch of tweets. One of the parameters of the log-normal model is used as a measurement reflecting some aspect of information propagation. Features based on user behavior in Sina Weibo and on attitudes toward different social media platforms are defined. The relationship between these features and the defined measurement is analyzed. The temporal keyword query helps discover hot events and understand user behavior in different time periods. Finally, an on-line analytic platform is built on the real-time microblog stream collected from Sina Weibo. It can cluster the results of temporal keyword queries into topics and present basic statistics for each topic.

In summary, the temporal keyword query is formalized, which extends existing keyword search over social media data. Novel index structures and query algorithms are explored. They either exploit the characteristics of social media data or improve the update efficiency of the index. Two analytic applications demonstrate that the proposed query can be flexibly extended to various analytic scenarios.
The query helps dig critical information out of social media data and thus provides data for more complex analytic jobs. Finally, an on-line system incorporating the proposed index and analytic techniques is made publicly available. It helps researchers and analysts benefit from massive real-time social media data.
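The idea behind the piecewise maximum approximation sketch mentioned in the contributions can be illustrated as follows. This is a minimal sketch under stated assumptions, not the thesis's actual data structure: it assumes a reference time series stored as one count per discrete time slot and pieces of fixed length, and the function names are hypothetical. Keeping only the maximum of each piece yields an upper bound on an item's score in any query window, so an item that is cold in the window can be skipped without reading its full series.

```python
def build_pma_sketch(series, piece_len):
    """Keep the maximum count of each fixed-length piece (assumed layout)."""
    return [max(series[i:i + piece_len]) for i in range(0, len(series), piece_len)]


def upper_bound(sketch, piece_len, series_len, start, end):
    """Upper-bound sum(series[start:end + 1]) using only the sketch.

    Each slot in the window contributes at most the maximum of its piece.
    """
    start = max(start, 0)
    end = min(end, series_len - 1)
    if start > end:
        return 0
    bound = 0
    for p in range(start // piece_len, end // piece_len + 1):
        lo = max(start, p * piece_len)             # first window slot in piece p
        hi = min(end, (p + 1) * piece_len - 1)     # last window slot in piece p
        bound += sketch[p] * (hi - lo + 1)
    return bound


# Hypothetical item: heavily referenced early, then cold for the rest of its life.
series = [0, 50, 40, 3, 0, 0, 0, 1, 0, 0, 0, 0]
sketch = build_pma_sketch(series, piece_len=4)

# A query over the cold tail gets a bound of zero, so the exact score
# never needs to be computed for this item.
cold_bound = upper_bound(sketch, 4, len(series), start=8, end=11)
hot_bound = upper_bound(sketch, 4, len(series), start=2, end=5)
```

With fixed-length pieces the bound can be loose around bursts (here 102 against an exact score of 43 for slots 2 to 5), which is the trade-off between sketch size and tightness.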
Keywords/Search Tags:Social Media, Top-k Query, Temporal Keyword Query, Time Series Data