
Research On Data Characterization

Posted on: 2014-02-16    Degree: Doctor    Type: Dissertation
Country: China    Candidate: C Sun    Full Text: PDF
GTID: 1268330398486225    Subject: Computer software and theory
Abstract/Summary:
There are many applications of data characterization in the data management and data mining areas. Data characterization reduces the scale of raw data while preserving its characteristics. Depending on the data type and the application environment, data characterization falls into three categories: characterization of value distributions, characterization of relations, and characterization of structured data. Characterization of value distributions is used mainly by query engines for selectivity estimation and includes the histogram method and the wavelet method. Characterization of relations, also called table summarization, is used mainly in data archiving and in the exploration of huge tables (e.g., exploring huge tables on a handheld device); semantic table summarization is currently the dominant method. Many objects in the real world are modeled as graphs, but the information encoded in massive graphs is hard to understand; semantic graph summarization helps with this.

These problems share common features: the size of the dataset is reduced, the information loss of the characterization procedure is controlled, and the quality of the result is measured by some error function. The main challenges are improving the quality of the summarized data, improving the space and time efficiency of the summarization algorithms, and, moreover, adapting the general algorithms to specific applications.

The L2-norm error measure is based on the assumption that every point in the value distribution is equally likely to be accessed. In real applications this assumption does not hold; access patterns are typically skewed, as described by the famous Pareto's law. The error measure can therefore be modified to take the access distribution of the workload into account, improving the quality of the histogram. On the other hand, optimal histogram construction algorithms usually run in polynomial time, while some environments require a linear-time algorithm; in such cases quality can be sacrificed to build an approximately optimal histogram.

Existing wavelet synopsis construction algorithms focus on which coefficients, among those produced by the wavelet transform of the raw dataset, should be retained; classic methods assign all unselected coefficients a default value of zero. It turns out, however, that setting the default value well is also important for the quality of the wavelet synopsis. Methods for calculating the default value under the L2 and minimax error measures are proposed.

The existing approach to table summarization translates it into a set-covering problem and spends so much effort on the translation that it is impractical. Instead, a metric space over tuples is defined from the hierarchical structure of their attributes, and the problem is modeled as clustering in this hierarchical space. Two algorithms are proposed: a hierarchical agglomerative method and a method based on adjusting the resolution of the hierarchical space.
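To make the workload-aware error measure concrete, the following is a minimal sketch in Python, assuming a simple bucket histogram whose representative for each bucket is the access-weighted mean. The function names and the toy access distribution are illustrative, not the dissertation's exact formulation.

```python
# Sketch: workload-weighted L2 error for a bucket histogram.
# `access_prob[i]` is the estimated probability that point i is queried; the
# classic L2 error is the special case where all probabilities are equal.
# Illustrative only -- not the dissertation's exact construction.

def weighted_l2_error(values, boundaries, access_prob):
    error = 0.0
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        bucket, probs = values[lo:hi], access_prob[lo:hi]
        total = sum(probs)
        if total == 0:
            continue  # nobody queries this bucket, so it contributes no error
        # Under the weighted error, the best single representative for a
        # bucket is the access-weighted mean, not the plain mean.
        rep = sum(p * v for v, p in zip(bucket, probs)) / total
        error += sum(p * (v - rep) ** 2 for v, p in zip(bucket, probs))
    return error

values = [1, 2, 2, 3, 10, 11, 12, 50]
boundaries = [0, 4, 7, 8]                      # buckets [0,4), [4,7), [7,8)
uniform = [1 / 8] * 8                          # classic uniform-access assumption
skewed = [0.4, 0.3, 0.1, 0.1, 0.03, 0.03, 0.02, 0.02]  # Pareto-like workload
print(weighted_l2_error(values, boundaries, uniform))
print(weighted_l2_error(values, boundaries, skewed))
```

A histogram construction algorithm can then choose bucket boundaries that minimize this weighted error instead of the plain L2 error, concentrating accuracy where the workload actually lands.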
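The role of the default value can likewise be sketched with the Haar wavelet, the transform typically used for wavelet synopses. The sketch below keeps the k largest coefficients by plain magnitude (a simplification; level-normalized magnitudes are the optimal choice under L2) and assigns one configurable default to all dropped coefficients. It illustrates the knob the abstract refers to, not the proposed calculation methods.

```python
# Sketch: Haar wavelet synopsis over a signal whose length is a power of two.
# Keeps the k largest coefficients and replaces every dropped coefficient
# with a common `default` instead of the classic zero. Illustrative only.

def haar_transform(data):
    coeffs = []
    while len(data) > 1:
        avgs = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
        diffs = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]
        coeffs = diffs + coeffs          # prepend: coarsest details end up first
        data = avgs
    return data + coeffs                 # [overall average] + detail coefficients

def haar_inverse(coeffs):
    data = coeffs[:1]                    # start from the overall average
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(data)]
        data = [x for a, d in zip(data, diffs) for x in (a + d, a - d)]
        pos += len(data) // 2
    return data

def build_synopsis(data, k, default=0.0):
    coeffs = haar_transform(data)
    # Rank by plain magnitude (simplified ranking, see the lead-in above).
    keep = set(sorted(range(len(coeffs)),
                      key=lambda i: abs(coeffs[i]), reverse=True)[:k])
    kept = [c if i in keep else default for i, c in enumerate(coeffs)]
    return haar_inverse(kept)

data = [2, 2, 0, 2, 3, 5, 4, 4]
print(build_synopsis(data, k=3))                # classic zero defaults
print(build_synopsis(data, k=3, default=0.5))   # nonzero default for dropped ones
```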
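A minimal sketch of a distance in such a hierarchical space follows, assuming, purely for illustration, that the distance between two attribute values is the number of generalization steps needed before they meet in the concept hierarchy, and that tuples are compared attribute by attribute. The toy hierarchy and all names are assumptions.

```python
# Sketch: a tuple metric induced by attribute concept hierarchies.
# Two values are close if they meet low in the hierarchy (e.g. two cities in
# the same country) and far apart if they only meet near the root.
# The hierarchy and the distance definition are illustrative assumptions.

parent = {  # toy concept hierarchy for a "location" attribute
    "Beijing": "China", "Shanghai": "China", "Boston": "USA",
    "China": "Asia", "USA": "America", "Asia": "World", "America": "World",
}

def ancestors(value):
    chain = [value]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain  # value, its parent, ..., root

def value_distance(a, b):
    """Generalization steps until a and b fall under a common concept."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    common = next(x for x in chain_a if x in chain_b)
    return max(chain_a.index(common), chain_b.index(common))

def tuple_distance(t1, t2):
    # Attribute-wise sum; every attribute shares the same toy hierarchy
    # here purely to keep the sketch short.
    return sum(value_distance(a, b) for a, b in zip(t1, t2))

print(value_distance("Beijing", "Shanghai"))   # 1 -- both generalize to China
print(value_distance("Beijing", "Boston"))     # 3 -- they only meet at World
```

With such a metric, a hierarchical agglomerative clustering can repeatedly merge the closest tuple groups until the summary reaches the requested size, which matches the first of the two proposed algorithms in spirit.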
Graph summarization allocates the nodes of a raw graph to groups and establishes relationships among the groups according to the raw edges; the newly built graph is called a graph summary. Users can set the scale of the graph summary and use it to obtain information about the raw graph. However, classic methods can only build graph summaries whose sizes lie above a certain lower bound, which is unacceptable in many applications. A novel graph summarization method is introduced that removes this scale limit. Using the concept hierarchies of the nodes' attributes, the new method groups nodes in a flexible way: it groups nodes not only with identical values but also with similar ones. Besides the information loss on the edges, it also considers the information loss on the nodes during summarization and models summarization as a multi-objective optimization problem. Two hierarchical agglomerative algorithms are proposed: one based on the forbearing stratified sequencing method and the other based on a unified objective function.

When the raw graph is very large, the usual clustering-based graph summarization methods are unsuitable because of their high time cost, so a special method is needed to construct summaries of graphs with massive numbers of nodes. A new method is introduced to solve this problem: it adjusts the spatial resolution to find a small view of the graph and then constructs the summary graph from that view. Because the resolution adjustment strongly affects the quality of the final summary, two adjustment strategies are introduced. The method is shown to summarize graphs with massive numbers of nodes effectively.
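To make the grouping-and-error idea concrete, here is a small sketch that builds super-edges from a given node grouping and counts the edge corrections the summary implies. The density-threshold rule and all names are illustrative assumptions rather than the dissertation's exact model.

```python
# Sketch: build a graph summary from a fixed node grouping and measure the
# edge-side information loss as the number of "corrections" the summary needs.
# A super-edge is kept when more than half of the possible node pairs between
# two groups are real edges; this MDL-style rule is an illustrative assumption.

from itertools import combinations, combinations_with_replacement

def summarize(nodes_by_group, edges):
    edge_set = {frozenset(e) for e in edges}
    super_edges, corrections = set(), 0
    for g1, g2 in combinations_with_replacement(sorted(nodes_by_group), 2):
        if g1 == g2:  # pairs inside one group (a self-loop on the super-node)
            pairs = list(combinations(nodes_by_group[g1], 2))
        else:
            pairs = [(u, v) for u in nodes_by_group[g1]
                            for v in nodes_by_group[g2]]
        if not pairs:
            continue
        present = sum(frozenset(p) in edge_set for p in pairs)
        if 2 * present > len(pairs):
            super_edges.add((g1, g2))
            corrections += len(pairs) - present  # edges the super-edge wrongly implies
        else:
            corrections += present               # real edges the summary drops
    return super_edges, corrections

groups = {"A": ["a1", "a2"], "B": ["b1", "b2", "b3"]}
edges = [("a1", "a2"), ("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b3")]
print(summarize(groups, edges))   # super-edges {('A','A'), ('A','B')}, 2 corrections
```

The node-side information loss introduced above would act as a second objective alongside `corrections`, turning the search for a good grouping into the multi-objective problem the abstract describes.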
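The resolution-adjustment idea can be sketched as choosing, along the nodes' concept hierarchy, the finest generalization level whose grouping still fits the user's scale budget, and summarizing that coarse view. `label_at_level`, the budget parameter, and the bounded level search are all illustrative assumptions.

```python
# Sketch: resolution adjustment over a node-attribute concept hierarchy.
# Level 0 is the root (coarsest view); each extra level refines the grouping.
# We keep the finest level whose group count fits the user's scale budget.
# Names and the bounded search are illustrative assumptions.

def label_at_level(value, parent, level):
    chain = [value]                       # value, parent, ..., root
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain[max(0, len(chain) - 1 - level)]

def grouping_for_budget(node_attr, parent, budget, max_level=32):
    best = None
    for level in range(max_level + 1):
        groups = {}
        for node, value in node_attr.items():
            groups.setdefault(label_at_level(value, parent, level), []).append(node)
        if len(groups) > budget:
            break                         # too fine a view; keep the previous level
        best = groups
    return best

parent = {"Beijing": "China", "Shanghai": "China", "Boston": "USA",
          "China": "Asia", "USA": "America", "Asia": "World", "America": "World"}
node_attr = {"n1": "Beijing", "n2": "Shanghai", "n3": "Boston"}
print(grouping_for_budget(node_attr, parent, budget=2))
# {'China': ['n1', 'n2'], 'USA': ['n3']} -- the finest grouping within the budget
```

The returned groups can then be handed to a summary construction such as the `summarize` sketch above, so the expensive step only ever sees the small coarse view rather than the massive raw graph.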
Keywords/Search Tags: Data characterization, Histogram, Wavelet synopsis, Semantic table summarization, Semantic graph summarization