
Research On File Semantic Analysis In Large-Scale File System

Posted on: 2012-07-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Xia
GTID: 1118330368984112
Subject: Computer system architecture
Abstract/Summary:
File semantics is the study of meaning that can be used to infer the behavior of file system users. It has become increasingly important in both the engineering and research communities concerned with file system design and implementation. Compared with block semantics, which manifest only as common access locality (temporal or spatial) among data blocks, the file system can provide more useful and insightful semantic information because of the rich I/O interfaces between upper-layer applications and file systems. Unfortunately, it is challenging to explore semantic knowledge in file systems effectively and accurately, because a variety of factors affect the exploration process. Examples of such factors include user and program behavior, storage organization, and file data types. Worse, the intricate interdependency among these factors makes it difficult to fully exploit the potentially important correlations among different kinds of semantic knowledge, which in turn may reveal more accurate file correlations.

This article proposes an approach to measuring inter-file relationships, called FARMER. In this approach, each file is represented as a vector in a multivariate vector space, and each component of the vector corresponds to a separate factor of the file. The choice of factors depends on the application; typical factors are file name, creator, and executing program. If a particular factor occurs in both files, its value is non-zero. We believe that the strength of an inter-file relationship can be measured by the similarity of the factor values in the two semantic vectors. Thanks to this factor vector model, FARMER represents files as structured vectors of identifiers, and basic vector operations can be used to quantify the correlation between two file vectors. The FARMER file correlation evaluation model plays a fundamental role in the quantitative analysis of file semantics.
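The factor vector idea can be illustrated with a minimal sketch. The abstract does not state which vector operation FARMER uses, so cosine similarity is an assumption here, as are the function names, the binary weighting, and the sample files; none of them come from the dissertation.

```python
import math


def semantic_vector(factors):
    """Encode a file's factors as a sparse binary vector.

    Each (factor, value) pair becomes one dimension, so two files share a
    non-zero component only when the same factor has the same value in both.
    The binary weight 1.0 is an assumption for illustration.
    """
    return {(name, value): 1.0 for name, value in factors.items()}


def correlation(vec_a, vec_b):
    """Cosine similarity of two sparse vectors (assumed measure, not FARMER's)."""
    dot = sum(w * vec_b.get(key, 0.0) for key, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical files described by the three factors named in the abstract.
tex = semantic_vector({"name": "report.tex", "creator": "alice", "program": "latex"})
bib = semantic_vector({"name": "report.bib", "creator": "alice", "program": "latex"})
src = semantic_vector({"name": "kernel.c", "creator": "bob", "program": "gcc"})

print(correlation(tex, bib))  # two of three factors match
print(correlation(tex, src))  # no factor matches
```

In this sketch every factor contributes equally; a weighted variant would replace the constant 1.0 with a per-factor coefficient.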
Experimental results show that FARMER evaluates file correlations accurately and effectively.

This article also proposes a file correlation regression analysis model, called CoMiner, which models the relationship between file system phenomena of interest and various influencing factors based on observed sample data. CoMiner regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. The model is also aimed at fitting the functional relationship between two file system variables using non-spline and spline regression. CoMiner provides flexible and scalable combinations of regression models, so that accurate forecasts can be made at a reasonable algorithmic cost. More significantly, by integrating the strength coefficients estimated by CoMiner into the FARMER model, the prefetching algorithm is shown to reduce metadata latency by approximately 20% compared with a state-of-the-art metadata prefetching algorithm and a commonly used replacement policy.

Further, this article proposes a time series analysis model, called TiMiner, to identify patterns in file system time series data. Experience from studying file systems shows that file system activity can expose internal structure as time passes. It is therefore necessary to study file system semantics in a time-dependent fashion. Based on an analysis of practical file system activities, five characteristics of file system time series are identified, and corresponding methods for studying them are proposed. The five characteristics are trend, seasonality, outliers, heteroscedasticity, and non-linearity.
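Two of these characteristics, trend and seasonality, can be made concrete with a small sketch. It is not TiMiner itself: the moving-average and per-position-mean estimators, the function names, and the request counts below are all assumptions chosen for illustration.

```python
def moving_average(series, window):
    """Trailing moving average as a simple trend estimate.

    The first window - 1 points have no estimate, so the result is shorter
    than the input by window - 1 entries.
    """
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def seasonal_profile(series, period):
    """Mean value at each position within the period (e.g. each hour of a day).

    Assumes len(series) >= period so that every position has a sample.
    """
    buckets = [[] for _ in range(period)]
    for i, value in enumerate(series):
        buckets[i % period].append(value)
    return [sum(bucket) / len(bucket) for bucket in buckets]


# Hypothetical request counts per interval: a repeating shape of length 4
# that drifts upward by 2 requests every cycle.
requests = [10, 20, 30, 20, 12, 22, 32, 22, 14, 24, 34, 24]

trend = moving_average(requests, window=4)    # slowly rising level
profile = seasonal_profile(requests, period=4)  # average shape of one cycle
print(trend)
print(profile)
```

Choosing the window equal to the period averages out the repeating shape, which is why the trend estimate rises smoothly in this example.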
The analysis shows that the file system cache hit ratio can be separated into three parts: the first is the autocorrelated status in the past, the second depends on the distribution of incoming requests within the time interval, and the last is the time series difference. Experimental results show that the TiMiner model can be calibrated closely against historical data and produces reasonably good forecasts.

To demonstrate the capabilities of FARMER, CoMiner, and TiMiner, we incorporate these models into Cappella, our large-scale object-based distributed storage system, to validate our design goal that all of them are powerful tools for inferring complex file correlations. Experimental results show that FARMER/CoMiner can accurately mine file correlations. More significantly, the CoMiner-enabled prefetching algorithm is shown to improve the metadata server cache hit ratio by approximately 10-45% compared with several metadata prefetching algorithms and a commonly used replacement policy. The CoMiner-enabled data layout algorithm is shown to improve object storage device throughput by about 5% compared with our earlier state-of-the-art algorithm.
Keywords/Search Tags: File System Semantic, File Correlation, File Correlation Regression Analysis, File System Time Series Analysis