Research On Data Fragment Type Classification Based On Machine Learning

Posted on: 2016-09-16 | Degree: Master | Type: Thesis
Country: China | Candidate: J L Wang | Full Text: PDF
GTID: 2308330467482328 | Subject: Computer software and theory
Abstract/Summary:
In digital forensics, intrusion detection, and reverse engineering, analysts often encounter unknown data or file fragments. Identifying the data type of such data, or the file type of such fragments, is a critical problem. Existing methods are coarse grained, which results in low accuracy, especially for fragments of compound file types. We therefore propose a data fragment classification method based on machine learning.

First, we analyze and summarize existing data fragment classification methods and discuss their advantages and disadvantages. These include methods based on similarity measurement, methods based on machine learning, and methods that use image classification techniques. All of them try to infer the file type of a fragment, which is a coarse-grained approach. Their classification results are poor, especially for fragments of compound file types.

Second, to address the coarse granularity, we give a more precise definition of the fragment classification problem and propose a machine-learning-based data fragment classification method that uses the data type. The method first identifies commonly used data types and builds a data set by analyzing common file formats. The data set is then divided into two parts, a training set and a testing set, and fragment features are extracted. Next, a machine learning algorithm builds a classifier from the training set. Finally, the testing set is fed into the classifier and its performance is measured. In our experiments, we test several algorithms and compare their performance. Compared with methods that use the file type, our approach achieves a 20% improvement.

Last, we apply the proposed method to PPT fragments. PPT is a compound file format, so file-type methods classify its fragments with low accuracy. We parse the file format, identify its common data types, and determine why its fragments are difficult to identify. In our approach, we classify PPT fragments by data type rather than file type. The experimental results show that the method is promising, with a 52% improvement in accuracy.

In summary, we address the fragment classification problem by using data type rather than file type, together with machine learning algorithms. Our approach not only improves classification accuracy but also makes the classification finer grained.
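
The pipeline described in the abstract (build a labeled data set, split it, extract fragment features, train a classifier, measure it on the testing set) can be sketched roughly as follows. The abstract does not state which features or learning algorithms the thesis uses, so everything specific here is an assumption made for illustration: the byte-histogram and entropy features, the nearest-centroid classifier, the 512-byte fragment size, the 70/30 split, and the two synthetic data types `text` and `compressed`. It is not the author's implementation.

```python
import math
import random
from collections import Counter

FRAGMENT_SIZE = 512  # assumed fragment length in bytes (not from the thesis)


def extract_features(fragment):
    """Return 257 features: normalized byte histogram plus scaled Shannon entropy."""
    counts = Counter(fragment)
    n = len(fragment)
    hist = [counts.get(b, 0) / n for b in range(256)]
    entropy = -sum(p * math.log2(p) for p in hist if p > 0)
    return hist + [entropy / 8.0]  # 8 bits is the maximum entropy per byte


def make_fragment(data_type, rng):
    """Generate a synthetic fragment; a stand-in for fragments cut from real files."""
    if data_type == "text":
        alphabet = b"abcdefghijklmnopqrstuvwxyz ,.\n"
        return bytes(rng.choice(alphabet) for _ in range(FRAGMENT_SIZE))
    if data_type == "compressed":
        return bytes(rng.randrange(256) for _ in range(FRAGMENT_SIZE))
    raise ValueError("unknown data type: " + data_type)


def train(samples):
    """Build one centroid (mean feature vector) per data type label."""
    sums = {}
    counts = Counter()
    for feats, label in samples:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, feats):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(feats, centroid))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def run_experiment(seed=0, per_type=50, train_ratio=0.7):
    """Build the data set, split it, train, and return accuracy on the testing set."""
    rng = random.Random(seed)
    data = [(extract_features(make_fragment(t, rng)), t)
            for t in ("text", "compressed") for _ in range(per_type)]
    rng.shuffle(data)
    split = int(len(data) * train_ratio)
    train_set, test_set = data[:split], data[split:]
    centroids = train(train_set)
    correct = sum(predict(centroids, f) == label for f, label in test_set)
    return correct / len(test_set)


if __name__ == "__main__":
    print("test accuracy:", run_experiment())
```

The labels in this sketch are data types rather than file types, which mirrors the abstract's central idea: a fragment from a compound file such as PPT would be labeled by the kind of data it contains, not by the container format.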
Keywords/Search Tags:Data Fragment, Data Type, File Type, Machine Learning, PPT Fragment