Research On Onion Address Collection And Hidden Service Content Classification

Posted on: 2022-01-28  Degree: Master  Type: Thesis
Country: China  Candidate: S R Ying  Full Text: PDF
GTID: 2518306605466964  Subject: Master of Engineering
Abstract/Summary:
Tor is one of the most active anonymous networks and has the largest number of users. The hidden services provided by Tor offer a strong anonymous communication mechanism, which has made them a carrier for criminal activity and a threat to network security and social stability. Analysis of Tor hidden services is therefore both important and practical. Because the onion addresses needed to access hidden services are not exposed on a large scale, analyzing hidden services first requires harvesting a large number of onion addresses. To address the low efficiency of onion address collection, a Docker-based harvesting framework is proposed. To address problems such as weakly representative features and the influence of interfering words, a feature vector extraction method is put forward that supports multi-class classification of hidden service content. The specific work includes the following aspects:

1. To address the long duration and low efficiency of hidden service collection, Docker virtualization is used to extend the framework for collecting onion addresses and hidden services. Tasks are allocated to the Docker containers based on two factors in order to narrow the gap in data collection time, which further shortens the time needed to collect onion addresses and hidden services. Compared with the traditional multithreading method, the proposed two-factor task allocation method reduces collection time by 61.9%.

2. Considering the characteristics of hidden service content, a method for extracting hidden service feature vectors is proposed. First, feature words are collected for each category. Representative words are then extracted from the content and title of each hidden service according to their TF-IDF weight or word frequency. The representative words are processed against the feature word sets so that only words strongly related to a category are retained. Useless words are thereby eliminated, yielding more representative feature vectors. To make use of the semantic context of words, a word embedding model is used to represent the words and, from them, the hidden service content, producing feature vectors that carry semantic information. Feature fusion is then used to combine the feature vectors extracted by the different methods into more representative feature vectors. In this way the characterization and vectorization of hidden service content are realized.

3. A hidden service classification method based on machine learning is proposed. The proposed feature vector generation methods are applied to the processed hidden service content and titles to form different feature vectors. Serial feature fusion then combines the feature vectors obtained by the different methods into different training sets. The training sets are used as input to machine learning algorithms to train classifiers, and the parameters of the better-performing classifiers are tuned further using grid search. Finally, ensemble learning is used to fuse the better-performing classifiers in order to further improve classification accuracy and generalization ability. The accuracy of the fused classifier reaches 96.8%.
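The feature extraction steps described in the abstract (TF-IDF weighting, filtering representative words against per-category feature word sets, and serial feature fusion) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names, the category names, the toy documents, and the choice to count matches per category are all assumptions made for illustration, and serial fusion is interpreted here as plain concatenation of feature vectors.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def representative_words(word_weights, top_k):
    """Pick the top_k words by weight (ties broken alphabetically)."""
    ranked = sorted(word_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


def category_vector(words, feature_words):
    """Count representative words per category (assumed encoding).

    Words that appear in no category's feature word set are ignored,
    which is how interfering words are dropped in this sketch.
    """
    return [
        sum(1 for word in words if word in feature_words[category])
        for category in sorted(feature_words)
    ]


def serial_fusion(*vectors):
    """Serial fusion, taken here to mean concatenating the vectors."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


if __name__ == "__main__":
    # Hypothetical categories and feature words, for illustration only.
    feature_words = {
        "drugs": {"cannabis", "pills"},
        "market": {"escrow", "vendor"},
    }
    docs = [
        ["welcome", "cannabis", "pills", "escrow", "shop"],
        ["welcome", "forum", "vendor", "reviews", "shop"],
    ]
    weights = tf_idf(docs)
    words = representative_words(weights[0], top_k=3)
    keyword_vec = category_vector(words, feature_words)
    # Placeholder for a word-embedding-based vector (values are made up).
    embedding_vec = [0.12, -0.40, 0.33]
    print(serial_fusion(keyword_vec, embedding_vec))
```

The fused vector would then serve as one row of a training set for the classifiers. Using a different combination of extraction methods before fusion yields a different training set, as the abstract describes.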
Keywords/Search Tags: The Second Onion Router, Collection of onion addresses, Analysis of hidden service content, Classification of content, Machine learning