Research On Text Classification Of Hidden Services

Posted on: 2020-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: S Y He
Full Text: PDF
GTID: 2428330578954779
Subject: Information security
Abstract/Summary:
As network users demand stronger privacy, anonymous communication technology and the hidden service mechanism (also known as the dark web) have developed rapidly. However, the strong anonymity and hard-to-track design of Tor hidden services also provide shelter for illegal activities, and the illegal trading websites emerging there pose a serious threat to cyberspace security. Research on classifying illegal activities in hidden services is therefore of great significance for preventing and combating crime. Because hidden service URLs are not publicly released and the content of illegal websites changes frequently, large-scale data collection and labeling are difficult. As a result, existing research on classifying illegal hidden service content is limited by small data sets, few target categories, and difficulty in classifying new illegal activities. To address these problems, this thesis proposes a classification method for illegal activities based on legal regulation, which uses the relevant laws and regulations to determine whether a hidden service is illegal. By combining TF-IDF term weighting with machine learning classification algorithms, the method can effectively classify illegal web pages in hidden services. The primary work is summarized as follows:

(1) Based on a study of how Tor hidden services are generated and accessed, this thesis designs a tool to discover and collect Tor hidden service addresses and website content. Built on the Scrapy crawler framework, the tool implements two functions: crawling Tor hidden service directory sites and crawling Internet search engine results for keyword queries. The collected data are then used to construct a data set of illegal activities in Tor hidden services to support the subsequent classification studies.

(2) This thesis proposes a classification method based on legal regulation, which uses legal texts as the basis for classifying illegal activities in hidden services. The focus of the method is the extraction and construction of legal training samples. The thesis retrieves the laws applicable to each target category from the HeinOnline database and then uses the FindLaw terminology database to generate a list of legal-specific stop words for filtering out interfering information. The TF-IDF algorithm is used to extract feature words, and a small-scale classification test on the data set collected in this thesis demonstrates the feasibility of the method.

(3) In the implementation phase, this thesis proposes a feature weighting algorithm based on TF-IDF. To address the limitations of TF-IDF in web page text classification, the thesis introduces a feature weight coefficient based on HTML tags, reflecting the structural characteristics of illegal web pages in hidden services, to improve how well feature words discriminate between categories. The legal training samples and the illegal hidden service test samples are each represented as vector space models. Eight machine learning classification algorithms were selected for the training and classification experiments, with the Bayesian classifier performing best. The experimental results show that the classification method based on legal regulation achieves 93.5% classification accuracy using TF-IDF feature weighting and a Bayesian classifier, while the improved coTF-IDF algorithm improves the accuracy by 2.6%. In comparison experiments against traditional methods on the DUTA dataset, the method achieves comparable classification accuracy using a small and easily accessible legal training set. The method does not rely on large-scale hidden service training data, and for new activities that have not yet become widespread, it can still classify effectively as long as supporting legal materials are available.
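
The HTML-tag-based weighting described in (3) can be illustrated with a minimal sketch. It is not the thesis's coTF-IDF algorithm: the abstract gives neither the tag coefficients nor the exact formula, so the weights in `TAG_WEIGHTS`, the smoothed IDF, the toy legal texts, and the sample page below are all assumptions made for illustration. Cosine similarity stands in for the Bayesian classifier to keep the example short.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag coefficients; the abstract does not state the real values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Count terms, scaling each occurrence by the heaviest enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.counts = Counter()  # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                     default=DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.counts[term] += weight


def weighted_tf(text):
    """Return tag-weighted term counts for an HTML page or a plain text."""
    parser = WeightedTermCounter()
    parser.feed(text)
    parser.close()
    return parser.counts


def tfidf(term_counts, corpus):
    """Weight term counts by a smoothed IDF computed over `corpus`."""
    n = len(corpus)
    total = sum(term_counts.values()) or 1.0
    vec = {}
    for term, count in term_counts.items():
        df = sum(1 for doc in corpus if term in doc)
        vec[term] = (count / total) * (math.log((1 + n) / (1 + df)) + 1.0)
    return vec


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy "legal training samples" (invented text, one per target category).
legal_texts = {
    "drugs": "unlawful distribution of a controlled substance or narcotics",
    "counterfeit": "unlawful making of counterfeit currency or forged notes",
}
# Toy hidden service page (invented).
page = ("<html><head><title>Buy narcotics</title></head>"
        "<body><p>fast shipping, controlled substance</p></body></html>")

legal_counts = {label: weighted_tf(text) for label, text in legal_texts.items()}
corpus = list(legal_counts.values())
legal_vecs = {label: tfidf(c, corpus) for label, c in legal_counts.items()}

page_vec = tfidf(weighted_tf(page), corpus)
scores = {label: cosine(page_vec, vec) for label, vec in legal_vecs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In this sketch a term inside `<title>` counts three times as much as the same term in body text, so title words dominate the page vector. The actual coefficients would have to be tuned on real hidden service pages.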
Keywords/Search Tags: Hidden services, Tor, Text classification, Law, Term weighting