| With the rapid development of network communication technology, the Internet has an ever greater impact on real life. The Internet is generally considered to consist of the surface web and the deep web. The darknet is an important part of the deep web and is full of sensitive content, such as drugs, guns and riots. Hidden services are the main carrier of sensitive information and illegal content in the darknet, so discovering and analyzing their content is an important and necessary task of Internet supervision. It is therefore of great significance to conduct research on hidden service content discovery and analysis technology. The Tor anonymous network is one of the largest darknets and the main research object in the field of anonymous communication and the darknet. However, in terms of Tor content discovery, most current work relies on a single hidden service address discovery method, which fails to achieve comprehensive content acquisition. In terms of Tor content analysis, existing results are limited to content classification at site granularity and product granularity, which is far from enough to support supervision of the darknet. In-depth mining of sensitive content, including sensitive word discovery and event mining, is therefore urgently needed. In view of these problems, this thesis studies Tor content discovery and analysis technology in detail from the following aspects: 1) We study Tor content discovery technology with hidden service address discovery and content acquisition as the core. Through in-depth analysis of the Tor hidden service protocol and source code for versions v2 and v3, we design a method that collects v2 hidden service domain names by deploying intrusion nodes, and we analyze why this method is not applicable to v3 hidden services. In addition, this thesis uses search engines to collect hidden service addresses from the surface web, and obtains addresses in the Tor
darknet by extracting the hyperlinks in hidden service pages. Through these three methods, this thesis obtains a large number of hidden service addresses and their content, and carries out a statistical analysis of the characteristics of Tor hidden services, such as languages and site sizes. 2) We study darknet content analysis technology with content classification, sensitive word discovery and sensitive event analysis as the core. First, we study hidden service web data classification at single-page granularity. The web data is preprocessed by data cleaning, deduplication and text vectorization, and a machine learning classifier is then trained to classify web content automatically. Building on the content classification, we study sensitive word discovery for sensitive content of a particular category; a method based on similarity calculation and search engine verification obtains good results. Finally, using the sensitive content and sensitive words, we apply event detection, event argument extraction and event clustering to analyze sensitive events. 3) Combining Tor content discovery technology with analysis technology, we design and implement a Tor content discovery and analysis system. The system includes a Tor hidden service address discovery module, a Tor hidden service content acquisition module, a proxy module, a content classification module, a sensitive word discovery module, a sensitive event analysis module, a storage module and a display module. It can discover, obtain and analyze Tor content automatically. In summary, this thesis researches and implements Tor content discovery and analysis technology and develops a prototype system. This can help relevant departments supervise and keep track of sensitive content in the Tor darknet, and provides technical support for tracking crimes in the darknet. |
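
The link-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `extract_onion_addresses` and the regular expression are assumptions made for illustration. It relies only on the address format, in which v2 addresses are 16 base32 characters and v3 addresses are 56 base32 characters followed by `.onion`.

```python
import re

# Onion labels use the base32 alphabet (a-z, 2-7).
# v2 labels are 16 characters long, v3 labels are 56 characters long.
# The lookbehind rejects labels that are the tail of a longer base32 run.
ONION_RE = re.compile(r"(?<![a-z2-7])([a-z2-7]{56}|[a-z2-7]{16})\.onion\b")


def extract_onion_addresses(text):
    """Map each onion address found in text to its version ('v2' or 'v3').

    Hypothetical helper for illustration: it scans raw page text or HTML,
    so it finds addresses both in href attributes and in visible text.
    """
    found = {}
    for match in ONION_RE.finditer(text.lower()):
        label = match.group(1)
        version = "v3" if len(label) == 56 else "v2"
        found[label + ".onion"] = version
    return found


if __name__ == "__main__":
    # Placeholder labels, not real hidden services.
    sample = '<a href="http://{}.onion/">link</a> and {}.onion'.format(
        "a" * 16, "b" * 56
    )
    for address, version in extract_onion_addresses(sample).items():
        print(version, address)
```

A pattern check like this only validates length and alphabet. It does not confirm that a v3 address carries a valid checksum or that the service is reachable, so a real crawler would still need to verify candidates over Tor.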