Design Of Web Sensitive Word Filtering System Based On Decision Tree

Posted on: 2019-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: W Li
Full Text: PDF
GTID: 2428330569486989
Subject: Computer application technology
Abstract/Summary:
Traditional webpage sensitive word filtering systems mostly work by comparing text files against a database. This approach has poor real-time performance and low filtering efficiency, and it costs network administrators considerable time and effort. This thesis instead matches and filters sensitive words in a webpage actively, before the page is uploaded to the server, and uses a decision tree to classify webpage text that contains sensitive words. The main contents of the thesis are as follows:

(1) A dictionary tree based method for matching and filtering webpage sensitive words is designed and implemented. The Beautiful Soup module in Python parses the webpage into a DOM (Document Object Model) tree, from which the text content of the page is extracted. Methods for retrieving and matching sensitive words in text are studied, and a dictionary tree based matching and filtering method is designed that improves the accuracy and recall of sensitive word filtering in webpage text.

(2) A sensitive text classifier based on a decision tree is designed. Training and test sets for sensitive text classification are constructed through text preprocessing. A vector space model of the sensitive texts is built using a Chinese word segmentation system, and the TF-IDF values of the terms in that vector space are calculated to obtain the weight matrices of the training and test sets. The C4.5 decision tree algorithm is then used to build the sensitive text classifier.

(3) Text content extraction, text preprocessing, and sensitive text classification are implemented in Python. Because webpage text contains many interfering items, the text is preprocessed: regular expressions are used to remove special characters, and Traditional Chinese is converted to Simplified Chinese. Sensitive words are added to texts of different categories to provide sensitive text training samples; feature values are then extracted, a decision tree is constructed, pruning conditions are set, and the samples are classified.
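
As an illustration only, the dictionary tree matching described in item (1) might be sketched as follows. This is not the thesis's implementation: the names TrieNode, SensitiveWordFilter, find, and mask are hypothetical, the longest-match policy and the "*" masking are assumptions, and the input is assumed to be plain text that has already been extracted from the webpage.

```python
class TrieNode:
    """One node of the dictionary tree; children are keyed by character."""

    def __init__(self):
        self.children = {}
        self.is_end = False  # True when the path from the root spells a whole word


class SensitiveWordFilter:
    """Hypothetical filter: a trie of sensitive words plus a left-to-right scan."""

    def __init__(self, words):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        """Insert one sensitive word into the trie (empty strings are ignored)."""
        if not word:
            return
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def _longest_match(self, text, start):
        """Length of the longest sensitive word beginning at `start`, or 0."""
        node = self.root
        longest = 0
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_end:
                longest = i - start + 1
        return longest

    def find(self, text):
        """Return (start_index, word) for each non-overlapping match."""
        matches = []
        i = 0
        while i < len(text):
            n = self._longest_match(text, i)
            if n:
                matches.append((i, text[i:i + n]))
                i += n  # assumption: skip past the match, so matches never overlap
            else:
                i += 1
        return matches

    def mask(self, text, mask_char="*"):
        """Replace every matched word with mask_char, one per character."""
        out = []
        i = 0
        while i < len(text):
            n = self._longest_match(text, i)
            if n:
                out.append(mask_char * n)
                i += n
            else:
                out.append(text[i])
                i += 1
        return "".join(out)


if __name__ == "__main__":
    word_filter = SensitiveWordFilter(["bad", "badge", "evil"])
    print(word_filter.find("a badge is not evil"))
    print(word_filter.mask("a badge is not evil"))
```

Because the trie is keyed by character, the same sketch works on Chinese text without word segmentation. Each scan position walks at most as many nodes as the longest stored word, so the cost per position does not grow with the number of words in the dictionary.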
Keywords/Search Tags: network security, text processing, information filtering, sensitive words, decision tree