Illegal Webpage Detection Model Based On Improved TF-IDF Algorithm

Posted on: 2022-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: K Y Li
Full Text: PDF
GTID: 2518306476990639
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid development of Internet technology, people can easily obtain massive amounts of information and resources from all kinds of web pages, and daily life and production are increasingly tied to the Internet. However, as the network expands and network technology develops, incidents involving black- and gray-market activity are also occurring constantly. One common method is to create bad webpages that induce viewers to click, thereby stealing their private information or drawing them into a scam. With millions of webpages in existence, detecting bad webpages manually is impractical. Moreover, most existing bad-website detection methods rely on body text alone and do not consider the page title, pictures, and other webpage elements. This thesis therefore proposes a bad-website detection model based on the TF-IDF algorithm that distinguishes text from different sources and assigns each source a different weight. A pornographic picture recognition module is also added to judge the pictures on the webpage.

TF-IDF is a weighting algorithm commonly used in text classification. It uses a computed weight to estimate how important a single word is to a given text within a corpus. The core idea is that a word's importance to a text is proportional to its frequency in that text and is offset by how many texts in the corpus contain it.

The bad webpage detection model based on the improved TF-IDF algorithm comprises the following tasks:

1. A pornographic picture detection model was established that uses skin color detection to determine whether a picture is pornographic. The work covered training and research on recognizing skin color pixels in a picture, a method for determining the bounding rectangle of the human body, and the criteria for judging a picture pornographic.
2. Pictures on a webpage may contain text, which is meaningful for judging the page content, so a method for recognizing text in pictures was studied. The work covered text localization and text recognition in pictures.
3. Text on a webpage appears as paragraphs, so it must be segmented and matched. The work mainly covered word segmentation methods, building stop-word and sensitive-word lexicons, and word matching methods.
4. After a paragraph is segmented and its stop words are removed, a weight is calculated for each word. Because text can appear in different places on a webpage, the weight of a word should depend on where it appears. The TF-IDF algorithm was therefore improved so that words from three sources (the title, text in pictures, and the body text of the webpage) are weighted with different calculation methods.
5. Research on the judgment method: the outputs of the sub-modules are aggregated, and the model outputs whether the target webpage is a bad webpage.

Evaluation of the model based on the improved TF-IDF algorithm shows that the recognition rate for bad webpages reaches 85.8%. In distinguishing bad webpages from normal webpages, the accuracy reaches 0.9905, the recall reaches 0.9413, and the F-score reaches 0.2413, an improvement over traditional methods for detecting bad webpages.
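
The source-dependent weighting described in task 4 can be illustrated with a minimal sketch. It is not the thesis's formula: the abstract gives neither the weights nor the modified calculation, so the names `SOURCE_WEIGHTS` and `weighted_tf_idf`, the weight values, and the smoothed IDF variant are all assumptions chosen for illustration. The idea shown is only that a word counted in the title contributes more to its term frequency than the same word counted in the body text.

```python
import math
from collections import Counter

# Hypothetical per-source weights; the abstract does not state the real values.
SOURCE_WEIGHTS = {"title": 3.0, "image_text": 2.0, "body": 1.0}


def idf(term, corpus):
    """Smoothed inverse document frequency over a list of token lists."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + containing)) + 1.0


def weighted_tf_idf(page, corpus):
    """Score each term of a page, scaling term counts by source weight.

    `page` maps a source name ("title", "image_text", "body") to a token list.
    """
    weighted_counts = Counter()
    total = 0.0
    for source, tokens in page.items():
        weight = SOURCE_WEIGHTS[source]
        for token in tokens:
            weighted_counts[token] += weight
            total += weight
    if total == 0:
        return {}
    # Weighted term frequency times IDF.
    return {t: (c / total) * idf(t, corpus) for t, c in weighted_counts.items()}


# Toy data: already segmented, stop words removed.
corpus = [
    ["news", "weather", "sports"],
    ["casino", "bonus", "news"],
    ["weather", "travel"],
]
page = {"title": ["casino"], "image_text": ["bonus"], "body": ["casino", "news"]}
scores = weighted_tf_idf(page, corpus)
```

In this toy example "casino" appears in both the title and the body, so it receives the highest score, while "news" appears only in the body and in more corpus documents, so it scores lowest. A detector along the lines of task 5 could then compare the scores of words found in the sensitive-word lexicon against a threshold, though the abstract does not specify how that aggregation is done.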
Keywords/Search Tags:TF-IDF, word segmentation matching, text recognition, image recognition