Illegal Webpage Detection Model Based On Improved TF-IDF Algorithm

Posted on:2022-05-19

Degree:Master

Type:Thesis

Country:China

Candidate:K Y Li

Full Text:PDF

GTID:2518306476990639

Subject:Signal and Information Processing

Abstract/Summary:

With the rapid development of Internet technology, people can easily obtain massive amounts of information and resources from all kinds of web pages, and daily life and production are increasingly tied to the Internet. However, as the network expands and its technology develops, incidents involving black- and gray-market activity occur constantly. One common method is to create bad webpages that induce viewers to click, thereby stealing their private information or drawing them into a scam. Because there are millions of webpages, detecting bad ones manually is difficult. Moreover, most existing bad-website detection methods rely on the body text alone and do not consider the page title, pictures, and other webpage elements. Therefore, this thesis proposes a bad-website detection model based on the TF-IDF algorithm that distinguishes text from different sources and assigns it different weights. A pornographic picture recognition module is also added to judge the pictures on the webpage.

The TF-IDF algorithm is a weighting algorithm commonly used in text classification. It uses a computed weight to estimate how important a single word is to a given text or corpus. The core idea is that the importance of a word to a text is proportional to its frequency in that text and inversely related to the number of texts in the corpus that contain it.

The bad webpage detection model based on the improved TF-IDF algorithm is divided into the following tasks:

1. A pornographic picture detection model was established that uses skin-color detection to determine whether a picture is pornographic. The work covered training and research on recognizing skin-color pixels in a picture, a method for determining the human-body rectangle, and the criteria for judging pornographic pictures.
2. Pictures on a webpage may contain text, and this text is meaningful for judging the webpage's content, so a method for recognizing text in pictures was studied. The work covered text localization and text recognition in pictures.
3. Text on a webpage appears in paragraph form, so it must be segmented and matched. The work mainly covered word segmentation methods, the construction of stop-word and sensitive-word lexicons, and word matching methods.
4. After the text is segmented and the stop words are removed, the weight of each word must be calculated. Because text can appear in different places on a webpage, the weight of a word should depend on where it appears. The TF-IDF algorithm was therefore improved so that words from three sources (the title, the text in pictures, and the body text of the webpage) are weighted by different calculation methods.
5. Judgment method: the outputs of the sub-modules are aggregated, and the model outputs a judgment of whether the target webpage is a bad webpage.

Evaluation of the model based on the improved TF-IDF algorithm shows that the recognition rate for bad webpages reaches 85.8%. In distinguishing bad webpages from normal webpages, the accuracy reaches 0.9905, the recall reaches 0.9413, and the F-score reaches 0.2413, an improvement in performance over traditional methods for detecting bad webpages.
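
The source-dependent weighting described in task 4 can be illustrated with a minimal Python sketch. This is not the thesis's actual formula: the abstract says only that title text, picture text, and body text are weighted by different calculation methods, so the sketch simplifies that to one fixed multiplier per source. The multiplier values, the names `SOURCE_WEIGHTS` and `weighted_tfidf`, and the smoothed IDF variant are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical per-source multipliers; the abstract does not give real values.
SOURCE_WEIGHTS = {"title": 3.0, "image_text": 2.0, "body": 1.0}


def idf(term, corpus):
    """Smoothed inverse document frequency over a corpus of token sets."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0


def weighted_tfidf(page, corpus):
    """Score each term of a page.

    page maps a source name ("title", "image_text", "body") to a list of
    tokens that have already been segmented and stripped of stop words.
    Each occurrence counts as much as its source multiplier says.
    """
    weighted_counts = Counter()
    total = 0.0
    for source, tokens in page.items():
        weight = SOURCE_WEIGHTS[source]
        for token in tokens:
            weighted_counts[token] += weight
            total += weight
    if total == 0:
        return {}
    return {
        term: (count / total) * idf(term, corpus)
        for term, count in weighted_counts.items()
    }


# Toy corpus and page, invented for this example.
corpus = [{"casino", "bonus", "win"}, {"news", "weather"}, {"news", "sports", "win"}]
page = {
    "title": ["casino", "bonus"],
    "image_text": ["win"],
    "body": ["news", "casino", "win"],
}
scores = weighted_tfidf(page, corpus)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In this toy example a word that appears in the title outranks an equally rare word that appears only in the body, which is the behavior the improved algorithm is described as aiming for.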
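
The reported evaluation figures can be cross-checked with the standard F-measure formula. The sketch below assumes that the reported "accuracy" of 0.9905 plays the role of precision and that the F-score is the balanced F1 (harmonic mean of precision and recall); the abstract states neither, so both are assumptions. Under them, F1 comes out near 0.965, which differs from the stated 0.2413; the thesis may define its F-score differently, or the value may have been altered in transcription.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1 gives the harmonic mean (F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values taken from the abstract; treating 0.9905 as precision is an assumption.
precision = 0.9905
recall = 0.9413
f1 = f_beta(precision, recall)
print(f"F1 = {f1:.4f}")
```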

Keywords/Search Tags:

TF-IDF, word segmentation matching, text recognition, image recognition

Related items

1	Research And Application On Chinese Automatic Word Segmentation In Full Text Retrieval
2	Product Information Matching System Based On Image Recognition And Text Classification
3	Layout Analysis And Recognition Of Graphic And Mixed Images
4	Image/Video Text Extraction And Its Application
5	Research On Chinese Word Segmentation Algorithm Based On News Text
6	Research On Text Detection And Recognition In Complex Natural Scene Image
7	Study On Chinese Text Similarity Computing Based On Word Segmentation
8	A segmentation-free approach to text recognition with application to Arabic text
9	Video Text Extraction Technology Research And Application
10	Image Matching For Depth Recovery And Shape Recognition