
Research On Garbage Image Filtering Method Based On Image Feature And OCR

Posted on: 2018-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: S J Yuan
Full Text: PDF
GTID: 2358330512478767
Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, e-mail has become an important tool for daily communication. More and more people receive useful information through e-mail, but they also receive a great deal of advertising, pornography, fraud, Trojans, and even reactionary content. This unwanted content, which we call spam, consumes network resources, increases the risks users face, and degrades the user experience. Spam that used to be purely textual has now developed into a new type that mixes images and text. Text-based spam filtering has been studied extensively, while filtering methods for spam images remain unsatisfactory. This thesis focuses on spam image filtering.

A two-layer spam image filtering method is designed. By combining basic image features with OCR, the method improves the detection rate and reduces the false detection rate. The two layers are a feature-based filter layer and a content-based filter layer. The former is the first layer and performs coarse classification: spam images are preliminarily screened using basic image features. The latter is the second layer and performs fine classification: the spam category is determined from keywords extracted from the text in the spam image.

In the feature-based filter layer, a KNN filtering method based on confidence analysis is proposed. First, the color, gradient, and HOG features of spam images and ham (legitimate) images are analyzed. Then the KNN classification results and their confidence distributions are analyzed, and the classification results of the individual features are fused according to their confidence to reduce the error rate.

In the content-based filter layer, a dedicated method is designed to detect, segment, and recognize the text in the image, including a word segmentation method based on the Fourier transform and projection that handles tilted text in spam images. The chi-square test is then used to select keyword features from the text, in a way that reduces the probability of a low-frequency word being selected as a feature. Finally, a short text classification method based on SVM and a prior corpus is designed to further classify spam images as crime, insurance, or commodity promotion.

Experiments on a common public spam image set and a collected image set show that the two-layer spam image filtering method achieves better accuracy and a lower error rate.
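The chi-square keyword selection step mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard two-by-two chi-square statistic for a term and a target class. The abstract does not say how the thesis reduces the selection of low-frequency words, so the `min_df` document-frequency cutoff below is a hypothetical stand-in, and the function name, parameter names, and sample data are likewise assumptions made for illustration.

```python
from collections import Counter


def chi_square_scores(docs, labels, target, min_df=1):
    """Score each term by the 2x2 chi-square statistic against `target`.

    docs   : list of token lists (e.g. words recognized by OCR)
    labels : class label for each document
    target : the class whose keywords we want to rank
    min_df : hypothetical cutoff; terms found in fewer documents are skipped
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each term inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y == target else df_neg).update(set(tokens))

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # target documents containing the term
        b = df_neg[term]   # other documents containing the term
        if a + b < min_df:
            continue       # assumed way to drop low-frequency terms
        c = n_pos - a      # target documents without the term
        d = n_neg - b      # other documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


docs = [["cheap", "loan"], ["cheap", "pills"],
        ["meeting", "notes"], ["lunch", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
scores = chi_square_scores(docs, labels, target="spam", min_df=2)
top_terms = sorted(scores, key=scores.get, reverse=True)
```

The statistic measures association in either direction, so a term that appears only in ham scores as high as one that appears only in spam; a real pipeline would also check which class the term favors before treating it as a spam keyword.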
Keywords/Search Tags: spam image, feature extraction, KNN, short text classification