| In recent years, the development of the Internet has greatly promoted information sharing. Individual users can freely share their perspectives, experiences, and knowledge on the Internet, in addition to obtaining the information they need for daily life and learning. Corporations and government bodies can use data mining to extract insights from information available on the Internet and guide their decision-making. The large volume of data generated on the Internet also supports research in deep learning and multimodal studies. Alongside the many creators who publish valuable information on the Internet, there is a category of websites, termed "content farms", that rapidly generate large volumes of low-quality content using content generators or web crawlers, for purposes such as profit or the manipulation of public opinion. Despite the poor quality of their content, these websites continue to appear in search engine results because their operators employ search engine optimization techniques. To reduce the interference that this low-quality content causes for search engine users, this thesis develops a content farm filtering system that filters content farm pages out of search engine results in real time. The main contributions of this thesis are summarized as follows. 1) This thesis summarizes three common types of content farms and their characteristics, and designs the overall architecture of a content farm filtering system that analyzes and filters the results returned by search engines in real time. The two core subsystems of this system, the page content extraction system and the content farm identification system, are developed and implemented. 2) This thesis classifies pages into three types based on their content and layout features. Based on this page classification, a
composite extraction method is proposed for extracting textual information from an arbitrary page. This method first determines the page type and then applies a type-specific extraction method to extract the required information from the page. 3) Based on the classification of content farms in this thesis, a composite method capable of identifying the three common types of content farms is proposed. In this method, text similarity comparison is used to identify plagiarism content farms. Keyword density and the frequent occurrence of simple questions containing keywords are used to identify keyword-stuffing content farms. Finally, a BERT-based text classification method is used to identify low-quality machine-translated content farms. |
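As a rough illustration of two signals named in contribution 3, the following minimal Python sketch computes a keyword density and a text similarity score. It is not the thesis's implementation: the function names, the word-level tokenization, and the use of Jaccard similarity over word 3-grams as the similarity measure are all assumptions made for this example, and any decision thresholds would have to be chosen empirically.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_density(text: str, keyword: str) -> float:
    """Fraction of tokens in `text` equal to a single-word `keyword`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[keyword.lower()] / len(tokens)


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping word n-grams (shingles) in `text`."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the word n-gram sets of two texts.

    Assumed stand-in for the text similarity comparison; returns 0.0
    when either text is too short to produce a shingle.
    """
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Under these assumptions, a page with an unusually high `keyword_density` for the query term would be a candidate keyword-stuffing page, and a page whose `jaccard_similarity` to a known source page is close to 1 would be a candidate plagiarism page.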