
Research On Improved DBSCAN Web Page Text Extraction Algorithms Based On IQABC

Posted on: 2020-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: H H Hong
Full Text: PDF
GTID: 2428330572967219
Subject: Communication and Information System
Abstract/Summary:
With the development of information technology, tens of thousands of web pages are generated every day. In addition to valuable body content, these pages carry unwanted information such as advertisements and links. On the one hand, this junk content reduces the user's efficiency in obtaining useful information and degrades the reading experience. On the other hand, the useless text it contains may be used as index keys by a search engine, causing the search engine to draw wrong conclusions and return erroneous results to the user. The template method based on DOM tree parsing is a popular web page text extraction algorithm and performs the classification task well. However, because the page structure of websites changes frequently, the structure must be monitored constantly, which makes later maintenance very difficult. This thesis therefore proposes an improved DBSCAN web page text extraction algorithm based on IQABC (improved quick artificial bee colony). The main work and achievements of this thesis are as follows:

(1) A new ABC algorithm, the improved quick ABC algorithm (IQABC), is proposed. It adopts an improved roulette selection mechanism that increases population diversity while avoiding local optima. A change to the employed-bee phase exploits the best food source, balancing global and local search capabilities and accelerating convergence in the later stage. The globally optimal parameters found by IQABC are then used as the input of the DBSCAN algorithm, yielding the optimized IQABC-DBSCAN algorithm.

(2) Experiments verify that the IQABC algorithm converges faster and with better accuracy than the ABC and QABC algorithms. They also verify that text extraction based on IQABC-DBSCAN extracts the body content of web pages more accurately. In addition, a virtual word filter handles the special case of a single web page containing multiple body texts, with good extraction results.
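The two building blocks named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not specify the improved roulette mechanism or the features that are clustered, so the code shows the classic fitness-proportional roulette selection of the standard ABC onlooker phase (not IQABC) and a plain DBSCAN whose `eps` and `min_pts` are the kind of parameters an optimizer would tune. The function names and the toy 2-D points used to exercise them are hypothetical.

```python
import math
import random


def abc_fitness(cost):
    """Classic ABC fitness transform for a cost that is being minimized."""
    return 1.0 / (1.0 + cost) if cost >= 0 else 1.0 + abs(cost)


def roulette_select(fitnesses, rng=random):
    """Return an index chosen with probability proportional to its fitness."""
    threshold = rng.random() * sum(fitnesses)
    running = 0.0
    for index, fitness in enumerate(fitnesses):
        running += fitness
        if threshold < running:
            return index
    return len(fitnesses) - 1  # guard against floating-point round-off


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)  # None = not visited yet

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        stack = list(seeds)
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbours = region(j)
            if len(neighbours) >= min_pts:
                stack.extend(neighbours)  # j is a core point, so keep expanding
    return labels
```

Running `dbscan` on the same points with different `eps` values can merge everything into one cluster or mark everything as noise, which is why a global search over these parameters, as the abstract describes, matters for extraction quality.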
Keywords/Search Tags: IQABC-DBSCAN, Artificial bee colony, Global optimization, Clustering, Text extraction