In an era of information explosion, acquiring large volumes of domain- or topic-specific text from the Internet has become a necessary task. Among domain-specific sources, government announcements merit particular attention. However, because early government IT construction lacked unified planning and management, public service websites were developed independently in each region, resulting in inconsistent website architectures and backend data. Information collection applications that target these websites must therefore adapt to many different web systems. Existing work on website data collection and web page information extraction focuses mainly on static pages, whereas effective collection and extraction from dynamic pages still requires manually written collection and extraction rules, at significant cost in time and labor. To address these issues, this thesis proposes an automatic traversal method for government public service websites and a web content extraction algorithm based on heuristic rules. Building on these, an intelligent web information collection system is designed and implemented. The main contributions of this thesis are as follows:

(1) To address the inability of existing crawling systems to automate the crawling of static and dynamic web pages at the same time, this thesis proposes an automatic traversal method applicable to government public service websites. The relevant information in web pages is organized hierarchically by collection depth, with particular emphasis on the role of pagination buttons in automated collection. A pagination button dataset is constructed and features are designed for it. An intelligent positioning method based on XGBoost is proposed, which achieves a recognition accuracy of 99.85% for pagination buttons on website pages. Experimental validation demonstrates that the crawling strategy derived from this method adapts to various web systems and surpasses low-code collection tools in collection efficiency.

(2) To address the high algorithmic complexity of extracting web page lists that contain many similar nodes, a task that previously relied on both the page HTML and visual information, this thesis proposes a list extraction method based on text features and path features. The method parses the HTML source of a web page into a tree, prunes the tree based on semantic node attributes, and clusters and extracts information according to node text features and path features. Experimental validation shows that the proposed list page extraction algorithm achieves an extraction effectiveness of 97.46% on the government platform dataset, with an average extraction time of 0.021 seconds.

(3) Based on the above work, this thesis implements an intelligent web information collection system. The system adopts a B/S (browser/server) architecture and uses various technologies to ensure usability and scalability. It automates the collection and storage of information from multiple sources, with no human intervention at any stage.
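
The parse, prune, and cluster steps of the list extraction method in contribution (2) can be illustrated with a minimal sketch. It is not the thesis's algorithm: the pruned tag set, the use of the tag path as the only path feature, total text length as the only text feature, and the names `PathCollector`, `extract_list`, and `SAMPLE` are all assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: subtrees under these tags never hold list items and can be pruned.
PRUNED_TAGS = {"script", "style", "nav", "header", "footer"}
# Void elements have no closing tag, so they are kept off the path stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class PathCollector(HTMLParser):
    """Groups every non-empty text node by the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []            # open tags from the root to the current node
        self.pruned_depth = 0      # > 0 while inside a pruned subtree
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in PRUNED_TAGS:
            self.pruned_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.stack:
            return
        # Pop back to the matching open tag (tolerates unclosed children).
        while self.stack:
            popped = self.stack.pop()
            if popped in PRUNED_TAGS:
                self.pruned_depth -= 1
            if popped == tag:
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.pruned_depth == 0:
            self.groups["/".join(self.stack)].append(text)


def extract_list(html, min_items=3):
    """Return the texts of the largest group of nodes sharing one tag path."""
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    candidates = [t for t in parser.groups.values() if len(t) >= min_items]
    # Prefer more items, then more total text (a crude text feature).
    return max(candidates, key=lambda t: (len(t), sum(map(len, t))), default=[])


# Hypothetical page fragment, invented for this example.
SAMPLE = """
<html><body>
<nav><a href="/">Home</a><a href="/a">About</a><a href="/c">Contact</a></nav>
<ul class="news">
<li><a href="/n/1">Notice on road maintenance</a><span>2023-01-05</span></li>
<li><a href="/n/2">Announcement of public hearing</a><span>2023-01-04</span></li>
<li><a href="/n/3">Budget disclosure report</a><span>2023-01-03</span></li>
</ul>
<p>Contact us</p>
</body></html>
"""

if __name__ == "__main__":
    print(extract_list(SAMPLE))
```

On the sample fragment, the navigation links are pruned and the three notice titles are selected over the three dates because they carry more text. A fuller implementation would need richer path and text features than this sketch uses.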