Research And Implementation Of Intelligent Network Information Collection System For Government Open Data Websites

Posted on:2024-05-20

Degree:Master

Type:Thesis

Country:China

Candidate:J M Gao

Full Text:PDF

GTID:2556306944461404

Subject:Computer Science and Technology

Abstract/Summary:

In the era of information explosion, acquiring massive amounts of textual data on specific domains or topics from the Internet has become a necessary task. Among domain-specific information, government announcements merit particular attention. However, because early government IT construction lacked unified planning and management, government public service websites were developed independently in each region, resulting in inconsistent website architectures and backend data. Information collection applications targeting these websites therefore need to adapt to multiple web systems. Existing work on website data collection and web page information extraction focuses mainly on static pages, while effective collection and extraction from dynamic pages still requires collection and extraction rules to be written by hand, at significant cost in time and labor. To address these issues, this thesis proposes an automatic traversal method for government public service websites and a web content extraction algorithm based on heuristic rules. Building on these, it designs and implements an intelligent web information collection system. The main contributions of this thesis are as follows.

(1) To address the inability of crawling systems to automate the crawling of static and dynamic web pages at the same time, this thesis proposes an automatic traversal method applicable to government public service websites. It organizes the relevant information in web pages hierarchically by collection depth, with particular emphasis on the role of pagination buttons in automated collection. A pagination button dataset is created and features are designed for it. An intelligent positioning method based on XGBoost is proposed, achieving a recognition accuracy of 99.85% for pagination buttons on website pages. Experimental validation demonstrates that the crawling strategy derived from this method adapts to various web systems and surpasses low-code collection tools in collection efficiency.

(2) Earlier methods for extracting lists from web pages that contain a large number of similar nodes relied on both the page HTML and visual information, and had high algorithmic complexity. To address this, this thesis proposes a web page list information extraction method based on text features and path features. The method parses the HTML source of a web page into a tree, prunes the tree based on semantic node attributes, and then clusters nodes and extracts information based on node text features and path features. Experimental validation shows that the proposed list page extraction algorithm achieves an extraction effectiveness of 97.46% on the government platform dataset, with an average extraction time of 0.021 seconds.

(3) Based on the above work, this thesis implements an intelligent web information collection system. The system adopts a B/S (browser/server) architecture and uses various technologies to ensure usability and scalability. It automates the collection and storage of information from multiple sources, with no human intervention at any stage of the process.
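
The parse, prune, and cluster steps described in contribution (2) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not give its pruning rules or clustering criteria, so the set of pruned tags, the use of exact tag-path equality as the "path feature", the use of total text length as the "text feature", and all names (PathCollector, extract_list, min_items, SAMPLE_PAGE) are assumptions made for illustration only.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: subtrees under these tags are treated as non-content and pruned.
PRUNED_TAGS = {"script", "style", "nav", "footer"}
# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"br", "img", "input", "hr", "meta", "link"}


class PathCollector(HTMLParser):
    """Records (tag path, text) for each non-empty text node outside pruned subtrees."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags from the root down to the current node
        self.records = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not PRUNED_TAGS.intersection(self.stack):
            self.records.append(("/".join(self.stack), text))


def extract_list(html, min_items=3):
    """Group text nodes by tag path and return the texts of the likeliest list."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()

    clusters = defaultdict(list)
    for path, text in collector.records:
        clusters[path].append(text)

    # Path feature: a list shows up as many nodes sharing one tag path.
    candidates = [texts for texts in clusters.values() if len(texts) >= min_items]
    if not candidates:
        return []
    # Text feature (assumed heuristic): prefer the cluster carrying the most text.
    return max(candidates, key=lambda texts: sum(len(t) for t in texts))


SAMPLE_PAGE = """
<html><body>
<nav><a href="/">Home</a><a href="/about">About</a><a href="/c">Contact</a></nav>
<ul class="news">
<li><a href="/n/1">Notice on public procurement results</a><span>2024-05-01</span></li>
<li><a href="/n/2">Announcement of road maintenance schedule</a><span>2024-05-02</span></li>
<li><a href="/n/3">Notice on water supply inspection</a><span>2024-05-03</span></li>
</ul>
<script>var x = "ignored";</script>
</body></html>
"""

if __name__ == "__main__":
    for title in extract_list(SAMPLE_PAGE):
        print(title)
```

In this sketch the navigation links are dropped by pruning, the dates and titles fall into separate path clusters, and the title cluster is chosen because it carries more text. A real page would likely need fuzzier path matching than exact equality, which is where a clustering step such as the one the thesis describes would come in.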

Keywords/Search Tags:

information collection, web crawler system, crawling strategy, webpage information extraction

Related items

1	Research On Criminal Regulation Of Data Crawling
2	Research On Anti-unfair Competition Law Regulation Of Crawler’s Data Crawling Behavior
3	An Empirical Study On The Crime Of Infringing Citizens’ Personal Information By Using The Internet Web Crawling
4	Research On The Administrative Regulations Of The Application Of Web Crawler Technology
5	Discussion And Analysis On Legal Issues Of Webpage Copyright
6	Research On Information Extraction Algorithm For Legal Text
7	Design And Implementation Of Text Information Extraction And Classification Statistics System For Judgment Documents
8	Research On Information Extraction Algorithm For Judgment Document
9	The Thought Of Criminal Law On Data Crawling By Web Crawler Technology
10	Research On Criminal Law Regulation Of Web Crawler Crime