
Design And Implementation Of Web Topic Information Acquisition System

Posted on: 2012-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: L Lian
Full Text: PDF
GTID: 2218330368998220
Subject: Software engineering
Abstract: The Internet has become the fastest and most convenient way of transmitting information, and numerous websites and web pages appear on it every day. Users can find information through search engines, but they cannot easily follow the content of a specific set of websites that way, because search engines serve all Internet users and do not proactively deliver the content of particular sites to an individual user. To solve this problem, the author designs a web information collection system. With this system, an operator can easily search for the news they need and integrate it, which makes access to news and information from the Internet more targeted and convenient.
Guided by the system's information-gathering function, this paper describes the design and implementation of the web information collection system in terms of requirements analysis, theoretical research, system architecture, working principles, and function realization. The author first studies the theory of web information extraction and text mining, and proposes the design principles and system-building goals of the web information collection system. The author then analyzes the characteristics of web pages, summarizes the rules, designs techniques for extracting features from web page source code, and finally formalizes the web information collection method and presents its working principles and system design.
The system is divided into three parts: (1) page customization: collection rules are obtained by matching the websites that carry the required information against regular expression rules; (2) web information acquisition: the extraction algorithm is applied to the accessed pages, matching information is updated regularly and incrementally, and the effective information is stored in the repository; (3) web content management: an information management system for deleting, adding, querying, and modifying the collected data. Systematic testing and demonstration on information from large domestic portals show the advantages of this method and its user-friendly features, as well as its significance, scope, and application prospects.
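The three parts described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the rule format (one regular expression with named groups per site), the HTML shape it matches, the SQLite repository, and all function names are assumptions made for illustration only.

```python
import re
import sqlite3

# Page customization: a hypothetical collection rule for one site,
# written as a regular expression with named groups (assumed format).
RULE = re.compile(
    r'<li class="news">\s*<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>\s*</li>'
)


def extract_items(html, rule=RULE):
    """Web information acquisition: apply the rule to a page's source code."""
    return [(m.group("url"), m.group("title").strip()) for m in rule.finditer(html)]


def open_repository(path=":memory:"):
    """Open the repository; SQLite is an assumed storage choice."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news (url TEXT PRIMARY KEY, title TEXT NOT NULL)"
    )
    return conn


def store_new_items(conn, items):
    """Incremental update: insert only items whose URL is not stored yet."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO news (url, title) VALUES (?, ?)", items
    )
    conn.commit()
    return conn.total_changes - before  # number of rows actually added


def search_titles(conn, keyword):
    """Web content management: query stored items by a title keyword."""
    rows = conn.execute(
        "SELECT url, title FROM news WHERE title LIKE ? ORDER BY url",
        (f"%{keyword}%",),
    )
    return rows.fetchall()


def delete_item(conn, url):
    """Web content management: delete one stored item by URL."""
    conn.execute("DELETE FROM news WHERE url = ?", (url,))
    conn.commit()
```

In this sketch, running `store_new_items` on the same page twice adds nothing the second time, because the URL acts as the key; this is one simple way to realize the regular, incremental updating the abstract mentions.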
Keywords: symmetrical matching, information gathering, page reading