Research And Application On The Technology Of Web Information Extraction Based On The HTML

Posted on: 2012-12-15
Degree: Master
Type: Thesis
Country: China
Candidate: L Y Yu
Full Text: PDF
GTID: 2218330368482207
Subject: Computer software and theory
Abstract/Summary:
With advances in technology and the continued development of the Internet, the Internet has become an important medium through which people understand the world. At the same time, the information on the network is updated every day. How can news, media articles, and other information be collected in volume, accurately, and turned into structured records? Web information extraction is expected to help solve this problem, but most existing extraction techniques are semi-automated, and establishing the extraction rules requires a great deal of manual labor. This thesis studies the problem using Web news pages.

The thesis first locates the target page from a site URL and keywords. It then extracts the information on news list pages and the body text of news content pages. Drawing on ideas from artificial intelligence, it proposes improvements to some existing Web information extraction techniques. The solutions are as follows:

1. It is difficult to build a general model that navigates step by step from the home page to the target page, so a semantic similarity algorithm from text clustering is applied to this process. The similarity is calculated between the keywords that the user enters to describe the target page category and the column names in the navigation bars at each level. On this basis, a model that detects and locates the target page is built, so the process can be carried out automatically.

2. To better analyze the extraction of news list pages, the HTML page is converted into XML format and the XPath of every node is obtained. A BP neural network model is then established whose input layer takes characteristics of the news list page as its inputs. By training on samples, the optimal path to the information to be extracted is obtained, and the list page information is then extracted.

3. Starting from the XML document converted from the news page, the text density ratio of each line, a distinctive feature, is used to establish a BP neural network model. The machine learning ability of the neural network combines statistical thinking with information extraction, and the extraction rules for content pages are built on this basis.

4. A prototype information extraction system is designed and built, and some typical domestic news sites are selected to test the research, verify the extraction performance, and finally optimize the algorithms.
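
The two page features named in points 2 and 3 (the XPath of every node and the text density ratio of each line) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function names `node_xpaths` and `line_text_density`, the sample page, and the density definition used here (characters left after stripping tags, divided by the line's total characters) are all assumptions made for illustration, since the abstract does not define the ratio precisely. The sketch also assumes the page has already been converted to well-formed XML.

```python
import re
import xml.etree.ElementTree as ET


def node_xpaths(root):
    """Return (xpath, element) pairs for every element, in document order."""
    results = []

    def walk(elem, path):
        results.append((path, elem))
        counts = {}  # per-tag position among siblings, 1-based as in XPath
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}")
    return results


TAG_RE = re.compile(r"<[^>]*>")


def line_text_density(line):
    """Assumed definition: share of a line's characters that are not markup."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped)
    return len(text) / len(stripped)


# Hypothetical page, already well-formed XML.
page = """<html>
<body>
<div><a href="/a">Home</a><a href="/b">News</a></div>
<div><p>A long paragraph of article text that carries the actual news content.</p></div>
</body>
</html>"""

root = ET.fromstring(page)
for xpath, _ in node_xpaths(root):
    print(xpath)

for line in page.splitlines():
    print(round(line_text_density(line), 2), line[:40])
```

Under this assumed definition, the navigation line scores much lower than the article line, which is the kind of signal that could be fed to a classifier such as the BP neural network described above.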
Keywords/Search Tags: Web Information Extraction, HTML, XPath, Vocabulary Similarity, BP Neural Network