Study On Automatic Extraction Of Web Data Based On DOM

Posted on:2013-02-10

Degree:Master

Type:Thesis

Country:China

Candidate:R R Du

Full Text:PDF

GTID:2248330377952155

Subject:Computer software and theory

Abstract/Summary:

With the development of Internet technology, the web has become one of the important data sources for applications and research, providing high-quality data for information retrieval, data mining, and other fields. Much important web data is stored in searchable web databases and is rendered into pages from a template only when a query is submitted, as with the product information pages of e-commerce sites. This part of the web is known as the Deep Web. The Deep Web holds a large amount of data, grows rapidly, covers a wide range of domains, is organized by topic, is highly structured, and has high value. Extracting information from the Deep Web and using this huge volume of data accurately and effectively therefore has important practical significance and broad application prospects.

Web sites on the Internet are independent of one another, so Deep Web data is very difficult to collect, and traditional search engines are of little help. Hand-written rules can complete the information extraction with high accuracy and a low technical threshold, but because information sources are diverse and sites may be redesigned at any time, hand-written rules cannot meet the demand for access to information. Automatic extraction of web data is therefore an urgent problem. This thesis presents an in-depth and systematic study of automatic web data extraction, covering a machine learning method for identifying the query interface, automatic extraction of web data, data item alignment, and a system for automatic web data extraction. The specific research work and results are as follows:

(1) We propose a method, based on the decision tree model, for finding the query interface automatically. We chose the decision tree model to classify HTML tags after comparing the accuracy of several classification models trained on an automatically generated feature set built from the characteristics of HTML tags.

(2) Based on the tree matching algorithm, we propose an improved algorithm that filters the extraction results to improve accuracy. First, the tree matching algorithm is used to extract data from list pages. Its extraction accuracy is not high, because the algorithm only mines the repeated structure of the web page. On this basis, we propose an entropy-based filtering algorithm, and use k-means clustering to determine the entropy value of the noise.

(3) We propose a set of alignment rules based on the partial tree alignment algorithm to improve the accuracy of data alignment.

(4) On the basis of the above work, we designed and developed a system for automatic web data extraction, which: 1) given multiple data sources, automatically finds the query interface and automatically fills in and submits the query; 2) automatically extracts data from the list pages returned by query requests, then filters the results to improve extraction accuracy; 3) aligns and saves the data records extracted from the list pages; 4) when page navigation is present, continues extracting data from the subsequent pages automatically.

The innovations of this thesis are as follows. (1) A method for finding the query interface automatically, which uses a decision tree model to classify HTML tags. (2) An improved algorithm for extracting web data: on the basis of the tree matching algorithm, an entropy-based algorithm filters the extraction results to achieve higher accuracy. (3) An improved data alignment algorithm built on the partial tree alignment algorithm, which uses a set of alignment rules to align data items with higher accuracy.

Experiments show that the proposed techniques can automatically and quickly extract rich data from list pages with almost no human intervention.
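
The abstract does not say which tree matching algorithm is used. As an illustration only, the sketch below assumes it is the classic Simple Tree Matching (STM) algorithm, which counts the largest number of node pairs that can be matched between two ordered trees while preserving order and parent-child relations. The `Node` class, the function names, and the normalized `similarity` score are hypothetical and are not taken from the thesis. The sketch shows why repeated structure alone identifies candidate records on a list page: sibling subtrees with similar tag structure score high, while unrelated blocks score zero.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A DOM-like node: only the tag name and ordered children matter here."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum order-preserving matching between trees a and b."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(a.children), len(b.children)
    # table[i][j] = best matching between the first i children of a
    # and the first j children of b (dynamic programming, as in LCS).
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j],
                              table[i][j - 1],
                              table[i - 1][j - 1] + w)
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized by the average tree size, in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two product entries with slightly different fields, and one unrelated block.
item1 = Node("li", [Node("a"), Node("span"), Node("span")])
item2 = Node("li", [Node("a"), Node("span")])
banner = Node("div", [Node("img")])

print(simple_tree_matching(item1, item2))  # 3
print(similarity(item1, banner))           # 0.0
```

Because the score depends only on tag structure, repeated non-data blocks such as navigation menus also match one another, which is consistent with the abstract's remark that extraction accuracy is low before the entropy-based filter is applied.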

Keywords/Search Tags:

Web information extraction, list page, decision tree, entropy

Related items

1	Research Of Data Extraction Technology Based On Tag Tree From List Pages
2	Key Technologies Research On Web Products Automatic Extraction Based On Web List
3	Research On Data Extraction Of Deep Web Based On Visual Information And Tree Match
4	Research On Web Data Extraction Based On Web Page Structure
5	Research And Implementation Of WEB Page Body Information Extraction Based On DOM Tree
6	Decision Tree Learning Based On General Entropy And Unstable Cut-points
7	Research On The Classifying Algorithm Based On Decision Tree
8	The Application Of Information Entropy In Machine Learning Algorithm
9	Research On Mining Structure Of WEB Page For Information Extraction
10	Research And Implementation Of A Web Information Extraction System Based On Semantic Structure Of The Website