Study On Automatic Extraction Of Web Data Based On DOM

Posted on: 2013-02-10
Degree: Master
Type: Thesis
Country: China
Candidate: R R Du
GTID: 2248330377952155
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, the web has become one of the important data sources for applications and research, providing high-quality data for information retrieval, data mining, and other fields. Much of the important web data is stored in searchable web databases and is only rendered into web pages, according to a template, when a query is submitted; the product information pages of e-commerce sites are an example. This content is known as the Deep Web. The deep web holds a large amount of data, grows rapidly, covers a wide range of domains, is organized by topic, is highly structured, and is of high value. Therefore, how to extract information from the deep web and use these huge amounts of data accurately and effectively has important practical significance and broad application prospects.
Web sites on the Internet are independent of one another, so deep web data is very difficult to collect, and traditional search engines play a negligible role here. Hand-written rules can accomplish the extraction with high accuracy and a low technical threshold, but because information sources are diverse and pages may be redesigned at any time, hand-written rules cannot keep up with the demand for access to information. In summary, automatic extraction of web data is an urgent problem. This paper gives an in-depth and systematic study of automatic extraction of web data, covering a machine learning method for identifying the query interface, automatic extraction of web data, data item alignment, and a system for automatic extraction of web data. The specific research work and results of this paper are as follows:
(1) We proposed a method, based on the decision tree model, for finding the query interface automatically. We chose the decision tree model to classify HTML tags after comparing the accuracy of several classification models, each trained on an automatically generated feature set built from the characteristics of HTML tags.
(2) Based on the tree matching algorithm, we proposed an improved algorithm that filters the extraction results to improve accuracy. First, the tree matching algorithm is used to extract data from list pages. Its extraction accuracy is not high, because the algorithm only mines the repeated structure of the web page. On this basis, we proposed a filter algorithm based on entropy, with the k-means clustering algorithm used to determine the entropy value of the noise.
(3) We proposed a set of alignment rules based on the partial tree alignment algorithm to improve the accuracy of data alignment.
(4) On the basis of the above research, we designed and developed a system for automatic extraction of web data, which works as follows: 1) Given multiple data sources, it automatically finds the query interface and automatically fills in and submits the query. 2) It automatically extracts the data from the list pages returned by the query requests and then filters the results to improve extraction accuracy. 3) The data records extracted from the list pages are aligned and saved. 4) Where page navigation is present, the data on subsequent pages is extracted continuously and automatically.
The innovations of this paper are as follows. (1) We proposed a method for finding the query interface automatically, using a decision tree model to classify HTML tags. (2) We proposed an improved algorithm for extracting web data: on top of the tree matching algorithm, an entropy-based algorithm filters the extraction results to achieve higher accuracy. (3) We proposed an improved data alignment algorithm built on the partial tree alignment algorithm, using a set of alignment rules to align data items more accurately.
Experiments show that the proposed techniques can automatically and quickly extract rich data from list pages with almost no human intervention.
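The abstract does not say which tree matching algorithm is used. As an illustration only, the sketch below assumes something along the lines of the classic Simple Tree Matching recurrence, which is often paired with partial tree alignment. The `Node` class, the function names, and the normalization are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving match."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best match between the first i children of a
    # and the first j children of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def tree_size(node: Node) -> int:
    """Count the nodes in a subtree."""
    return 1 + sum(tree_size(child) for child in node.children)


def normalized_similarity(a: Node, b: Node) -> float:
    """Scale the match count to [0, 1]; this normalization is an assumption."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))
```

Sibling subtrees of a list page whose normalized similarity exceeds some threshold would be candidates for repeated data records. Because such a score only reflects repeated structure, repeated navigation or advertisement blocks can score just as high, which is the weakness the abstract attributes to this step.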
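The abstract does not state which quantity the entropy is computed over or how k-means is configured. The sketch below is one possible reading under explicit assumptions: Shannon entropy is computed over the tokens of each extracted candidate, and a one-dimensional k-means with k = 2 separates the entropy values into two groups. All names here are hypothetical, and which group counts as noise is left to the caller.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token distribution of one candidate."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def two_means_1d(values, iterations=100):
    """1-D k-means with k = 2 over a non-empty list of numbers.

    Returns (low_center, high_center, threshold), where threshold is the
    midpoint between the two centers.
    """
    lo, hi = min(values), max(values)  # start from the extremes
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # assignments are stable
        lo, hi = new_lo, new_hi
    return lo, hi, (lo + hi) / 2
```

Under these assumptions, each extracted candidate is scored with `shannon_entropy`, the scores are passed to `two_means_1d`, and candidates falling on the noise side of the returned threshold are dropped.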
Keywords/Search Tags: Web information extraction, list page, decision tree, entropy