
Algorithms For Extracting The Web Hidden Database And Skyline

Posted on: 2018-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Shang
Full Text: PDF
GTID: 2428330542497617
Subject: Computer application technology
Abstract/Summary:
The Internet has become one of the defining symbols of the 21st century. The number of Internet users worldwide has grown dramatically, and the information resources available online have become increasingly abundant. The Internet is a vast, shared information space that is both global and distributed. More and more information is stored in the backend databases of major websites for Internet users to access. The network has become the platform on which information is queried, yet a huge amount of information is hidden in query-limited Web backend databases (also known as hidden Web databases), so users cannot obtain these high-quality data records effectively. This thesis addresses that problem: it aims to help users extract useful information from massive data efficiently and to return that information to them conveniently.

Unlike extraction at ordinary scales, large-scale data extraction raises many problems and difficulties. In particular, Deep Web data extraction is constrained by the number of free queries a Web page allows and by the number of results returned per query. Facing these constraints, we need to consider the choice of tools and programs, the allocation of system resources, data mining methods and techniques, and how the extracted data is stored and accessed. The main problem is how to estimate and control the number of queries required while still extracting the whole hidden Web database. This issue is currently a research hotspot in Web data mining, and many approaches to the Web data extraction problem exist. In this thesis, we analyze in depth the characteristics of the data in hidden Web databases, examine the existing extraction algorithms thoroughly, and propose several improvements; experiments verify the effectiveness and superiority of the improved algorithms.

The main content and work of this thesis include the following aspects:
(1) Building on previous research, the hidden Web database is divided into three categories: numerical attributes, categorical (classification) attributes, and hybrid attributes. We study the extraction problem in depth for each of the three types. On this basis, we study a Skyline extraction algorithm for the hidden Web database, so that the Skyline can be obtained without first extracting all the data and then computing it.
(2) For the problem of extracting numerical attributes from a hidden Web database: based on the method of partitioning the space of a numeric data set, we propose a distribution-based multidimensional dynamic partitioning algorithm, MDPA for short.
(3) For the problem of extracting categorical attributes: we propose an improved heuristic slice-cover algorithm (AHSCA), which partitions the data space formed by all data points with categorical attributes. The algorithm can flexibly choose the attribute to partition on next, thereby reducing the query cost and improving efficiency. Combining AHSCA and MDPA, we propose a hybrid extraction algorithm for hybrid attributes.
(4) For the problem of extracting the Skyline of a hidden Web database: we propose a heuristic query decomposition algorithm based on the definition of the intersection element query tree and the complete intersecting nature of the Skyline group. The query tree is constructed by depth-first or breadth-first traversal, and the Skyline of the hidden Web database D is obtained in the process.
(5) The validity and superiority of the above algorithms are verified experimentally in this thesis.
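To make the setting concrete, the following is a minimal Python sketch of extraction under a top-k return limit, followed by a Skyline computation over the extracted records. It is not MDPA, AHSCA, or the thesis's query decomposition algorithm: the crawler simply halves any numeric range whose query overflows, and the Skyline is computed only after extraction, which is the baseline the thesis aims to improve on. The `HiddenDB` class, the `query` interface, the record layout (two numeric attributes, smaller is better), and the sample data are all assumptions made for illustration.

```python
from typing import List, Tuple

Record = Tuple[int, int]  # hypothetical (price, distance); smaller is better on both


class HiddenDB:
    """Toy stand-in for a query-limited Web interface (assumed, not from the thesis)."""

    def __init__(self, records: List[Record], k: int) -> None:
        self.records = records
        self.k = k          # maximum number of results returned per query
        self.queries = 0    # number of queries issued so far

    def query(self, lo: int, hi: int) -> Tuple[List[Record], bool]:
        """Return at most k records with lo <= price <= hi, plus an overflow flag."""
        self.queries += 1
        matches = [r for r in self.records if lo <= r[0] <= hi]
        return matches[: self.k], len(matches) > self.k


def crawl(db: HiddenDB, lo: int, hi: int) -> List[Record]:
    """Extract the records in [lo, hi] by halving every range that overflows."""
    results, overflow = db.query(lo, hi)
    if not overflow:
        return results
    if lo == hi:
        # More than k records share this single value; splitting on this
        # attribute alone cannot separate them, so return what was visible.
        return results
    mid = (lo + hi) // 2
    return crawl(db, lo, mid) + crawl(db, mid + 1, hi)


def dominates(a: Record, b: Record) -> bool:
    """a dominates b if a is no worse on every attribute and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(records: List[Record]) -> List[Record]:
    """Records that no other record dominates (quadratic-time definition check)."""
    return [r for r in records if not any(dominates(o, r) for o in records)]


if __name__ == "__main__":
    data = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 6), (6, 2), (7, 5), (8, 1), (9, 3)]
    db = HiddenDB(data, k=3)
    extracted = crawl(db, 1, 9)
    print("queries issued:", db.queries)
    print("skyline:", skyline(extracted))
```

The query counter shows the cost the thesis is concerned with: every overflowing range costs one query before it is split, so the choice of split points and of the attribute to split on determines how many queries a full extraction needs.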
Keywords/Search Tags: Web data extraction, hidden Web database, AHSCA, MDPA, Skyline