Research On Web Database Extraction Based On Formal Concept Analysis

Posted on: 2012-03-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Zhang
GTID: 1118330344451682
Subject: Computer application technology
Abstract/Summary:
Most Web-based applications need to acquire data embedded in Web pages. As the Internet has grown, however, more and more Web pages are generated dynamically by programs running on Web servers, and searching such deep Web pages and extracting their data is harder than doing so for static Web pages. Online Web databases are an important component of the deep Web. Because they hold structured data, the pages they return in response to queries lend themselves to data extraction. Data extraction from Web databases is therefore an important topic in deep Web research, and several scholars have studied it in depth. Little work, however, addresses data extraction from Web databases that limit the size of the query results returned in Web pages. This dissertation analyzes at length the problem of extracting data from such limited Web databases using the formal approach of Formal Concept Analysis (FCA). The main work and contributions are as follows:

(1) The relation among all pairs consisting of a query and its result is first proved to be a tolerance relation, and the set of these pairs is then proved to be a complete lattice that is homomorphic to the concept lattice derived from the same source. The problem of extracting data from limited Web databases is thereby transformed into an FCA application problem. The order relation between concepts can be used to describe the correlation between queries: the intent of a formal concept can be regarded as a query, and the size of the query result is predicted by the cardinality of the concept's extent.

(2) In applying concept lattices to data extraction from limited Web databases, a series of extraction algorithms is proposed to improve the efficiency of concept-lattice-based applications. These are: Ladeldew, an algorithm for data extraction from the limited deep Web based on lattice space, presented from the perspective of concept set covering; an extraction algorithm based on subposition assembly construction of the semi-lattice space; and EdaliwdbFCA, an extraction algorithm based on maximum subconcepts, proposed from the viewpoint of Information Retrieval (IR).

(3) It is well known that an efficient way to reduce the time and space complexity of constructing a concept lattice in an FCA-based task is to construct only the part of the lattice that is needed. This dissertation therefore proposes a theoretical framework for subposition assembly construction of the lower semi-lattice, on which an algorithm called Nocose is based. Nocose avoids constructing the complete concept lattice. A method for generating the lower cover of a concept is then given. It dynamically generates the lower cover of the current query concept, which serves as the search space in the data extraction tasks above, and so avoids constructing even the semi-lattice. Both methods reduce the computational complexity of FCA-based applications and lay a solid theoretical foundation for their practice.

(4) To further process Web data, which is large-scale, dynamic, heterogeneous, repetitive, and conflicting, while keeping the theoretical methods consistent, this dissertation also proposes an FCA-based theoretical framework for concept fusion. The framework is built by studying relations between concepts, such as conflict, complementarity, and abstraction, with the help of the formal conceptual representation and conceptual analysis of FCA. Finally, an algorithm called Acorn for mining associative concepts in domain Web pages is presented on the basis of this fusion framework.

(5) All of the work in this dissertation is verified theoretically and tested empirically, and the experimental results show that it is theoretically correct and feasible in practical application. In addition, the performance of each algorithm is tested and analyzed according to its own characteristics.

The research in this dissertation has both theoretical significance and a wide range of practical applications. On the one hand, it enriches the theoretical study of Web information extraction and concept fusion; on the other hand, it extends the scope of application of the concept lattice and provides new approaches to Web information extraction and fusion. A large number of theoretical issues and specific application problems nevertheless remain to be solved. This is long-term and arduous work that requires perseverance.
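
The correspondence described in items (1) and (3) can be illustrated with a minimal Python sketch. It is not the dissertation's Ladeldew, Nocose, or EdaliwdbFCA algorithm; it only shows the standard FCA derivation operators on a toy formal context. The record names, attribute values, and function names (`extent`, `intent`, `concept_of`, `lower_cover`) are assumptions made for illustration. A query is modeled as a set of attributes, its predicted result size is the cardinality of the extent, and the lower cover is computed on demand without building the whole lattice.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Concept = Tuple[FrozenSet[str], FrozenSet[str]]  # (extent, intent)

# Hypothetical toy context: each record (object) maps to the attribute
# values it carries. This data is illustrative, not from the dissertation.
CONTEXT: Dict[str, Set[str]] = {
    "r1": {"brand=A", "color=red"},
    "r2": {"brand=A", "color=blue"},
    "r3": {"brand=B", "color=red"},
    "r4": {"brand=A", "color=red", "sale"},
}


def extent(attrs: Iterable[str], context: Dict[str, Set[str]] = CONTEXT) -> FrozenSet[str]:
    """Objects that carry every attribute in attrs (the 'query result')."""
    wanted = set(attrs)
    return frozenset(g for g, has in context.items() if wanted <= has)


def intent(objs: Iterable[str], context: Dict[str, Set[str]] = CONTEXT) -> FrozenSet[str]:
    """Attributes shared by every object in objs (all attributes if objs is empty)."""
    shared = set().union(*context.values())
    for g in objs:
        shared &= context[g]
    return frozenset(shared)


def concept_of(attrs: Iterable[str], context: Dict[str, Set[str]] = CONTEXT) -> Concept:
    """Formal concept generated by a query: (extent, closed intent)."""
    ext = extent(attrs, context)
    return ext, intent(ext, context)


def lower_cover(attrs: Iterable[str], context: Dict[str, Set[str]] = CONTEXT) -> List[Concept]:
    """Immediate subconcepts of the concept generated by attrs.

    Candidates are obtained by adding one attribute outside the closed
    intent; the lower cover consists of the candidates with maximal extent.
    """
    _, closed = concept_of(attrs, context)
    all_attrs = set().union(*context.values())
    candidates = {concept_of(closed | {m}, context) for m in all_attrs - closed}
    return [c for c in candidates if not any(c[0] < d[0] for d in candidates)]
```

For the toy context, the query `{"brand=A"}` has a predicted result size of `len(extent({"brand=A"}))`, which is 3, and `lower_cover({"brand=A"})` yields two narrower queries whose extents are `{"r1", "r4"}` and `{"r2"}`. In a limited Web database setting, such narrower queries are the kind of refinement one might issue when the predicted result size exceeds the page limit.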
Keywords/Search Tags:Web database, Formal Concept Analysis, Concept Lattice, Data Extraction, Concept Fusion