The Research Of Focused Web Information Extraction Technology With Sparse Sample

Posted on:2015-09-27

Degree:Master

Type:Thesis

Country:China

Candidate:L H Luo

Full Text:PDF

GTID:2298330422989408

Subject:Computer application technology

Abstract/Summary:

With the information explosion of recent years, simple information retrieval alone is no longer sufficient to obtain useful information from massive data, and Web information extraction has therefore become a central issue. Researchers in China and abroad have proposed a variety of solutions to cope with the scale and heterogeneity of Web data, but there is still no effective Web information extraction system that can adapt to a wide range of extraction tasks.

Given this situation, this paper proposes a Web information extraction method for sparse sample sets. With this method, users can extract specific information from a large number of pages that are structurally similar to the sparse samples they provide, usually a set containing a single sample. The main content of this paper is as follows:

(1) Computing the structural similarity between two pages normally requires downloading the full content of both pages, which makes it difficult to collect structurally similar pages at large scale. This paper instead uses features of the pages' URLs to compute an approximate similarity between pages. Based on this fast structure-capturing algorithm, a focused crawler is designed to supply practical data for the subsequent Web information extraction.

(2) A feature model for DOM tree nodes and a feature selection method for Web information extraction under sparse sample sets are proposed. The main idea is that a DOM tree node can be described by its neighbors. The atomic features of a node, together with the structural features determined by its parent, child, and sibling nodes, are selected by an optimized user-oriented method based on statistical measures and users' interests. Information is then extracted by feature comparison.

(3) A Web information extraction system for sparse sample sets is implemented. The user provides only a sample page; the system finds structurally similar pages and then automatically extracts the corresponding content. The system can be applied to a variety of information extraction tasks and can provide cross-domain (cross-website) sorting, searching, and other structured information services.
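
The URL-based similarity described in point (1) can be illustrated with a minimal sketch. The abstract does not specify the actual URL features or scoring used, so everything below is an assumption made for illustration: the function names `url_signature` and `url_similarity`, the rule of replacing digit runs with a placeholder, and the position-wise matching score are all hypothetical.

```python
import re
from urllib.parse import urlsplit


def url_signature(url):
    """Reduce a URL to a coarse structural signature (hypothetical scheme)."""
    parts = urlsplit(url)
    tokens = [parts.netloc.lower()]
    for segment in parts.path.split("/"):
        if not segment:
            continue
        # Replace digit runs with a placeholder so /item/123 and /item/456 match.
        tokens.append(re.sub(r"\d+", "<num>", segment.lower()))
    # Keep query parameter names but discard their values.
    for pair in parts.query.split("&"):
        if pair:
            tokens.append("?" + pair.split("=", 1)[0].lower())
    return tokens


def url_similarity(a, b):
    """Fraction of aligned signature positions that match, in [0, 1]."""
    sa, sb = url_signature(a), url_signature(b)
    longest = max(len(sa), len(sb))
    matches = sum(1 for x, y in zip(sa, sb) if x == y)
    return matches / longest


sample = "http://example.com/product/123.html"
candidates = [
    "http://example.com/product/456.html",
    "http://example.com/about/contact",
]
# Rank candidate links by similarity to the sample page without downloading them.
ranked = sorted(candidates, key=lambda u: url_similarity(sample, u), reverse=True)
```

A crawler built on such a score could fetch only links above a chosen threshold, which is the kind of saving the abstract attributes to avoiding full-page downloads. The threshold itself would have to be tuned per site.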

Keywords/Search Tags:

Web information extraction, sparse example, feature model, feature selection

Related items

1	Research On Image Classification And Feature Extraction Algorithms Under Sparse Constraints
2	Study Of Graph-based Feature Extraction And Feature Selection With Their Applications
3	Research On Polarization Radar Sparse Imaging And Feature Extraction Method
4	Research On Graph Regularized And Discriminant Information Based Feature Selection
5	Research On Dimensionality Reduction Algorithm Based On Reconstruction Information Preservation
6	Study On Model And Algorithm Of Dynamic Feature Fusion Based On Information Sources Selection And Sequential Extraction
7	Research On Feature Selection Algorithms Based On Structure Information Of Samples And Features
8	Research On Feature Selection Algorithm And Its Application In Image Recognition
9	Research And Improvement Of Feature Selection Algorithms Based On Sparse Learning
10	Embedded Unsupervised Feature Selection Based On Sparse Learning