Web Information Extraction Method Based on a Hybrid Genetic Annealing Algorithm

Posted on: 2010-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: F Xiao
Full Text: PDF
GTID: 2208360275483043
Subject: Software engineering
Abstract/Summary:
With the rapid development of network technology, people increasingly rely on information from the network. The web is an immense source of information, a huge database containing a variety of valuable content, and how to extract latent but useful information from it is an attractive research direction. Web information extraction is a data mining technology that automatically discovers and extracts information and knowledge from web documents and services. It is one of the important methods for speeding up search and improving search accuracy when processing network information. This thesis introduces the basics of web information extraction technology and its current state at home and abroad, and then uses the Hidden Markov Model (HMM) to extract web information. First, the thesis discusses the formulation of the Hidden Markov Model and its typical algorithms. Second, it uses the Hidden Markov Model to extract specific headline information from labeled training data sets. For unlabeled training data sets, because the model is sensitive to its initial parameters, the Hidden Markov Model can be optimized with a genetic algorithm (GA-HMM). Because the genetic algorithm tends to converge prematurely, the thesis introduces another optimization algorithm, the Simulated Annealing Algorithm, and combines it with the Hidden Markov Model (SA-HMM) to find optimal initial parameters. It presents the whole framework of web information extraction based on SA-HMM and compares the experimental results of the two optimization algorithms.
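As an illustration of one of the typical Hidden Markov Model algorithms mentioned above, the following is a minimal sketch of the scaled forward algorithm, which computes the log-likelihood of an observation sequence under a given model. The abstract does not say which objective the genetic or annealing search maximizes; using this log-likelihood as the fitness is an assumption here, and the names `forward_log_likelihood`, `pi`, `A`, `B` and `obs` are hypothetical.

```python
import math


def forward_log_likelihood(pi, A, B, obs):
    """Return log P(obs | model) using the scaled forward algorithm.

    pi  : initial state probabilities, length n
    A   : n x n state transition matrix, A[i][j] = P(state j | state i)
    B   : n x m emission matrix, B[i][k] = P(symbol k | state i)
    obs : non-empty list of observed symbol indices
    (All names are illustrative; they are not taken from the thesis.)
    """
    n = len(pi)
    # Initialization: probability of starting in each state and emitting obs[0].
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: propagate through the transitions, then emit obs[t].
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            # The sequence is impossible under this model.
            return float("-inf")
        # Rescale to avoid numerical underflow on long sequences.
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik
```

Under this assumption, a candidate set of initial parameters produced by the genetic algorithm or by simulated annealing would be scored by summing this value over the training sequences, with higher values indicating a better starting model.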
To reduce the influence of the problems that these two methods show during recognition, overcome the defects of the two optimization algorithms, and improve the efficiency of the system, it is advisable to use a hybrid genetic annealing algorithm to select the globally optimal initial model parameters. Analysis of the experimental results shows that the web information extraction methods based on GA-HMM and SA-HMM are both very effective. Because it combines the advantages of the two optimization algorithms, the web information extraction method based on the hybrid genetic annealing algorithm and the Hidden Markov Model gives better experimental results than the two former methods.
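The abstract does not describe how the genetic and annealing steps are combined, so the following is only a sketch of one common way to hybridize them, not the thesis's algorithm. Offspring are produced by crossover and mutation, and each offspring replaces its parent according to the Metropolis acceptance rule at a temperature that is gradually lowered. The function name, the parameter names and the encoding of a candidate as a vector in [0, 1] are all assumptions made for illustration.

```python
import math
import random


def hybrid_genetic_annealing(fitness, dim, pop_size=20, generations=100,
                             t_start=1.0, cooling=0.95,
                             mutation_scale=0.1, seed=0):
    """Maximize `fitness` over vectors in [0, 1]^dim (illustrative sketch)."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    temperature = t_start
    for _ in range(generations):
        next_population = []
        for parent in population:
            # Genetic step: uniform crossover with a random mate ...
            mate = rng.choice(population)
            child = [p if rng.random() < 0.5 else m
                     for p, m in zip(parent, mate)]
            # ... followed by Gaussian mutation, clipped to [0, 1].
            child = [min(1.0, max(0.0, g + rng.gauss(0.0, mutation_scale)))
                     for g in child]
            # Annealing step: always accept an improvement; accept a worse
            # child with probability exp(delta / T), which shrinks as T cools.
            delta = fitness(child) - fitness(parent)
            if delta >= 0 or rng.random() < math.exp(delta / temperature):
                next_population.append(child)
            else:
                next_population.append(parent)
        population = next_population
        # Remember the best candidate seen so far (elitism outside the pool).
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
        temperature *= cooling
    return best
```

Accepting some worse offspring early on is what lets the population move away from a premature consensus, while the cooling schedule makes the search increasingly greedy. To apply such a routine to Hidden Markov Model initialization, the vector would have to be decoded into row-normalized initial, transition and emission probabilities before scoring; that decoding step is not shown here.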
Keywords/Search Tags: Web Information Extraction, Hidden Markov Model, Genetic Algorithm, Simulated Annealing Algorithm, Hybrid Genetic/Simulated Annealing Algorithm