Web Text Information Extraction And Classification

Posted on:2015-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:W N Li

GTID:2298330434964991

Subject:Computer application technology

Abstract/Summary:

With the rapid development of Internet technology and the explosive growth of online information, it has become very difficult for users to find the information they need accurately and quickly in this mass of data, and this situation drives research on Web information extraction technology. The Hidden Markov Model (HMM) has attracted increasing attention from researchers because it is easy to build, flexible, and accurate in extraction. However, the application of HMM to information extraction is limited by low accuracy, which is caused by its sensitivity to initial conditions and by its failure to consider how the state transition probabilities and the observation output probabilities depend on historical states. To address these problems in HMM-based Web information extraction, and in light of the research background, we studied Web information extraction methods and classification methods. The main contents and conclusions are as follows:

(1) The Web information extraction model is studied. Because HMM does not consider the dependence of the state transition probabilities and the observation output probabilities on historical states, the second-order hidden Markov model (HMM2) is used as the underlying model. Because training an HMM with the Baum-Welch algorithm can become trapped in a local optimum, simulated annealing (SA) is used to train the HMM2, and the SA-HMM2 training algorithm is proposed. The evaluation function P(O|λ) of the HMM2 is introduced as the objective function of the optimization in order to obtain globally optimal parameters. Experimental results demonstrate that SA-HMM2 improves the F1 value by 21% and 7%, respectively, over the other two methods.

(2) Web information extraction methods are studied. According to the project requirements, the proposed model is applied to information extraction from agricultural websites. First, we analyze the Web pages by preprocessing them and removing irrelevant information. Second, the VIPS algorithm is used to collect state transition sequences. Finally, appropriate SA parameters are obtained through experiments, and Web information extraction based on SA-HMM2 is implemented. Experiments demonstrate that the proposed method raises the overall F1 value across the different extraction domains by an average of 13% and 7% over the information extraction methods based on HMM and SA-HMM, respectively.

(3) Web text classification is studied. Starting from the extraction results for agricultural Web information, we first process the agricultural text whose column is known with word segmentation, text representation, and feature selection to build a column keyword dictionary. The same word segmentation, text representation, and feature selection steps are then applied to the agricultural text whose column is unknown. Combined with the column keyword dictionary, the KNN algorithm is then used for classification. Experimental results demonstrate that the KNN algorithm achieves high classification performance for each column, with an average F1 value of 0.844.
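The idea in item (1), using simulated annealing with P(O|λ) as the objective instead of relying on Baum-Welch alone, can be illustrated with a minimal Python sketch. It is not the thesis's SA-HMM2 algorithm: for brevity it assumes a first-order discrete HMM rather than HMM2, a toy observation sequence, and an arbitrary geometric cooling schedule. All function names and parameter values (log_likelihood, perturb_row, neighbor, anneal, the step size 0.1) are hypothetical choices made for illustration.

```python
import math
import random


def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(O | lambda) for a discrete first-order HMM."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_p += math.log(scale)          # accumulate log of each scaling factor
        alpha = [a / scale for a in alpha]  # renormalize to avoid underflow
    return log_p


def perturb_row(row, step, rng):
    """Jitter one probability row and renormalize so it still sums to 1."""
    noisy = [max(1e-6, p + rng.uniform(-step, step)) for p in row]
    total = sum(noisy)
    return [p / total for p in noisy]


def neighbor(params, step, rng):
    """Copy the parameters and perturb one randomly chosen row of pi, A, or B."""
    pi, A, B = params
    pi, A, B = list(pi), [list(r) for r in A], [list(r) for r in B]
    choice = rng.randrange(3)
    if choice == 0:
        pi = perturb_row(pi, step, rng)
    elif choice == 1:
        i = rng.randrange(len(A))
        A[i] = perturb_row(A[i], step, rng)
    else:
        i = rng.randrange(len(B))
        B[i] = perturb_row(B[i], step, rng)
    return pi, A, B


def anneal(obs, params, t_start=1.0, t_end=1e-3, cooling=0.95,
           steps_per_temp=20, seed=0):
    """Maximize log P(O | lambda) by simulated annealing (assumed schedule)."""
    rng = random.Random(seed)
    current, current_ll = params, log_likelihood(obs, *params)
    best, best_ll = current, current_ll
    temp = t_start
    while temp > t_end:
        for _ in range(steps_per_temp):
            cand = neighbor(current, 0.1, rng)
            cand_ll = log_likelihood(obs, *cand)
            delta = cand_ll - current_ll
            # Metropolis rule: always accept improvements,
            # accept worse moves with probability exp(delta / temp).
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current, current_ll = cand, cand_ll
                if current_ll > best_ll:
                    best, best_ll = current, current_ll
        temp *= cooling
    return best, best_ll


# Toy data (assumed): 2 hidden states, 3 observation symbols.
obs = [0, 0, 1, 0, 2, 2, 1, 2, 0, 0, 1, 2]
init = ([0.5, 0.5],
        [[0.5, 0.5], [0.5, 0.5]],
        [[1 / 3] * 3, [1 / 3] * 3])
start_ll = log_likelihood(obs, *init)
best, best_ll = anneal(obs, init)
```

Accepting some worse candidates at high temperature is what lets the search leave a poor local optimum; as the temperature falls, the procedure behaves more and more like a greedy hill climb. A second-order model would condition each transition on the two preceding states, which changes the forward recursion but not the annealing loop.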

Keywords/Search Tags:

Web information extraction, text classification, hidden Markov model, second-order hidden Markov model, simulated annealing algorithm, K-nearest neighbor algorithm

Related items

1	Algorithm Research For Text Information Extraction Based On Hidden Markov Model
2	Parameter Estimation Of Hidden Markov Model And It's Application In News Classification
3	The Algorithm Research Of Chinese Information Extraction Based On The Hidden Markov Model
4	Text Classification Based On Hidden Markov Model And Semantic Fusion
5	Web Free Text Information Extraction Based On TABLE Layout And Hidden Markov Model
6	Research And Implementation Of Web Information Extraction Based On Improved Hidden Markov Model
7	Research On Multiresolution Hidden Markov Model For Image Denoising
8	Pulse Classification Based On Hidden Markov Model
9	Research Of Web Text Mining Technology Based On Hidden Markov Model
10	Based On Hybrid Genetic Annealing Algorithm For Web Information Extraction Method