Web Text Information Extraction And Classification

Posted on: 2015-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: W N Li
GTID: 2298330434964991
Subject: Computer application technology

Abstract/Summary:
With the rapid development of Internet technology and the explosive growth of online information, it has become very difficult for users to find the information they need accurately and quickly in the mass of available data, and this situation drives research on Web information extraction technology. The Hidden Markov Model (HMM) has attracted increasing attention from researchers because it is easy to build, flexible, and capable of high extraction accuracy. However, the application of HMM to information extraction is limited by reduced accuracy, which stems from its sensitivity to initial conditions and from its failure to consider how the state transition probabilities and the observation output probabilities depend on historical states. To address these problems in HMM-based Web information extraction, and in light of the research background, we studied Web information extraction methods and classification methods. The main contents and conclusions are as follows:

(1) The Web information extraction model is studied. Because HMM does not consider the dependence of the state transition probabilities and the observation output probabilities on historical states, the second-order HMM (HMM2) is used as the underlying model. To address the local-optimum problem that arises when the Baum-Welch algorithm is used to train an HMM, simulated annealing (SA) is used to train HMM2, and the SA-HMM2 training algorithm is proposed. The evaluation function P(O|λ) of HMM2 is used as the objective function of the optimization in order to obtain globally optimal parameters. Experimental results demonstrate that SA-HMM2 improves the F1 value by 21% and 7%, respectively, over the two other methods.

(2) Web information extraction methods are studied.
According to the project requirements, the proposed model is applied to information extraction from agricultural websites. First, we analyze the Web pages by preprocessing the original pages and cleaning up irrelevant information. Second, the VIPS algorithm is used to collect state transition sequences. Finally, appropriate SA parameters are determined experimentally, and Web information extraction based on SA-HMM2 is implemented. Experiments demonstrate that the proposed method improves the overall F1 value across the different extraction domains by an average of 13% and 7% over the information extraction methods based on HMM and SA-HMM, respectively.

(3) Web text classification is studied. Based on the extraction results for agricultural Web information, we first apply word segmentation, text representation, and feature selection to agricultural text whose column is known, in order to build the column keyword dictionary. The same word segmentation, text representation, and feature selection steps are then applied to agricultural text whose column is unknown. Finally, the KNN algorithm, combined with the column keyword dictionary, is used for classification. Experimental results demonstrate that the KNN algorithm achieves high classification performance for each column, with an average F1 value of 0.844.
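
The training idea in item (1), using simulated annealing with P(O|λ) as the objective instead of Baum-Welch, can be illustrated with a minimal sketch. This is not the thesis's SA-HMM2 algorithm: for brevity it assumes a first-order discrete HMM rather than HMM2, a single observation sequence, a geometric cooling schedule, and a simple one-entry perturbation. The function names and all parameter values (`t0`, `cooling`, `steps`, `step`) are hypothetical choices made for illustration.

```python
import math
import random


def normalize(row):
    """Scale a list of positive weights so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(O | lambda) for lambda = (pi, A, B)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_p += math.log(scale)          # product of scales equals P(O | lambda)
        alpha = [a / scale for a in alpha]
    return log_p


def perturb(params, rng, step=0.1):
    """Neighbor of (pi, A, B): jitter one entry of one random row, renormalize."""
    pi, A, B = params
    pi, A, B = pi[:], [r[:] for r in A], [r[:] for r in B]
    row = rng.choice([pi] + A + B)        # references into the copies above
    k = rng.randrange(len(row))
    row[k] = max(1e-6, row[k] + rng.uniform(-step, step))
    row[:] = normalize(row)
    return pi, A, B


def anneal(obs, n_states, n_symbols, t0=1.0, cooling=0.95, steps=2000, seed=0):
    """Maximize log P(O | lambda) by simulated annealing (assumed schedule)."""
    rng = random.Random(seed)

    def rand_row(m):
        return normalize([rng.random() + 1e-3 for _ in range(m)])

    current = (
        rand_row(n_states),
        [rand_row(n_states) for _ in range(n_states)],
        [rand_row(n_symbols) for _ in range(n_states)],
    )
    cur_score = log_likelihood(obs, *current)
    best, best_score = current, cur_score
    temp = t0
    for _ in range(steps):
        cand = perturb(current, rng)
        cand_score = log_likelihood(obs, *cand)
        delta = cand_score - cur_score
        # Metropolis rule: accept improvements, sometimes accept worse moves.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = cand, cand_score
            if cur_score > best_score:
                best, best_score = current, cur_score
        temp = max(temp * cooling, 1e-6)  # geometric cooling with a floor
    return best, best_score


if __name__ == "__main__":
    observations = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
    (pi, A, B), score = anneal(observations, n_states=2, n_symbols=2)
    print("best log P(O|lambda):", score)
```

Accepting some worse candidates at high temperature is what lets the search leave a local optimum that would trap a purely greedy update. Extending the sketch to HMM2 would require a forward recursion over pairs of consecutive states, which is omitted here.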
Keywords/Search Tags: Web information extraction, text classification, hidden Markov model, second-order hidden Markov model, simulated annealing algorithm, K-nearest neighbor algorithm