Education News Extraction and Classification Based on the HMM

Posted on: 2013-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: J G Liang
Full Text: PDF
GTID: 2248330395452897
Subject: Education Technology
Abstract/Summary:
With the rapid development of science and technology, the internet has become an indispensable part of people's daily lives. Given the vast amount of information online, finding useful information quickly and effectively is an urgent problem. It can be treated as a combination of Web information extraction and text classification.

This paper proposes a system based on the hidden Markov model (HMM) that performs automatic web page parsing, page preprocessing, information extraction, feature selection, and text classification. In brief, it extracts educational news from the web and classifies it. The classification results are stored in a structured database to support educational research and the practical management of education.

The paper first introduces the concepts of Web information extraction and text classification, analyzes and compares the common methods, and describes how their results are evaluated. It then introduces the hidden Markov model and its main algorithms.

Based on an analysis of the content and structural features of web pages, a solution for information extraction and text classification of educational news pages is discussed. To clean a web page, noise is first filtered out of its source document. A maximum string matching algorithm is then used to find the news headline, which gives a rough position for the main content of the news item. Finally, the HMM decoding algorithm tags the state of each part of that content. Deleting the parts tagged "Noise" leaves the useful information of the web page.

To classify the news text, we design and improve a hidden Markov model and analyze its feasibility. We combine term frequency-inverse document frequency (TF-IDF) with the χ² statistic to select feature items. The Apriori algorithm is then used to select groups of feature items that correlate strongly with a text's category. We calculate the relevancy between these features and each category and assign the text to the category with the maximal relevancy.

At the end of the paper, we give the main algorithms of the system based on the hidden Markov model. To test the system, 960 web pages and more than 3000 documents were downloaded. Experiments show that the hidden Markov model achieves high precision when used to extract and classify web news.
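The noise-removal step described in the abstract can be illustrated with a small sketch. It assumes that the decoding algorithm is the Viterbi algorithm (the standard HMM decoder; the abstract does not name it) and that each page block is reduced to a single coarse observation symbol. The state names, the observation symbols, and every probability below are hypothetical values chosen for illustration, not the thesis's trained model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations (log-space)."""
    if not observations:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Backtrack from the best final state to recover the full path.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: every value below is an illustrative assumption.
STATES = ["Noise", "Content"]
START_P = {"Noise": 0.7, "Content": 0.3}
TRANS_P = {
    "Noise": {"Noise": 0.7, "Content": 0.3},
    "Content": {"Noise": 0.2, "Content": 0.8},
}
EMIT_P = {
    "Noise": {"link_list": 0.6, "short_text": 0.3, "long_text": 0.1},
    "Content": {"link_list": 0.05, "short_text": 0.25, "long_text": 0.7},
}

# Each page block is paired with a coarse, hypothetical observation symbol.
blocks = [
    ("link_list", "Home | News | Contact"),
    ("long_text", "The ministry announced a new curriculum plan ..."),
    ("short_text", "Schools will adopt it next term."),
    ("long_text", "Teachers will receive training before the rollout ..."),
    ("link_list", "Copyright | Privacy | Sitemap"),
]

tags = viterbi([symbol for symbol, _ in blocks], STATES, START_P, TRANS_P, EMIT_P)
useful = [text for (_, text), tag in zip(blocks, tags) if tag != "Noise"]
```

With these toy parameters the navigation and footer blocks are tagged "Noise", so `useful` keeps only the three news-body blocks. A real system would estimate the probabilities from labeled pages and would likely use richer observations than a single symbol per block.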
Keywords/Search Tags: information extraction, text classification, hidden Markov model