Education News Extraction And Classification Based On The HMM

Posted on:2013-03-31

Degree:Master

Type:Thesis

Country:China

Candidate:J G Liang

Full Text:PDF

GTID:2248330395452897

Subject:Education Technology

Abstract/Summary:

With the rapid development of science and technology, the internet has become an indispensable part of people's daily life. Given the vast amount of information on the internet, finding useful information quickly and effectively is an urgent problem. This problem can be treated as a combination of Web information extraction and text classification.

This paper proposes a system based on the hidden Markov model (HMM) that performs automatic web page parsing, page preprocessing, information extraction, feature selection and text classification. In brief, it extracts and classifies educational news from the web. It stores the classification results in a structured database to serve educational research and practical education management.

First, this paper introduces the concepts of Web information extraction and text classification, analyzes and compares the common methods, and describes how the results are evaluated. It then introduces the hidden Markov model and its main algorithms.

Based on an analysis of the content and structural features of web pages, a solution for information extraction and text classification of educational news pages is discussed. To purify the web page, noise is first filtered from the page source. The maximum string matching algorithm is then used to find the news headline, which is important for roughly locating the main content of the news. Finally, the HMM decoding algorithm is used to tag the state of each part of that content. Deleting the parts tagged "Noise" leaves the useful information of the web page.

To classify the news text, we design and improve a hidden Markov model and analyze its feasibility. We combine term frequency-inverse document frequency (TF-IDF) and the χ² (chi-square) statistic to select feature items. The Apriori algorithm is used to further select groups of feature items that correlate strongly with the text's category. By calculating the relevance between these features and each category, we assign the text to the category with the highest relevance.

At the end of the paper, we give the main algorithms of the system based on the hidden Markov model. 960 web pages and more than 3000 documents were downloaded and used to test the system. Experiments show that the hidden Markov model achieves high precision when used to extract and classify web news.
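The noise-removal step described above (decode a state for each piece of page content, then delete what is tagged "Noise") can be illustrated with a minimal sketch. It assumes the decoding algorithm is standard Viterbi decoding; the abstract does not give the thesis's actual states, observation symbols or parameters, so the two states, the three block features (`long_text`, `short_text`, `link_list`) and every probability below are hypothetical values chosen for illustration only.

```python
import math

# Hypothetical two-state model; none of these values come from the thesis.
STATES = ("Content", "Noise")
START_P = {"Content": 0.4, "Noise": 0.6}
TRANS_P = {
    "Content": {"Content": 0.8, "Noise": 0.2},
    "Noise": {"Content": 0.4, "Noise": 0.6},
}
EMIT_P = {
    "Content": {"long_text": 0.60, "short_text": 0.35, "link_list": 0.05},
    "Noise": {"long_text": 0.05, "short_text": 0.25, "link_list": 0.70},
}


def viterbi(observations, states=STATES, start_p=START_P,
            trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable state sequence, computed in log space."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def strip_noise(blocks):
    """Keep the text of blocks whose decoded state is not 'Noise'.

    `blocks` is a list of (feature, text) pairs; the feature is the
    hypothetical observation symbol for that block.
    """
    tags = viterbi([feature for feature, _ in blocks])
    return [text for (_, text), tag in zip(blocks, tags) if tag != "Noise"]


# Hypothetical page split into four blocks.
page_blocks = [
    ("link_list", "Home | News | Contact"),
    ("long_text", "The ministry announced a new curriculum policy for schools."),
    ("short_text", "It takes effect in May."),
    ("link_list", "Related links"),
]
tags = viterbi([feature for feature, _ in page_blocks])
kept = strip_noise(page_blocks)
print(tags)
print(kept)
```

With these illustrative parameters, the navigation and link blocks are tagged "Noise" and only the two news sentences remain. In a real system the parameters would be estimated from labeled pages rather than set by hand.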

Keywords/Search Tags:

information extraction, text classification, hidden Markov model

Related items

1	Algorithm Research For Text Information Extraction Based On Hidden Markov Model
2	Based On The Hmm Education News Extraction And Classification
3	Web Free Text Information Extraction Based On TABLE Layout And Hidden Markov Model
4	Parameter Estimation Of Hidden Markov Model And Its Application In News Classification
5	Research On Heterogeneous Academic Information Extraction And Aggregation Based On Web
6	Research And Implementation Of Web Information Extraction Based On Improved Hidden Markov Model
7	The Algorithm Research Of Chinese Information Extraction Based On The Hidden Markov Model
8	Research On Spatial And Temporal Information Extraction In Unstructured Text
9	Text Classification Based On Hidden Markov Model And Semantic Fusion
10	Application Research Of Hidden Markov Model In Information Extraction