Research On Internet Public Opinion Information Extraction And Classification

Posted on:2009-11-25

Degree:Master

Type:Thesis

Country:China

Candidate:X O Jin

Full Text:PDF

GTID:2178360242977091

Subject:Communication and Information System

Abstract/Summary:

This thesis examines the extraction and classification of Internet public opinion information. Using the Rhino script engine, it addresses the problem of extracting dynamic web pages generated by JavaScript. By combining text classification based on the vector space model (VSM) with classification based on semantics, and by modifying the traditional approach to text orientation, it improves the accuracy and generality of text orientation classification.

The first part discusses the extraction of JavaScript dynamic web pages. It reviews prior work in web page extraction, covering the basics of web page extraction, the HTML markup language, the HTTP network protocol, URLs, and related topics. It also surveys research on dynamic web pages and points out that the rapid growth of web page technology makes some modifications to web page extraction necessary.

Building on this work, the thesis explains the hyperlink extraction mechanism in detail. Traditional web page extraction can extract the hyperlink addresses contained in a page by relying on features of HTML, but it cannot extract hyperlink addresses that are hidden in the page. As web page technology develops, more and more hyperlink addresses are hidden in page scripts. The analysis therefore concludes that adding a script engine to web page extraction is one of the best solutions to this problem. Experiments verify that this solution improves the web page extraction rate.

The second part discusses the orientation classification of web page text. It reviews prior work in text classification, covering the basics of text classification, word segmentation, text representation, feature selection, classification algorithms, and related topics. It also points out that, as the technology develops, text orientation classification will become an important research direction.

Building on this work, the thesis explains the classification algorithm in detail. Existing techniques can classify text into topics such as sports, entertainment, and politics, but they do not classify text well according to the sentiment expressed by the author. The analysis therefore concludes that combining classification based on the vector space model with classification based on semantics is one of the best solutions to this problem. Experiments verify that this solution performs text orientation classification effectively.
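
The abstract does not say how the vector space model and the semantic classifier are combined. The following is only a minimal sketch of one possible combination, not the thesis's method. It assumes a weighted sum of a centroid-based cosine score and a lexicon-based polarity score. The training sentences, the lexicon, the weight `alpha`, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy training data; the thesis corpus is not available here.
TRAIN = {
    "positive": ["good service and great quality", "great price good value"],
    "negative": ["bad service and poor quality", "poor value bad price"],
}

# Hypothetical polarity lexicon standing in for the semantic component.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "poor": -1.0}


def vectorize(text):
    """Term-frequency vector over whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centroids(train):
    """Sum the term-frequency vectors of each class into one centroid."""
    return {
        label: sum((vectorize(doc) for doc in docs), Counter())
        for label, docs in train.items()
    }


def vsm_score(text, cents):
    """Similarity to the positive centroid minus similarity to the negative one."""
    vec = vectorize(text)
    return cosine(vec, cents["positive"]) - cosine(vec, cents["negative"])


def semantic_score(text, lexicon):
    """Mean polarity of the tokens that appear in the lexicon (0.0 if none do)."""
    hits = [lexicon[t] for t in text.lower().split() if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


def classify(text, cents, lexicon, alpha=0.5):
    """Weighted sum of the two scores; alpha is an assumed mixing weight."""
    score = alpha * vsm_score(text, cents) + (1 - alpha) * semantic_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


CENTS = centroids(TRAIN)
print(classify("great quality", CENTS, LEXICON))
```

In this sketch, the lexicon score can still indicate orientation when a text shares few terms with the training data, and the vector space score contributes evidence from words outside the lexicon. A real system for Chinese text would need word segmentation in place of whitespace splitting.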

Keywords/Search Tags:

Web Page Extraction, Text Orientation Classification, Dynamic Web Page

Related items

1	The Study And Implementation On The Key Problems Of Intelligent Search Engine Technology
2	Research On Web Page Classification And Information Collection
3	Research And Implementation On A Web Page Classification System
4	Research And Implementation On Key Technology Of Web Text Collection And Analysis
5	Research On WEB Page Classification Algorithms Based On Text Semantic Graph
6	Research On Improved KNN Chinese Web Page Classification Based On Weka Platform
7	Research On Multi-page Special Web Page Text Extraction And Merging Technology
8	Study On Web Page Rationality Of Universities’ Websites In Hebei Province
9	Research Of Web Page Purifying Method Based On Document Object Model
10	Research And Implement Of Topic Oriented Web Page Classification Technique