Research on Internet Public Opinion Information Extraction and Classification

Posted on: 2009-11-25
Degree: Master
Type: Thesis
Country: China
Candidate: X O Jin
Full Text: PDF
GTID: 2178360242977091
Subject: Communication and Information System
Abstract/Summary:
This thesis examines the extraction and classification of Internet public opinion information. Using the Rhino script engine, it effectively addresses the extraction of dynamic web pages driven by JavaScript. By combining text classification based on the vector space model (VSM) with classification based on semantics, and by modifying the traditional approach to text orientation classification, it improves the accuracy and generality of text orientation classification.

The first part discusses the extraction of JavaScript dynamic web pages. It reviews prior work on web page extraction and surveys the underlying fundamentals, including the HTML markup language, the HTTP protocol, and URLs. It also surveys research on dynamic web pages and points out that web page extraction must be adapted as web technology develops rapidly.

Building on this work, the thesis explains the hyperlink extraction mechanism in detail. Traditional web page extraction relies on HTML features to extract the hyperlink addresses contained in a page, but it cannot extract hyperlink addresses that are hidden in the page. As web technology has developed, more and more hyperlink addresses are hidden in page scripts. The analysis therefore concludes that adding a script engine to the extraction process is one of the best solutions to this problem. Experiments verify that this solution improves the web page extraction rate.

The second part discusses text orientation classification of web pages. It reviews prior work on text classification and surveys the basic research on text classification, word segmentation, text representation, feature selection, and classification algorithms. It also points out that text orientation classification will become an important research direction as technology develops.

Building on this work, the thesis explains the classification algorithm in detail. Existing techniques can classify text into topics such as sports, entertainment, and policy, but they do not classify text well according to the sentiment expressed by the author. The analysis therefore concludes that combining classification based on the vector space model with classification based on semantics is one of the best solutions to this problem. Experiments verify that this solution performs text orientation classification effectively.
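The hidden-hyperlink problem described in the first part can be illustrated with a small sketch. It is not the thesis's implementation and does not use Rhino: the sample page, the `LinkCollector` class, and the `WRITE_CALL` pattern are all hypothetical, and the "script pass" is a toy stand-in that only understands `document.write('...')` with a single string literal. A real script engine would execute the script instead. The sketch only shows why extraction based on HTML features alone misses links that a script produces.

```python
from html.parser import HTMLParser
import re

# Hypothetical sample page: one static link, one link written by a script.
PAGE = """
<html><body>
<a href="/news/1.html">static link</a>
<script>
document.write('<a href="/news/2.html">written by script</a>');
</script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags and the raw source of <script> blocks."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.scripts = []
        self._in_script = False
        self._script_chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "script":
            self._in_script = True
            self._script_chunks = []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            # Script text may arrive in several chunks, so join them here.
            self.scripts.append("".join(self._script_chunks))
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self._script_chunks.append(data)


def static_links(html):
    """Extraction based on HTML features only: returns (links, script sources)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links, parser.scripts


# Toy stand-in for a script engine (assumption): matches only
# document.write('<single-quoted literal>') and nothing else.
WRITE_CALL = re.compile(r"document\.write\(\s*'([^']*)'\s*\)")


def links_with_toy_script_pass(html):
    """Static links plus links found in markup that scripts would write."""
    links, scripts = static_links(html)
    for source in scripts:
        for written_markup in WRITE_CALL.findall(source):
            extra_links, _ = static_links(written_markup)
            links.extend(extra_links)
    return links


if __name__ == "__main__":
    print(static_links(PAGE)[0])
    print(links_with_toy_script_pass(PAGE))
```

The static pass sees only the link present in the markup, while the second pass also recovers the link that the script would write into the page. Links assembled through string concatenation, variables, or event handlers would defeat the toy pattern, which is the case for which the thesis proposes a real script engine.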
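The abstract does not say how the vector space model and the semantic approach are combined, so the following is only a minimal sketch under stated assumptions. The training sentences, the `LEXICON` of word polarities, the centroid-based cosine scoring, and the linear `weight` are all hypothetical choices made for illustration, not the thesis's method.

```python
import math
from collections import Counter

# Hypothetical toy training data (assumption, not from the thesis).
TRAIN = {
    "positive": [
        "the service was good and the staff were helpful",
        "a good product with excellent quality",
    ],
    "negative": [
        "the service was bad and the staff were rude",
        "a terrible product with poor quality",
    ],
}

# Hypothetical polarity lexicon standing in for the semantic component.
LEXICON = {
    "good": 1.0, "excellent": 1.0, "helpful": 1.0,
    "bad": -1.0, "terrible": -1.0, "poor": -1.0, "rude": -1.0,
}


def tokenize(text):
    """Whitespace tokenizer; Chinese text would need word segmentation first."""
    return text.lower().split()


def centroid(docs):
    """Sum of term-frequency vectors for one class."""
    total = Counter()
    for doc in docs:
        total.update(tokenize(doc))
    return total


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


CENTROIDS = {label: centroid(docs) for label, docs in TRAIN.items()}


def vsm_score(text):
    """Positive when the text is closer to the positive class centroid."""
    vec = Counter(tokenize(text))
    return cosine(vec, CENTROIDS["positive"]) - cosine(vec, CENTROIDS["negative"])


def lexicon_score(text):
    """Average word polarity from the lexicon; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(LEXICON.get(t, 0.0) for t in tokens) / len(tokens)


def classify(text, weight=0.5):
    """Linear combination of the two scores (the weighting is an assumption)."""
    score = weight * vsm_score(text) + (1 - weight) * lexicon_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(classify("the staff were helpful and the quality was excellent"))
    print(classify("the product was terrible and the staff were rude"))
```

The vector space score captures overall similarity to labeled examples, while the lexicon score reacts to individual opinion words. Combining them is one plausible reading of how the two approaches might complement each other.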
Keywords/Search Tags: Web Page Extraction, Text Orientation Classification, Dynamic Web Page