
Study Based On The Contents And Characteristics Of Web Information Extraction Methods

Posted on: 2011-07-26
Degree: Master
Type: Thesis
Country: China
Candidate: J Gao
Full Text: PDF
GTID: 2208360308475866
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, Web information has become an important source of information and has brought great convenience to people's work, study, and life. However, the Web also carries a large amount of pornographic, reactionary, violent, and other unhealthy content, and some lawbreakers use BBS, e-mail, and similar channels for reactionary propaganda, fraud, extortion, terrorist threats, and other illegal activities. Such illegal information wastes network resources, undermines a harmonious and civilized society, and does great harm to social order. It is therefore necessary to identify the true author of Web information and take appropriate measures. Senders, however, usually try to hide their true identity to avoid detection, which makes the true author difficult to find. Identifying the true author of Web information has thus become an urgent problem.

In research on authorship identification of Web information, extracting the content and features of that information is the key underlying problem. Web information is semi-structured, so it cannot be used directly without first being parsed. This research therefore has considerable practical value.

Against this background, this thesis studies the content and feature extraction of Web information in depth. It takes Chinese Web text as its object of study and, drawing on feature extraction methods for Chinese e-mail, analyzes methods for extracting the content and features of Web information.

The thesis first introduces the current state of the field and reviews existing techniques and methods. It then analyzes the format and content of Web information in detail and studies methods for Web content extraction, covering Web page content extraction and e-mail information extraction and decoding.
Keywords/Search Tags: Web information, Literary Feature, Feature Selection & Extraction