The Design And Implementation Of Text Data Acquisition System Focused On News Field

Posted on: 2011-07-18
Degree: Master
Type: Thesis
Country: China
Candidate: B L Wang
GTID: 2178360308462282
Subject: Computer Science and Technology
Abstract/Summary:
Automatic text classification assigns a text to one or more predefined categories based on an analysis of its content. As the volume of information grows and its sources diversify, searching for and collecting information manually is both laborious and inefficient, and the quality of the results cannot be guaranteed. Effective methods for acquiring topic-related information and storing it by category have therefore become an active topic in information processing research.

This thesis presents a text data acquisition system for the news domain, designed as an information management system that integrates information management with information acquisition. On the one hand, the system provides a user-friendly management interface for news information. On the other hand, it acquires text automatically: it fetches news text from the Internet, classifies it, and stores it.

Building on a study of web crawling, web filtering, text representation, and Chinese text classification methods, the thesis describes the design and implementation of the automatic web information acquisition function. News pages fetched from the Internet by the web crawler are saved locally as text. Web content extraction techniques are then applied to these pages to extract information such as the news headline and body. Finally, text feature extraction is used to obtain the features of each news item, which is then mapped to a category using a Bayesian classification algorithm.

The system is described in terms of requirements analysis, design, and implementation, and the related technologies it uses are analyzed. The thesis concludes with a description of the testing and an analysis of the test results.
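
The final step of the pipeline, mapping a news text to a category with a Bayesian classifier, can be illustrated with a small sketch. This is not the thesis's implementation: it assumes a multinomial naive Bayes model with add-one smoothing, assumes the text has already been segmented into word tokens by an upstream step, and uses a hypothetical class name (NaiveBayesClassifier) and made-up training examples purely for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Illustrative multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_doc_counts = Counter()        # documents seen per category
        self.term_counts = defaultdict(Counter)  # token frequencies per category
        self.vocabulary = set()                  # all tokens seen in training

    def train(self, tokens, label):
        """Add one already-segmented document to the model."""
        self.class_doc_counts[label] += 1
        self.term_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def classify(self, tokens):
        """Return the category with the highest log posterior (None if untrained)."""
        total_docs = sum(self.class_doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.class_doc_counts.items():
            # Log prior: share of training documents in this category.
            score = math.log(doc_count / total_docs)
            total_terms = sum(self.term_counts[label].values())
            for token in tokens:
                # Smoothed log likelihood of the token given the category.
                count = self.term_counts[label][token]
                score += math.log((count + 1) / (total_terms + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-made training data (tokens stand in for segmented words).
clf = NaiveBayesClassifier()
clf.train(["team", "match", "goal", "win"], "sports")
clf.train(["league", "goal", "coach"], "sports")
clf.train(["stock", "market", "bank"], "finance")
clf.train(["bank", "rate", "market", "profit"], "finance")

print(clf.classify(["goal", "coach", "match"]))  # prints "sports"
```

Working in log space avoids numeric underflow when a document contains many tokens, and the add-one smoothing keeps a single unseen word from driving a category's probability to zero. A real system would feed this step with selected text features rather than every raw token.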
Keywords/Search Tags: web crawlers, Chinese word segmentation, web content extraction, text features, text categorization