Research and Application of Text-Categorization Algorithms in Large-Scale Heterogeneous Environments

Posted on: 2013-10-31
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Wang
Full Text: PDF
GTID: 2248330371968759
Subject: Computer application technology
Abstract/Summary:
Network-centric computer applications have entered a period of rapid growth, with new application environments and requirements appearing all the time. In large-scale systems such as search engines and social networks, data is generated at a very high rate. How to process this data effectively and mine the value behind it is a problem that industry is trying to solve. Meanwhile, most data is heterogeneous, which makes it challenging to use effectively. Text categorization remains a key technology in large-scale data processing environments. Although traditional classification methods have many inherent strengths, they are limited in speed and are therefore poorly suited to high data-flow environments. To address these problems, this dissertation makes several attempts, including:
1) Proposing a fast text-categorization algorithm based on computations over single Chinese characters.
2) Designing a simple, scalable distributed web crawler to fetch web content rapidly.
3) Exploring how to unify heterogeneous data via an XML-based method and, for the web page processing stage, proposing an algorithm that extracts text content from a web page through a DOM-based view.
4) Implementing a practical general-purpose search system equipped with a categorization function, which gives users fine-grained control over their searches and improves the user experience.
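The abstract does not describe how the single-character categorization in item 1) is computed. As a rough illustration of why working on individual characters can be fast (no word segmentation step is needed for Chinese text), the sketch below uses a standard multinomial Naive Bayes classifier over character unigrams. This is a swapped-in, well-known technique, not the thesis's algorithm; the class name `CharNaiveBayes`, the smoothing parameter `alpha`, and the toy training data are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


class CharNaiveBayes:
    """Multinomial Naive Bayes over single characters (illustrative only,
    not the algorithm proposed in the thesis)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing constant (assumed value)
        self.char_counts = defaultdict(Counter)  # label -> character frequencies
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocab = set()                       # all characters seen in training

    @staticmethod
    def _chars(text):
        # Each non-whitespace character is one feature; no word segmentation.
        return [c for c in text if not c.isspace()]

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            chars = self._chars(text)
            self.char_counts[label].update(chars)
            self.doc_counts[label] += 1
            self.vocab.update(chars)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, counts in self.char_counts.items():
            total_chars = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods per character.
            score = math.log(self.doc_counts[label] / total_docs)
            for c in self._chars(text):
                score += math.log(
                    (counts[c] + self.alpha)
                    / (total_chars + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Toy data, invented for the example.
    clf = CharNaiveBayes().fit(
        ["足球比赛进球", "篮球比赛得分", "股票市场上涨", "银行利率下调"],
        ["sports", "sports", "finance", "finance"],
    )
    print(clf.predict("球赛"))
```

Training and prediction are both a single pass over the characters of the text, which is the property that makes character-level features attractive for high data-flow settings; whether the thesis uses a probabilistic model of this kind is not stated in the abstract.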
Keywords/Search Tags: Heterogeneous Data Uni-processing, Text-Categorization, Web Page Content Extraction, Data Processing in Large-scale Mode, Information Acquisition