A Research On Automatic Web Text Classification Technology

Posted on: 2007-01-23    Degree: Master    Type: Thesis
Country: China    Candidate: D X Cui    Full Text: PDF
GTID: 2178360242961844    Subject: Computer software and theory
Abstract/Summary:
As an effective technique for discovering potentially valuable knowledge in the massive text resources of the WWW, Web Text Mining is on the rise. Web Text Classification is a hot topic within Web Text Mining. The performance of generic machine learning classification algorithms can be improved by making better use of the characteristics of web text data, so it is necessary to study better ways of combining generic classification algorithms with web text data.

As an important application of text classification, a junk mail filtering system must account for the different costs of different kinds of misclassification. After defining a loss function and combining it with Bayes' theorem, a minimal-loss-based filtering method is designed, which represents a mail message as a Boolean vector and selects features by information gain (IG). The experimental results on PU1 verify the effectiveness of defining such a loss function.

To make full use of the characteristics of web documents, a web text is represented as a sequence whose minimal element is a word, so that the rich semantic information implied by the relative positions of terms can be taken into account. A solution to DNA sequence analysis problems in computational biology is applied to text classification, yielding a text classification method called SSAM, which uses signature sequences to describe the characteristics of each class. The experimental results on Reuters-21578 indicate that SSAM performs better than Naïve Bayes and classifies quickly.

The web text classification procedure is divided into several steps: constructing the text collection, preprocessing web pages, training, and classifying. An SSAM-based automatic web text classification system that can process Chinese text is designed, and its prototype is implemented in Visual C# on a PC.
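
The minimal-loss filtering idea mentioned in the abstract (a loss function combined with Bayes' theorem over Boolean mail vectors) can be illustrated with a small sketch. This is not the thesis's implementation. The function names, the Laplace smoothing, the Bernoulli naive Bayes model, and the loss values (blocking a legitimate mail costs 9, letting spam through costs 1) are all assumptions made for illustration. Feature selection by information gain is omitted, so the vocabulary is taken as already selected.

```python
import math

def train_bernoulli_nb(docs, labels, vocab):
    """Estimate class priors and P(word present | class).

    docs is a list of word sets (the Boolean vector, stored sparsely),
    labels is a parallel list of class names. Smoothing is an assumption.
    """
    prior, cond = {}, {}
    for c in sorted(set(labels)):
        in_class = [d for d, y in zip(docs, labels) if y == c]
        prior[c] = len(in_class) / len(docs)
        # Laplace-smoothed presence probability for each selected word
        cond[c] = {w: (sum(w in d for d in in_class) + 1) / (len(in_class) + 2)
                   for w in vocab}
    return prior, cond

def posterior(doc, prior, cond, vocab):
    """Return P(class | doc) via Bayes' theorem, computed in log space."""
    log_joint = {}
    for c in prior:
        lp = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            lp += math.log(p if w in doc else 1.0 - p)
        log_joint[c] = lp
    top = max(log_joint.values())
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return {c: math.exp(v - top) / norm for c, v in log_joint.items()}

def min_loss_decision(post, loss):
    """Pick the decision with the smallest expected loss.

    loss[(decision, true_class)] is the cost of making `decision`
    when the mail really belongs to `true_class`.
    """
    decisions = sorted({d for d, _ in loss})
    risk = {d: sum(loss[(d, t)] * post[t] for t in post) for d in decisions}
    return min(decisions, key=lambda d: risk[d]), risk

# Hypothetical asymmetric loss: a blocked legitimate mail is 9x worse
# than a spam mail that slips through.
LOSS = {("spam", "spam"): 0.0, ("spam", "legit"): 9.0,
        ("legit", "spam"): 1.0, ("legit", "legit"): 0.0}
```

With these assumed costs, a mail is filtered as spam only when its posterior spam probability exceeds 0.9, so a mail that is "probably spam" at 0.8 is still delivered. A symmetric loss would reduce the rule to picking the most probable class.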
Keywords/Search Tags: Text Classification, Vector Space Model, Minimal Loss, Naïve Bayes, Signature Sequence Analysis