Research And Implementation Of Key Technologies On Web Text Classification

Posted on: 2013-12-06  Degree: Master  Type: Thesis
Country: China  Candidate: H Liu  Full Text: PDF
GTID: 2248330395455641  Subject: Computer application technology
Abstract/Summary:
Nowadays the world is filled with all kinds of information, and Web text in electronic form has gradually become people's most important source of information. However, Web text is unorganized and constantly changing, and a web page is far more complex than a plain text document. How to obtain the required, useful information from the Internet efficiently and rapidly has therefore become a major research topic. In response to this need, a new technique called Web Text Mining has emerged. It covers four aspects: web text classification, web text clustering, information extraction, and information retrieval. This paper mainly discusses web text classification.

In the field of web text classification, the support vector machine (SVM) has been widely used. The SVM is a machine learning method based on statistical learning theory and the structural risk minimization principle. Compared with conventional machine learning methods, the SVM generalizes well and yields a globally optimal solution. It also avoids problems such as overfitting, the curse of dimensionality, and local extrema. Because of these advantages, it has become a research hotspot. However, as a relatively new theory, the SVM still leaves room for further research and improvement. In particular, classifying massive data sets and reclassifying after the data set has been updated are the key difficulties.

This paper first introduces web text mining and analyzes its key techniques. Second, it discusses the basic concepts and related theory of statistical learning and the SVM. In addition, because the SVM has several defects when classifying massive data sets, such as high memory usage, slow convergence, and disregard of previous learning results, an improved algorithm is proposed to solve the multi-class problem. This algorithm combines the SVM with incremental learning. After the data set is updated, it retains the result of the previous learning and classifies only the new data, so that a continuous learning process is formed. Finally, the improved algorithm is applied in a Web Text Mining system, where it achieves a better classification result.
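
The abstract does not spell out how the previous learning result is retained. A minimal sketch of one common incremental SVM scheme, assuming the retained result is the set of support vectors from the previous round, could look as follows. The function names, the hyperparameters (`lam`, `lr`, `epochs`, `tol`), the subgradient solver and the binary two-dimensional toy data are all illustrative assumptions, not the thesis's algorithm, which targets the multi-class case.

```python
# Hypothetical sketch: binary linear SVM with support-vector-based increments.
# All names and hyperparameters are assumptions made for illustration.

def decision_value(w, b, x):
    """Signed score w.x + b of sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(xs, ys, lam=0.1, lr=0.05, epochs=500):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]   # gradient of (lam / 2) * |w|^2
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * decision_value(w, b, x) < 1.0:   # sample violates the margin
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def support_vectors(w, b, xs, ys, tol=0.15):
    """Samples on or inside the margin, up to a tolerance for the
    approximate solver. These stand in for the previous learning result."""
    kept = [(x, y) for x, y in zip(xs, ys)
            if y * decision_value(w, b, x) <= 1.0 + tol]
    return [x for x, _ in kept], [y for _, y in kept]


def incremental_update(w, b, old_xs, old_ys, new_xs, new_ys):
    """Retrain on the old support vectors plus the new batch only."""
    sv_xs, sv_ys = support_vectors(w, b, old_xs, old_ys)
    train_xs, train_ys = sv_xs + new_xs, sv_ys + new_ys
    w, b = train_linear_svm(train_xs, train_ys)
    return w, b, train_xs, train_ys


def predict(w, b, x):
    """Class label in {+1, -1}."""
    return 1 if decision_value(w, b, x) >= 0.0 else -1


# Usage: first batch, then an update with newly arrived samples.
x_old = [[2.0, 2.0], [-2.0, -2.0], [3.0, 3.0],
         [-3.0, -3.0], [2.0, 3.0], [-2.0, -3.0]]
y_old = [1, -1, 1, -1, 1, -1]
w, b = train_linear_svm(x_old, y_old)

x_new = [[1.0, 1.0], [-1.0, -1.0], [4.0, 2.0], [-4.0, -2.0]]
y_new = [1, -1, 1, -1]
w, b, kept_x, kept_y = incremental_update(w, b, x_old, y_old, x_new, y_new)
print(len(kept_x), predict(w, b, [1.0, 1.0]))
```

The point of the design is that each update trains on the previous support vectors plus the new batch rather than on the full history, which addresses the memory and retraining costs the abstract attributes to the standard SVM. A multi-class version could wrap this binary learner in a one-vs-rest or one-vs-one scheme; that extension is likewise an assumption here.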
Keywords/Search Tags: Web text mining, Support vector machines, Multi-class problem, Incremental learning