
Knowledge Based Supervised Learning

Posted on: 2009-07-01
Degree: Master
Type: Thesis
Country: China
Candidate: C L Zhang
Full Text: PDF
GTID: 2178360275970259
Subject: Computer application technology
Abstract/Summary:
This thesis studies the problem of learning with knowledge-based data. Traditional machine learning algorithms rely on high-quality labeled data to build models and predict unlabeled data. However, labeling data is well known to be time-consuming and costly, and it has become a bottleneck for the development of machine learning. Web page and text classification are important applications of machine learning, and classifying web pages efficiently requires a large number of labeled documents. In this thesis, we note a trend in the current web: with the development of web services and applications, more and more public data is available on the internet. This data contains not only flat text but also extra information such as labels and structure. Since anyone can easily obtain such web information, we are interested in how to use it to supervise the machine learning process, especially text classification, and thereby reduce the need for labeled data and documents.

To this end, the thesis addresses the problem from two aspects: first, we design a knowledge data acquisition algorithm; second, we design a knowledge-supervised learning algorithm.

For knowledge data acquisition, we study how to label web page data automatically so that it becomes knowledge. Our idea is to take an existing large taxonomy and classify web pages into it. The difficulty is that there are too many candidate classes, which causes traditional machine learning and text classification algorithms to perform poorly. Moreover, the large scale demands high efficiency, which rules out complicated algorithms and the incorporation of much extra information. The Naive Bayes classifier is fast, efficient, and easy to implement, all valuable properties for this problem. Although Naive Bayes performs very poorly in the presence of a large number of classes, the thesis analyzes its characteristics in depth and identifies two severe problems that greatly hurt its performance. By fixing these two problems, the thesis significantly improves the classifier's performance and makes it able to provide reliable knowledge.

For the knowledge-supervised algorithm, the thesis studies how to use categorized knowledge data in place of training data while still reaching good performance. The difficulty is that the knowledge covers a large number of semantic topics, whereas the test data is usually shallow and covers only a few topics. To overcome this obstacle, the thesis designs a two-stage risk minimization algorithm. In the first stage, the algorithm generates related knowledge data for the test data. In the second stage, the knowledge and the test data exchange information with each other, which mines the useful information underlying the knowledge more deeply. The entire algorithm is designed under the risk minimization framework. In the experiments, the algorithm performs very well: it not only significantly improves on the baseline but also achieves performance comparable to learning with labeled documents.
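
For context, the plain multinomial Naive Bayes baseline that the abstract refers to can be sketched roughly as follows. This is a minimal illustration, not the thesis's improved classifier, and it does not attempt to reproduce the two fixes the thesis describes. The function names `train_naive_bayes` and `classify`, the Laplace smoothing parameter `alpha`, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, alpha=1.0):
    """Train a multinomial Naive Bayes model.

    docs  -- list of (tokens, label) pairs (hypothetical input format)
    alpha -- Laplace smoothing constant (an assumption, not from the thesis)
    """
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)

    total_docs = sum(class_counts.values())
    log_prior = {}
    log_likelihood = {}
    for label, n_docs in class_counts.items():
        log_prior[label] = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        log_likelihood[label] = {
            word: math.log((word_counts[label][word] + alpha) / denom)
            for word in vocab
        }
    return {"log_prior": log_prior, "log_likelihood": log_likelihood}


def classify(model, tokens):
    """Return the class with the highest posterior log-score.

    Words outside the training vocabulary are skipped.
    """
    best_label, best_score = None, -math.inf
    # One scoring pass per candidate class: cost grows linearly with
    # the number of classes in the taxonomy.
    for label, prior in model["log_prior"].items():
        likelihood = model["log_likelihood"][label]
        score = prior + sum(likelihood[w] for w in tokens if w in likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy usage with made-up documents and labels.
toy_docs = [
    (["goal", "match", "team"], "sports"),
    (["team", "win", "match"], "sports"),
    (["stock", "market", "price"], "finance"),
    (["market", "trade", "price"], "finance"),
]
toy_model = train_naive_bayes(toy_docs)
predicted = classify(toy_model, ["match", "team"])
```

Training is a single pass over word counts and scoring is a sum of precomputed log-probabilities, which is where the speed and ease of implementation come from. The loop over classes in `classify` also shows why a very large taxonomy puts pressure on both efficiency and accuracy.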
Keywords/Search Tags: Knowledge, machine learning, text classification, Naive Bayes Classification, risk minimization framework