Dynamic Classification For Ultrahigh Dimensional Binary Data

Posted on: 2015-12-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G Y Guan
Full Text: PDF
GTID: 1228330431987614
Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the rapid development of science and technology and the arrival of the era of big data, datasets have become massive, ultrahigh dimensional, sparse, and time-varying. Statistics, as a basic data analysis technology, has attracted more and more attention, but it also faces new challenges. For example, document classification is an area of great importance, and many classification methods have been developed. How to extract useful information from such complex datasets and classify documents automatically is the main subject of this dissertation.

In this study, we present a novel method for Chinese document classification, namely the varying naive Bayes model. This is a data-driven method. The documents were generated by the Mayor Public Hotline (MPH) of Changchun, the capital city of Jilin Province in Northeast China. The goal is to automatically classify the MPH records from local residents (i.e., Chinese text documents) according to the functional departments of the local government. To this end, a bag of the most frequently used Chinese words in the MPH dataset is collected. According to whether each of these words appears in a document, a high dimensional vector of binary features can be constructed. Because the total number of words is huge, the dimension of the vector is ultrahigh. This makes the naive Bayes method particularly attractive. However, most of the words (or features) are irrelevant for classification. As a result, the task of feature screening becomes important.

We first study the problem of feature selection for ultrahigh dimensional binary data. To this end, a feature selection method based on L0-regularization for the naive Bayes model is proposed. This method is optimal in the sense of model selection. Both theoretical results and simulation studies confirm that our feature selection method is consistent in the ultrahigh dimensional case.
However, in practice, there is no clear dividing line between relevant and irrelevant features. Motivated by this, a method of feature weighting is proposed, with which the prediction accuracy can be further improved.

Past experience suggests that MPH documents recorded at different times of day might follow different classification patterns. Unfortunately, a standard naive Bayes model cannot take this into consideration. To solve the problem, we propose a method of varying naive Bayes modeling. The new method adopts a standard naive Bayes formulation for documents recorded at the same time of day. However, documents recorded at different times are allowed to follow different classification patterns. This is done by allowing the model parameters to vary smoothly and nonparametrically with the recording time. Nonparametric smoothing techniques are used to estimate the unknown parameters. A BIC-type criterion is proposed to identify important features. The asymptotic properties of the proposed method are investigated. Its outstanding performance is numerically confirmed on both simulated and MPH datasets.

Although our research is motivated by the MPH project, the developed methodology is applicable to any classification problem with binary features and a time-varying structure. It can also be extended naturally to continuous data and other discrete data. We expect this method to have broad application prospects.
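
As a rough illustration of the time-varying idea in the abstract, the following minimal sketch estimates Bernoulli naive Bayes parameters locally in time with kernel weights. It is not the dissertation's estimator: the Gaussian kernel, the fixed `bandwidth`, the smoothing constant `eps`, the rescaling of recording time to [0, 1], and all function and variable names are assumptions made for this example, and no feature screening or BIC-type selection is included.

```python
import numpy as np

def gaussian_kernel(u):
    # Unnormalized Gaussian kernel; the normalizing constant cancels below.
    return np.exp(-0.5 * u ** 2)

def varying_nb_predict(X, y, t, X_new, t_new, bandwidth=0.1, eps=1e-3):
    """Kernel-weighted Bernoulli naive Bayes (illustrative sketch).

    X: (n, p) binary features, y: (n,) class labels,
    t: (n,) recording times rescaled to [0, 1] (an assumption).
    """
    classes = np.unique(y)
    preds = []
    for x0, t0 in zip(X_new, t_new):
        # Observations recorded near time t0 receive larger weights.
        w = gaussian_kernel((t - t0) / bandwidth)
        scores = []
        for k in classes:
            wk = w * (y == k)
            prior = wk.sum() / w.sum()                    # local class probability
            theta = (wk @ X + eps) / (wk.sum() + 2 * eps)  # local P(x_j = 1 | class k)
            loglik = x0 @ np.log(theta) + (1 - x0) @ np.log(1 - theta)
            scores.append(np.log(prior + 1e-12) + loglik)
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)

# Toy data: feature 0 indicates class 1 early in the day and class 0 late.
n = 200
t = np.linspace(0.0, 1.0, n)
y = np.arange(n) % 2
X = np.zeros((n, 2))
X[:, 0] = np.where(t < 0.5, y, 1 - y)

X_new = np.array([[1.0, 0.0], [1.0, 0.0]])
t_new = np.array([0.1, 0.9])
pred = varying_nb_predict(X, y, t, X_new, t_new)
```

In this toy setting the same feature vector is assigned to different classes at the two query times, whereas a pooled naive Bayes fit would see feature 0 as uninformative. Because time of day is periodic, a practical version would probably need a circular distance between times, which this sketch omits.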
Keywords/Search Tags: Bayesian Information Criterion, Ultrahigh Dimensional Binary Data, Varying Naive Bayes, Chinese Document Classification, L0-regularization, Screening Consistency, Feature Selection, Feature Indicator