
Study On Selected Issues Of Large Scale Text Classification

Posted on: 2014-01-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Q Li
GTID: 1228330401467813
Subject: Computer application technology
Abstract/Summary:
Text classification is a fundamental problem in text information processing and has received wide attention. With the growth of the social internet, however, large-scale text information has emerged explosively, confronting text classification with new challenges. This dissertation explores the problem from the viewpoints of text representation and efficient SVM training:

1. A text is conventionally represented as a vector of word weights, where each weight is computed from the frequency of a word within a text and the number of texts in which the word appears. The distributions of these measurements are examined on real datasets. The results suggest that, during feature selection, words of middle frequency should be chosen preferentially, or that words should be divided into three groups: high frequency, middle frequency, and low frequency. They also suggest that the IDF factor should be strengthened by the frequency of a word over the whole dataset.

2. Phrases carry more semantic information than single words, but feature selection algorithms are traditionally used to choose a subset of phrases to represent a text. This dissertation observes that if phrases are instead chosen according to their level in the parse tree, recall can be improved, because a phrase at a given level reflects its role and function in the sentence where it occurs. Experimental results show that this text representation improves recall.

3. Semantic relationships between adjacent words are commonly used to refine the vector space model (VSM). Beyond that, dictionary-based semantic relationships between non-adjacent words, even words that never co-occur in any text, are exploited here. The dissertation uses coreference in context to boost word frequencies, so that the true frequencies of features are captured more accurately from a semantic viewpoint. Experimental results show that this representation also improves recall.
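The weighting discussed in point 1 builds on standard TF-IDF. The abstract does not give the enhanced IDF formula, so the sketch below shows only the conventional baseline weight tf(w, d) · log(N / df(w)); the `tf_idf` helper and the toy corpus are illustrative, not from the dissertation:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute standard TF-IDF weights for a list of tokenized documents.

    weight(w, d) = tf(w, d) * log(N / df(w)),
    where N is the number of documents and df(w) is the number of
    documents containing w.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each word once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

docs = [["svm", "text", "svm"], ["text", "kernel"], ["kernel", "cache"]]
w = tf_idf(docs)
```

A rare word such as "svm" (in one of three documents) gets a higher IDF than "text" (in two of three), which is exactly the behavior the frequency-grouping analysis in point 1 examines.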
4. Caching part of the kernel matrix is an important acceleration technique for SVM decomposition optimization, but the behavior of traditional decomposition algorithms does not always exhibit good locality. A three-layer working-set selection framework is proposed to localize the iterations of decomposition algorithms. Combined with multiple working-set selection strategies, it further accelerates traditional decomposition algorithms.

5. For large-scale text classification, refining the problem step by step is a good strategy. Intuitively, the profile of each class of data is the subset most important to the classification task. This dissertation fits each class of data with one hyperplane; modeled as a minimum enclosing ball (MEB) problem, the fitting problems can be solved by optimal core-set algorithms. Experimental results show that training the SVM on these small subsets is highly efficient and, moreover, yields a very sparse solution.

6. In contrast to fitting each class of data in isolation, an improved fitting model that accounts for the separation between the two classes is developed. The model not only fits each class of data with one plane but also keeps the other class of data on one side of that plane as far as possible. Experimental results show that a very sparse solution is obtained efficiently, with average generalization performance comparable to a standard SVM.

7. A second improved fitting model considering the separation between the two classes is explored. It fits each class of data with one plane and requires the other class not only to lie on one side of the plane but also to be as far from it as possible. This method retains average generalization performance similar to a standard SVM. Notably, it has higher potential training efficiency, because the fitting cost is nearly half of what training on the full data actually requires.
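The acceleration in point 4 rests on reusing cached kernel rows: a localized iteration keeps touching the same rows, so cache misses become rare. The dissertation's three-layer framework is not detailed in the abstract; the sketch below shows only the underlying LRU row cache that such localization aims to exploit (the class and its names are hypothetical):

```python
from collections import OrderedDict

class KernelRowCache:
    """LRU cache of kernel-matrix rows, as used inside SVM decomposition
    solvers: row i of K is recomputed only on a cache miss."""

    def __init__(self, kernel, data, capacity):
        self.kernel, self.data, self.capacity = kernel, data, capacity
        self.rows = OrderedDict()   # insertion order tracks recency
        self.misses = 0

    def row(self, i):
        if i in self.rows:
            self.rows.move_to_end(i)            # mark as recently used
        else:
            self.misses += 1
            if len(self.rows) >= self.capacity:
                self.rows.popitem(last=False)   # evict least recently used
            self.rows[i] = [self.kernel(self.data[i], x) for x in self.data]
        return self.rows[i]

# Toy linear kernel on three 2-D points; the second access to row 0 is a hit.
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
cache = KernelRowCache(dot, [(1, 0), (0, 1), (1, 1)], capacity=2)
r = cache.row(0)
r = cache.row(0)
```

A working-set strategy with good locality keeps `misses` low; one that jumps across the whole index set thrashes the cache, which is the behavior the proposed framework tries to avoid.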
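Points 5 through 7 reduce the per-class fitting problems to MEB instances solved by core-set algorithms. The abstract does not specify the exact algorithm, so the sketch below shows a generic Badoiu-Clarkson-style core-set iteration for the plain MEB, not the dissertation's fitting model: after roughly 1/&epsilon;&sup2; steps the center is a (1 + &epsilon;)-approximation, and the touched points form a small core set.

```python
import math

def meb_coreset(points, eps=0.05):
    """Badoiu-Clarkson style core-set algorithm for the minimum
    enclosing ball (MEB): repeatedly pull the center a shrinking step
    toward the current farthest point; the points ever touched form
    the core set."""
    c = list(points[0])
    core = {0}
    iters = math.ceil(1.0 / (eps * eps))
    for t in range(1, iters + 1):
        # index of the farthest point from the current center
        far = max(range(len(points)),
                  key=lambda i: sum((p - q) ** 2
                                    for p, q in zip(points[i], c)))
        core.add(far)
        # move the center a step of 1/(t+1) toward that point
        c = [q + (p - q) / (t + 1) for p, q in zip(points[far], c)]
    radius = max(math.dist(c, p) for p in points)
    return c, radius, core

# Four corners of a square: the optimal ball is centered at (1, 1)
# with radius sqrt(2); only two corners end up in the core set here.
center, radius, core = meb_coreset([(0.0, 0.0), (2.0, 0.0),
                                    (0.0, 2.0), (2.0, 2.0)])
```

The key property exploited in point 5 is that the core set is small and independent of the dataset size, so training an SVM on it is far cheaper than on the full data.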
Keywords/Search Tags: SVM, MEB, optimal core set, coreference analysis, syntactic phrase