Automatic assessment of non-topical properties of text by machine learning methods

Posted on: 2006-06-22
Degree: Ph.D.
Type: Dissertation
University: Rutgers The State University of New Jersey - New Brunswick
Candidate: Sun, Ying
Full Text: PDF
GTID: 1458390005499224
Subject: Information Science
Abstract/Summary:
This study takes a first step towards automatic classification of texts with regard to non-topical properties, using machine learning techniques together with simple linguistic features. The six properties investigated are "Accuracy", "Reliability", "Objectivity", "Depth", "Conciseness", and "Multiple points of view". The importance of these properties, in satisfying users' information needs and/or in controlling information quality, is widely accepted, and they are amenable to automatic analysis. We test the learnability of human judgments on these properties at two levels: the general level, where the goal is to construct global rules from a group of persons' judgments, and the individual level.

The "combined" corpus, used for general-level learning, is a combination of work by multiple judges on 3,200 texts. There are also five individual judgment corpora, with about 500 documents in each. The judgments are on 10-point Likert scales. Four statistical and machine learning techniques are applied to the automatic assessment task: linear regression, logistic regression, the C4.5 decision tree, and Support Vector Machines (SVM).

The linear regression experiments show that we cannot adequately differentiate texts at 10 distinct levels on each non-topical dimension. However, the experiments demonstrate that binary classification techniques (logistic regression and linear SVM), together with simple language and textual features, can automatically assess the six non-topical properties of documents at levels better than chance. Prediction performance improves when middle-range documents are removed from the dataset. The classification tasks for "Depth" and "Multiple points of view" are relatively easier than the other four tasks. With the current set of predictive features, we cannot yet identify the middle-range documents.

Automatic assessment of one individual judge's assignments is slightly, but not significantly, better than learning for all judges at once. For some judges and some properties, our method achieved very good classification results. The linear learning methods (logistic regression and linear SVM) are better tools than the Boolean-based decision tree method (C4.5). The more advanced SVM method does not show significant superiority over logistic regression. The models with good performance tend to contain intuitively reasonable features.
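The step of recasting 10-point judgments as a binary task with the middle range removed can be sketched as follows. This is only a minimal illustration, not the procedure used in the dissertation: the 1-10 coding of the scale, the cut-offs `low_max=4` and `high_min=7`, and the function names `binarize_likert` and `drop_middle_range` are all assumptions introduced here, since the abstract does not state where the middle range begins or ends.

```python
from typing import List, Optional, Tuple


def binarize_likert(score: int, low_max: int = 4, high_min: int = 7) -> Optional[int]:
    """Map a rating on an assumed 1-10 scale to a binary label.

    Returns 0 for low ratings, 1 for high ratings, and None for
    middle-range ratings. The cut-offs are hypothetical defaults.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= low_max:
        return 0
    if score >= high_min:
        return 1
    return None  # middle range: excluded from the binary task


def drop_middle_range(
    docs: List[Tuple[str, int]], low_max: int = 4, high_min: int = 7
) -> List[Tuple[str, int]]:
    """Keep only documents with clearly low or high ratings, relabelled 0/1."""
    kept = []
    for text, score in docs:
        label = binarize_likert(score, low_max, high_min)
        if label is not None:
            kept.append((text, label))
    return kept
```

The resulting (text, label) pairs could then be passed to any binary classifier, such as logistic regression or a linear SVM; widening the gap between the two cut-offs removes more middle-range documents at the cost of a smaller training set.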
Keywords/Search Tags: Non-topical properties, Machine learning, Automatic, Method, Logistic regression, Features, Classification, SVM