Automatic assessment of non-topical properties of text by machine learning methods

Posted on:2006-06-22

Degree:Ph.D

Type:Dissertation

University:Rutgers The State University of New Jersey - New Brunswick

Candidate:Sun, Ying

GTID:1458390005499224

Subject:Information Science

Abstract/Summary:

This study takes a first step towards automatic classification of texts with regard to non-topical properties, using machine learning techniques together with simple linguistic features. The six properties investigated are "Accuracy", "Reliability", "Objectivity", "Depth", "Conciseness", and "Multiple points of views". The importance of these properties, in satisfying users' information needs and/or in controlling information quality, has been widely accepted, and they are amenable to automatic analysis. We test the learnability of human judgments on these properties at two levels: the general level, with the goal of constructing global rules from a group of persons' judgments, and the individual level.

The "combined" corpus, used for general-level learning, combines the work of multiple judges on 3,200 texts. There are also five individual judgment corpora, with about 500 documents in each. The judgments are on 10-point Likert scales. Four statistical and machine learning techniques are applied to the automatic assessment task: linear regression, logistic regression, the C4.5 decision tree, and Support Vector Machines (SVM).

The linear regression experiments show that we cannot adequately differentiate texts at 10 distinct levels on each non-topical dimension. However, the experiments demonstrate that binary classification techniques (logistic regression and linear SVM), together with simple language and textual features, can automatically assess the six non-topical properties of documents at levels better than chance. Prediction performance improves when middle-range documents are removed from the dataset. The classification tasks for "Depth" and "Multi-views" are relatively easier than the other four tasks. With the current set of predictive features, we cannot yet identify the middle-range documents.

Automatic assessment of one individual judge's assignments is slightly, but not significantly, better than learning for all judges at once. For some judges and some properties, our method achieved very good classification results. The linear learning methods, logistic regression and linear SVM, are better tools than the Boolean-based decision tree (C4.5) method. The more advanced SVM method does not show significant superiority over logistic regression. The models with good performance tend to contain intuitively reasonable features.
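
The binary setup described in the abstract, in which 10-point judgments are reduced to two classes and middle-range documents are removed before training, might be sketched as follows. This is only an illustration: the abstract does not give the cut-off scores or say whether the scale runs from 1 to 10, so the thresholds `low_max` and `high_min`, the 1-10 range, and the function names `binarize_likert` and `drop_middle_range` are all assumptions, not the dissertation's actual procedure.

```python
from typing import List, Optional, Sequence, Tuple


def binarize_likert(score: int, low_max: int = 4, high_min: int = 7) -> Optional[int]:
    """Map a 1-10 Likert score to 0 (low), 1 (high), or None (middle range).

    The cut-offs low_max and high_min are illustrative assumptions,
    not values taken from the dissertation.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score <= low_max:
        return 0
    if score >= high_min:
        return 1
    return None  # middle-range judgment


def drop_middle_range(
    docs: Sequence[str],
    scores: Sequence[int],
    low_max: int = 4,
    high_min: int = 7,
) -> List[Tuple[str, int]]:
    """Keep only documents with a clearly low or clearly high score.

    Returns (document, binary_label) pairs that could then be passed to
    a binary classifier such as logistic regression or a linear SVM.
    """
    kept = []
    for doc, score in zip(docs, scores):
        label = binarize_likert(score, low_max, high_min)
        if label is not None:
            kept.append((doc, label))
    return kept
```

With these assumed thresholds, `drop_middle_range(["a", "b", "c", "d"], [2, 5, 9, 7])` returns `[("a", 0), ("c", 1), ("d", 1)]`: document "b", scored 5, falls in the middle range and is excluded.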

Keywords/Search Tags:

Non-topical properties, Machine learning, Automatic, Method, Logistic regression, Features, Classification, SVM

Related items

1	Research On Image Recognition Techniques Of Decorated Granite Based On Color Features And Logistic Regression
2	Application Of Machine Learning Classification Algorithm In Resident Income Prediction
3	The Research On Transductive Transfer Learning With The Logistic Regression Model
4	Research On Logistic Regression Learning Algorithm For Imbalanced Problem
5	Research And Development Of Key Technologies On Key Words Extraction And Sentiment Classification Of Video Website Comment
6	Application Of Gradient Descent Method In Machine Learning
7	Research On Automatic Classification Algorithms For All-sky Cloud Image
8	Realization Of Machine Learning Classification Algorithms In The Hadoop Development Environment
9	Research On Regression And Classification Methods Based On Multiple Parallel Extreme Learning Machine
10	ADMM-type Algorithms For Regression Problems Based On Regularization