Research On Text Representation Technologies For Readability Assessment

Posted on: 2019-02-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z W Jiang
GTID: 1318330545975713
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of Internet technology, both the number and the age range of Internet users are expanding, and information is published, spread, and retrieved online ever more frequently. Text, one of the main carriers of Internet information, is not only growing in volume at an increasing rate but also comes from increasingly diverse sources and in increasingly diverse forms of expression. As a result, it is becoming harder for users to find text that meets their specific needs, and it is becoming more important to analyze and process text with computer technology and other automated means. How to efficiently manage, analyze, and understand massive amounts of natural language text is an important problem for many computer scientists and linguists.

Within natural language processing, the readability assessment of text is an important branch. Its core issue is to establish the relationship between text features and readability categories (or scores), which involves two important steps: text representation and model learning. Text representation transforms text from natural language into another form of data that a model can handle more easily and learn rules from; we also call this representation a feature. Model learning is the process of learning parameters from supervised readability knowledge, which establishes the relationship between text features and readability categories. With the learned model, the readability of unseen text can be predicted. Because different classification models have their own preferences and different kinds of text have their own characteristics, the importance and flexibility of text representation, as an intermediate transformation, are particularly prominent.

Over the past decade, research on text representation has constituted a significant part of the research on readability assessment. This work has mainly focused on two directions: feature set expansion and target language migration. In feature set expansion, more and more features based on new technologies and theories have been designed to supply more information for learning more accurate models. In target language migration, researchers have gradually begun to explore the readability of text in languages other than English, such as German and French, which effectively extends the application scope of readability assessment. These studies are valuable, but they are limited to traditional inductive learning settings and to a few languages. To obtain broader applicability and more accurate and efficient assessment, these restrictions may need to be relaxed, for example by setting Chinese as the target language, adopting transductive learning settings, or using a deep learning framework. Such relaxation brings benefits, but it also poses challenges for text representation technology. We therefore study text representation technology for readability assessment in more depth under different scenarios, from the following three perspectives.

(1) For the readability assessment of Chinese documents, we study hand-crafted feature extraction based on the characteristics of the language. Most existing readability assessment methods are designed for English. Studies on other languages, such as German and French, have also attracted attention in recent years, but readability assessment for Chinese documents has rarely been studied. We therefore propose a readability assessment method for Chinese documents based on hand-crafted feature extraction. In designing readability features for Chinese text, we consider two aspects. On the one hand, we draw on experience from other languages and transfer the language-independent features. On the other hand, we redesign the features that depend on characteristics specific to Chinese, such as word segmentation, words, and strokes. In total, we design five sets of features, which measure readability in terms of vocabulary, part of speech, grammar, information, and other aspects. Based on these features, we propose an ordinal multi-class classification framework to classify text readability. Experimental results show that our features are useful for improving the assessment results and that the proposed classification framework can effectively utilize the extracted features.

(2) For readability assessment under the transductive learning setting, we propose a feature transformation method based on word coupling. Existing readability assessment methods usually build inductive classification models to evaluate the readability of a document, which has proved to be very effective. However, the inductive approach does not make use of the readability relationships among documents, which are also helpful for accurate assessment. To make use of these relationships, we need to adopt a transductive classification method and model the relationships among documents with respect to readability. In our experiments, however, we found that using traditional features directly cannot effectively model these relationships, which makes further processing of the text representation necessary. We therefore propose a readability assessment method based on feature transformation. We modify the basic bag-of-words model with word coupling so that it is adapted to readability assessment. Based on this modified text representation and existing feature engineering, we propose a two-view graph propagation algorithm that uses the improved bag-of-words model and the hand-crafted features simultaneously. Experimental results on two datasets, one Chinese and one English, demonstrate the effectiveness of the proposed method.

(3) For readability assessment under the representation learning framework, we propose an automatic feature learning method based on domain knowledge. Most existing readability assessment methods rely on hand-crafted feature engineering, which is important but time-consuming. A better way is to learn the text representation automatically from data, which is known as representation learning. We therefore propose an end-to-end readability assessment method based on representation learning. By combining text information and domain knowledge, we extend the existing word embedding model and design a knowledge-enriched word embedding learning model for readability assessment. Based on the knowledge-enriched word embeddings, we further propose two word-embedding-based readability assessment methods, which construct the representation of a document from the representations of its words and then assess its readability from that representation. Experiments conducted on four datasets in two languages demonstrate the effectiveness of our method.
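The text representation step described in the abstract can be illustrated with a minimal sketch. The three surface features below (average sentence length, average word length, type-token ratio) and the function name `surface_features` are assumptions chosen for illustration on English text. They are not the five feature sets designed in the dissertation, which the abstract does not enumerate.

```python
import re
from typing import Dict


def surface_features(text: str) -> Dict[str, float]:
    """Map raw English text to a few illustrative surface features.

    Hypothetical example only: these are generic readability cues,
    not the feature sets proposed in the dissertation.
    """
    # Split into sentences on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[A-Za-z']+", text.lower())

    if not sentences or not words:
        return {
            "avg_sentence_len": 0.0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
        }

    return {
        # Mean number of words per sentence.
        "avg_sentence_len": len(words) / len(sentences),
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words (vocabulary diversity).
        "type_token_ratio": len(set(words)) / len(words),
    }


if __name__ == "__main__":
    print(surface_features("The cat sat. The dog ran fast."))
```

A feature dictionary of this kind could then be passed to any supervised classifier in the model learning step. For Chinese text, a word segmentation step would be needed before such features could be computed.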
Keywords/Search Tags: Readability Assessment, Text Representation, Feature Extraction, Feature Transformation, Representation Learning