The continuing development of ideology and culture has produced a large body of literature, which poses new challenges for the application of natural language processing technology. We collect a large number of English novels and construct a corpus of literary text. A neural network is used to represent fictional characters, and the learned character embeddings are used for character clustering and classification. This thesis applies natural language processing technology to the literary field, which will promote research in both natural language processing and literature. The specific research contents are as follows:

(1) Construction of the literary corpus. More than 20,000 English novels are collected from Project Gutenberg and preprocessed with tokenization, part-of-speech tagging, named entity recognition, dependency parsing, name clustering, and coreference resolution. In total, we extract over 400,000 characters, together with all feature words that stand in certain dependency relations to those characters, and carry out a statistical analysis of these feature words.

(2) Dependency-feature-based distributed character representation and its analysis. Analogous to word embedding training, we use the skip-gram model to train character embeddings and dependency word embeddings simultaneously. Character similarity is then computed from the learned character embeddings. Experimental results show that our method outperforms topic models on three of the four hypothesis classes in the test set. In addition, nearest-neighbor analysis shows that characters from different novels by the same author tend to be similar to each other.

(3) Applications of the character embeddings. The learned character embeddings are applied to character clustering and classification. Under the assumption that characters in one cluster belong to the same author, the k-means clustering model achieves a purity of 0.724. We build a dataset for personality and gender classification and propose a multilayer perceptron model on top of the learned character embeddings; it achieves better performance on both tasks than a model based on average pooling of word embeddings.
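To make the preprocessing in (1) concrete, the sketch below runs tokenization, part-of-speech tagging, named entity recognition, and dependency parsing on one sentence. spaCy is a stand-in, since the abstract does not name the tools actually used, and the name clustering and coreference resolution steps are omitted for brevity.

```python
# A minimal sketch of the preprocessing pipeline, with spaCy as an assumed
# stand-in for the unnamed tools; name clustering and coreference resolution
# are not shown.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elizabeth walked to Pemberley and admired Mr. Darcy's estate.")

for tok in doc:
    # token, POS tag, dependency relation, and syntactic head
    print(tok.text, tok.pos_, tok.dep_, tok.head.text)

for ent in doc.ents:
    # named entities, from which character mentions could be drawn
    print(ent.text, ent.label_)
```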
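For the joint training in (2), the following is a minimal skip-gram sketch with negative sampling in PyTorch, where each training pair couples a character with one of its dependency feature words. The pairing scheme, vocabulary sizes, and hyperparameters are assumptions for illustration, not the thesis configuration; once trained, character similarity can be read off as cosine similarity between rows of the character embedding table.

```python
# A minimal skip-gram-with-negative-sampling sketch, assuming (character,
# dependency feature word) training pairs; sizes and hyperparameters are
# placeholders, not the thesis settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharacterSkipGram(nn.Module):
    def __init__(self, n_characters, n_features, dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_characters, dim)  # character vectors
        self.feat_emb = nn.Embedding(n_features, dim)    # dependency word vectors

    def forward(self, chars, feats, neg_feats):
        c = self.char_emb(chars)                         # (batch, dim)
        f = self.feat_emb(feats)                         # (batch, dim)
        n = self.feat_emb(neg_feats)                     # (batch, k, dim)
        pos = F.logsigmoid((c * f).sum(-1))              # pull true pairs together
        neg = F.logsigmoid(-torch.bmm(n, c.unsqueeze(2)).squeeze(2)).sum(-1)
        return -(pos + neg).mean()                       # SGNS loss

# Toy batch; the real character vocabulary would cover 400,000+ entries.
model = CharacterSkipGram(n_characters=1000, n_features=5000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chars = torch.randint(0, 1000, (64,))
feats = torch.randint(0, 5000, (64,))
negs = torch.randint(0, 5000, (64, 5))                   # 5 negatives per pair
loss = model(chars, feats, negs)
loss.backward()
opt.step()
```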
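The clustering evaluation in (3) can be sketched as follows, assuming a matrix of character embeddings and one author label per character. Purity is the fraction of characters that fall in the majority-author class of their cluster; the cluster count and synthetic data below are placeholders.

```python
# A minimal sketch of k-means clustering with purity evaluation, on synthetic
# stand-ins for the learned character embeddings and author labels.
import numpy as np
from sklearn.cluster import KMeans

def purity(cluster_ids, author_ids):
    total = 0
    for c in np.unique(cluster_ids):
        labels = author_ids[cluster_ids == c]
        total += np.bincount(labels).max()   # size of the majority author
    return total / len(author_ids)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))              # stand-in character embeddings
authors = rng.integers(0, 20, size=500)      # stand-in author labels
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)
print("purity:", purity(clusters, authors))
```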
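Finally, a minimal sketch of the classification setup in (3), using scikit-learn's MLPClassifier on top of character embeddings with a binary gender label. The hidden-layer size and the synthetic data are illustrative assumptions, and the personality label set is not specified in the abstract.

```python
# A minimal multilayer perceptron sketch for gender classification from
# character embeddings; data and architecture are placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 100))             # stand-in character embeddings
y = rng.integers(0, 2, size=2000)            # stand-in binary gender labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```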