A Study of Author Recognition in Modern Chinese Novels

Posted on: 2019-11-17
Degree: Master
Type: Thesis
Country: China
Candidate: L Xiao
Full Text: PDF
GTID: 2405330596961606
Subject: Computer technology
Abstract/Summary:
Author recognition in modern Chinese fiction is essentially a text classification problem: authors are classified by writing style so that works of unknown authorship can be attributed to them. The author takes modern fiction of the May 4th Movement as the research object, because these works span a short period and their authors' writing styles are similar. By classifying works of such similar style, the study aims to obtain the best text classification results available in natural language processing. The main stages of text classification are data acquisition, text preprocessing, feature extraction, model design and implementation, model application, and prediction.

The author's work is as follows. First, data is crawled from the web. Python is chosen as the project language and the Scrapy framework is used for crawling. The author selected 164 articles by 7 authors and divided them into a training set and a test set at a ratio of 8:2. Text preprocessing includes data cleaning, word segmentation, and text representation. The crawled data contains many HTML fragments and blank lines, which are removed first. Chinese word segmentation is then carried out with the Jieba segmentation tool. The segmented text contains 160,000 words of unstructured natural language. For a machine to process it, the text must be converted into a structured, machine-readable form; this step is text representation. The author uses the bag-of-words model to map every word to a machine-readable number, such as 139863 or 58421.

To improve classification, the author uses TF-IDF to extract features. Because the classification is based on writing style, the TF term in the TF-IDF formula receives additional processing: high-frequency words are removed in the TF-IDF code. Words such as "ah" and "ba", which appear often in the articles, are removed before feature extraction. The author further argues that some high-frequency words, such as names of people and places, do not reflect an author's writing style, yet they carry considerable weight in machine learning. Removing these words after feature extraction does not harm author recognition. Repeated tests show that at a TF-IDF value of 0.09, most of the words that appear are personal names, place names, and the like. Removing these high-frequency words can improve the accuracy of author recognition.

Commonly used machine learning algorithms include Naïve Bayes, Logistic Regression, Support Vector Machine, K-Nearest Neighbor, Decision Tree, and Neural Network. The author builds a model with each of these and tunes its parameters to obtain the best result. Repeated tests show that Naïve Bayes, Support Vector Machine, and Neural Network can all reach 100% accuracy, and that the Neural Network has the best average prediction performance. The author therefore selects the Neural Network as the final model and adds the special TF-IDF processing for the final test. The evaluation criteria are the accuracy, recall, and F1-score on the training set and the accuracy on the test set. To guard against over-fitting and under-fitting, all of these indicators are considered together during testing.

This thesis makes two improvements. The first is to remove high-frequency words after TF-IDF feature extraction; repeated tests show that at a TF-IDF threshold of 0.09 most of the affected words are personal names and place names, and removing them can improve the accuracy of author recognition. The second is to quantify the similarity of authors' styles and derive a formula for calculating it, so that the similarity between authors can be computed from that formula. Quantified author similarity can serve as a reference when classifying by writing style.

All the work in this thesis was completed independently by the author under the guidance of his supervisor. Through this research, the author has taken a first step into artificial intelligence and plans to continue studying TF-IDF and neural networks. The author hopes that this thesis can contribute to Chinese text classification and that its classification results will be useful to researchers working on author recognition.
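The high-frequency word filtering described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual code: the abstract does not give the exact TF-IDF formula or the filtering rule, so the sketch assumes the common form tf × ln(N / df) and assumes a term is dropped when its weight reaches the threshold in any document. The function names `tfidf` and `drop_high_scoring_terms` and the toy documents are hypothetical; in the thesis the tokens would come from Jieba segmentation of the crawled texts.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed formula: (count / doc length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def drop_high_scoring_terms(scores, threshold):
    """Drop every term whose weight reaches `threshold` in any document."""
    flagged = {t for doc in scores for t, w in doc.items() if w >= threshold}
    filtered = [{t: w for t, w in doc.items() if t not in flagged}
                for doc in scores]
    return filtered, flagged


# Hypothetical pre-segmented documents; "Xiangzi" and "Cuicui" stand in
# for character names that recur within a single work.
docs = [
    ["ta", "shuo", "Xiangzi", "zou", "le", "Xiangzi", "xiao"],
    ["ta", "shuo", "Cuicui", "kan", "le", "Cuicui", "xiao"],
    ["ta", "zou", "le", "shuo", "kan", "le"],
]

scores = tfidf(docs)
# 0.09 is the value reported in the abstract; on this toy data it happens
# to separate the names from the other tokens.
filtered, flagged = drop_high_scoring_terms(scores, threshold=0.09)
print(sorted(flagged))  # ['Cuicui', 'Xiangzi']
```

On this toy input only the two name tokens exceed the threshold, which mirrors the abstract's observation that names and place names dominate at that value. The weights depend on corpus size and document length, so the same threshold would need to be re-tuned on any other corpus.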
Keywords/Search Tags:Author Recognition, TFIDF, High Frequency Words, Text Classification, Accuracy