Punctuation proofreading is an important part of Chinese text proofreading, because misused punctuation marks reduce the readability of a text. The use of punctuation marks is closely tied to semantics, and traditional machine learning methods have difficulty learning semantic information, so they perform poorly on the punctuation proofreading task. Deep learning, by contrast, can exploit contextual information to a great extent. It has been widely used in Natural Language Processing in recent years and has made great progress on problems such as speech recognition and text classification. This dissertation converts the punctuation proofreading problem into a classification problem and studies it with deep learning methods. Its main work consists of the following two aspects:

(1) An LSTM-CNN punctuation classification model is proposed. The model consists of a multi-layer LSTM and three parallel CNNs with different filters. The multi-layer LSTM can extract higher-level semantic information. Because a single CNN layer uses only one convolution kernel size, three CNNs with different filters are connected in parallel after the last LSTM layer in order to obtain more textual features. Since the model has many hyperparameters, the optimal values of several important ones are determined through multiple sets of comparative experiments. In addition, KNN, SVM, and a Naive Bayes classifier are used in comparison experiments to verify the effectiveness of the proposed LSTM-CNN model. Experimental results show that the LSTM-CNN model outperforms these traditional machine learning methods.

(2) An attention-based LSTM-CNN punctuation classification model is proposed. An improved attention mechanism is added between the last LSTM layer and the CNN layer, so that the LSTM outputs at different time steps receive different attention weights. The mechanism allocates more attention to the important words in a sentence, which is more conducive to the classification of punctuation marks. The optimal value of attention-size is determined through a series of comparison experiments. The experimental results show that the attention-based LSTM-CNN model outperforms both the LSTM-CNN model and the traditional machine learning algorithms. To examine how the sentence after a punctuation mark affects punctuation classification, the two sentences surrounding the punctuation mark are used as the input to the attention-based LSTM-CNN model. Experimental results indicate that the sentence after the punctuation mark has a great influence on the performance of the model.
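
As a rough illustration of the data flow described above (LSTM outputs, then attention weights per time step, then three parallel convolutions), the following NumPy sketch may help. It is not the dissertation's implementation: the abstract does not specify the improved attention mechanism, so a common additive (tanh) scoring form is assumed here. The random matrix `H` standing in for the LSTM outputs, the kernel sizes 2, 3 and 4, the filter count, the ReLU and max-over-time pooling, and all function names are likewise assumptions made only for this example.

```python
import numpy as np

def attention_weights(H, W, b, v):
    """Softmax attention over time steps (assumed additive scoring).
    H: (T, d) outputs of the last LSTM layer; W: (d, attention_size)."""
    u = np.tanh(H @ W + b)            # (T, attention_size)
    scores = u @ v                    # one scalar score per time step
    scores = scores - scores.max()    # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()                # weights are positive and sum to 1

def conv1d_max_pool(X, K):
    """Valid 1-D convolution over time, ReLU, then max-over-time pooling.
    X: (T, d); K: (k, d, n_filters). Returns a (n_filters,) vector."""
    k, T = K.shape[0], X.shape[0]
    windows = np.stack([X[t:t + k] for t in range(T - k + 1)])  # (T-k+1, k, d)
    feature_maps = np.einsum("tkd,kdf->tf", windows, K)         # (T-k+1, n_filters)
    return np.maximum(feature_maps, 0.0).max(axis=0)

def attention_cnn_features(H, attn_params, kernels):
    """Reweight each time step by its attention weight, then run the
    parallel convolutions and concatenate their pooled features."""
    alpha = attention_weights(H, *attn_params)
    weighted = H * alpha[:, None]     # sequence is kept, only rescaled
    pooled = [conv1d_max_pool(weighted, K) for K in kernels]
    return np.concatenate(pooled), alpha

# Hypothetical sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
T, d, attention_size, n_filters = 12, 8, 16, 4
H = rng.normal(size=(T, d))           # stand-in for real LSTM outputs
attn_params = (
    rng.normal(size=(d, attention_size)),
    np.zeros(attention_size),
    rng.normal(size=attention_size),
)
kernels = [rng.normal(size=(k, d, n_filters)) for k in (2, 3, 4)]

features, alpha = attention_cnn_features(H, attn_params, kernels)
print(features.shape)                 # (12,) = 3 kernel sizes x 4 filters
```

In a full model the concatenated `features` vector would feed a classifier over the punctuation classes, and `attention_size` here plays the role of the attention-size hyperparameter that the abstract says was tuned experimentally.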