Formulaic language is a multi-word unit that appears frequently as a whole, with its component words either contiguous or discontinuous, and it generally carries a clear meaning and function. Research on the recognition and classification of formulaic language helps improve the standardization of text expression, the accuracy of semantic mining, the fidelity of machine translation, and the logical coherence of intelligent question answering. However, traditional research relies mainly on linguists to identify and classify formulaic language by hand, which is costly and inefficient. In recent years, some researchers have proposed methods for the automatic recognition and classification of formulaic language based on statistical machine learning, but these methods often fail to strike a balance between efficiency and accuracy. To address the high cost of manual recognition and the poor performance of automatic classification, this article proposes a deep-learning-based method for the recognition and classification of formulaic language. The main contributions and innovations are as follows:

(1) Existing methods lack a coarse corpus-screening step, so the pool of candidate samples for formulaic language recognition is large and heterogeneous, and recognition is inefficient. This paper therefore proposes a method for predicting which sentences contain formulaic language based on multi-feature fusion. The method first constructs a classification model that decides whether a sentence contains formulaic language: the model combines the semantic and part-of-speech features of each sentence by late fusion and predicts the probability that the input sample contains formulaic language. Sentences whose probability exceeds a threshold are then retained as samples for subsequent formulaic language recognition, so this initial screening reduces the sample size and improves recognition efficiency. Experiments on the academic phrase library and collections of papers show that the method filters the coarse corpus effectively, laying a foundation for subsequent research on formulaic language recognition.

(2) Incomplete feature extraction in existing methods leads to low accuracy in formulaic language recognition. This paper therefore proposes a formulaic language recognition method based on a GCN that fuses association information. Because the words that make up a formulaic expression co-occur frequently and are strongly correlated, the method constructs each sentence as a graph: the words of the sentence are nodes, late-fused part-of-speech and semantic features serve as the basic node features, and the edges connecting nodes are determined from the pointwise mutual information between words and their dependency-syntax relations. A graph convolutional network then extracts the association information between words. Finally, the extracted feature information is fed into a conditional random field for decoding, which assigns each word a label category so that formulaic language can be recognized. Experimental results show that the F1 score of this method reaches 83.5%, significantly higher than existing methods, verifying its effectiveness in recognizing formulaic language.

(3) A single classifier struggles to exceed its own performance ceiling, so existing methods classify poorly. This paper therefore proposes a formulaic language classification method based on Bi-LSTM and Stacking. The method uses GloVe and a Bi-LSTM to extract features from text and introduces the Stacking ensemble learning algorithm: using the Pearson correlation coefficient, it selects Logistic Regression, Random Forest, Multilayer Perceptron, and K-Nearest Neighbors as weakly correlated base classifiers, with Random Forest as the meta-classifier. Finally, the model's performance is evaluated on the meta-classifier's predictions. Comparative experiments show that the Precision, Recall, and F1 score of our method are 1.36%, 2.56%, and 2.64% higher, respectively, than those of a Bagging ensemble learning model. This verifies that the method can integrate the classification results of multiple single classifiers and further improve classification performance.
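The abstract gives no implementation details for the late-fusion screening step in contribution (1). As a minimal illustrative sketch, the following assumes each sentence is already represented by a semantic feature vector and a part-of-speech feature vector, and that each view has its own (hypothetical) linear scorer; the two views' probabilities are fused late by averaging and compared against the retention threshold:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def late_fusion_filter(sem_feats, pos_feats, w_sem, w_pos, threshold=0.5):
    """Score each sentence with two independent views, then fuse late.

    sem_feats, pos_feats: (n_sentences, d) feature matrices per view.
    w_sem, w_pos: per-view linear weights (stand-ins for trained models).
    Returns fused probabilities and a boolean mask of retained sentences.
    """
    p_sem = sigmoid(sem_feats @ w_sem)   # probability from the semantic view
    p_pos = sigmoid(pos_feats @ w_pos)   # probability from the POS view
    p = 0.5 * (p_sem + p_pos)            # late fusion: average the view probabilities
    keep = p >= threshold                # retain likely formulaic-language sentences
    return p, keep
```

The weights and the averaging rule here are placeholders; the paper's actual model could weight the views unequally or fuse with a learned layer.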
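Contribution (2) builds graph edges from the pointwise mutual information (PMI) between words. The abstract does not specify the counting scheme, so the sketch below assumes a common convention: co-occurrence is counted over a sliding window, and an edge is kept when PMI exceeds a threshold (the dependency-syntax edges mentioned in the text are omitted here):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(sentences, window=2, threshold=0.0):
    """Connect word pairs whose pointwise mutual information exceeds
    a threshold, estimated from sliding-window co-occurrence counts."""
    word_count = Counter()   # windows containing each word
    pair_count = Counter()   # windows containing each word pair
    total_windows = 0
    for sent in sentences:
        for i in range(max(1, len(sent) - window + 1)):
            win = sent[i:i + window]
            total_windows += 1
            for w in set(win):
                word_count[w] += 1
            for a, b in combinations(sorted(set(win)), 2):
                pair_count[(a, b)] += 1
    edges = {}
    for (a, b), n_ab in pair_count.items():
        p_ab = n_ab / total_windows
        p_a = word_count[a] / total_windows
        p_b = word_count[b] / total_windows
        pmi = math.log(p_ab / (p_a * p_b))  # PMI = log p(a,b) / (p(a) p(b))
        if pmi > threshold:
            edges[(a, b)] = pmi
    return edges
```

Pairs that co-occur more often than chance predicts get positive PMI and become edges, which matches the abstract's observation that the words of a formulaic expression co-occur with high frequency.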
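Contribution (3) selects base classifiers with low pairwise Pearson correlation before stacking them. The abstract does not say how the selection is performed, so this sketch assumes one plausible scheme: compute the correlation between each candidate classifier's held-out prediction vectors and greedily keep only classifiers that correlate weakly with those already chosen:

```python
import numpy as np

def select_low_correlation(preds, max_corr=0.9):
    """Greedily keep classifiers whose held-out predictions correlate
    below max_corr (in absolute value) with every classifier kept so far.

    preds: dict name -> 1-D array of predicted probabilities on a
    shared held-out set (the names and threshold are illustrative).
    """
    names = list(preds)
    # Pairwise Pearson correlations between prediction vectors.
    mat = np.corrcoef(np.stack([preds[n] for n in names]))
    selected = []
    for i, name in enumerate(names):
        kept_idx = [names.index(s) for s in selected]
        if all(abs(mat[i, j]) < max_corr for j in kept_idx):
            selected.append(name)
    return selected
```

Diverse (weakly correlated) base classifiers give the Stacking meta-classifier complementary errors to combine, which is the rationale the abstract gives for the Pearson-based selection.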