With global integration, people from different language backgrounds have more and more opportunities to communicate with each other. As the core front-end module of multilingual intelligent speech processing systems, language identification is one of the research hotspots in intelligent speech processing. Practical applications often require real-time processing and judgment of speech, so the demand for short-utterance language identification is increasing. Identifying a language accurately from the limited information in a very short utterance is difficult, and this problem has attracted considerable research attention in recent years. At the same time, intelligent voice devices are gradually entering people's daily work and life, where dialects are used frequently, so devices that support dialect interaction are a likely future trend. Dialect identification has therefore become an important topic in language identification research. This paper studies short-utterance language identification and dialect identification. The main work is as follows:

(1) For the short-utterance language identification task, MHAtt-Dbxvector Net is proposed, a dual-branch x-vector network with multi-head self-attention that is based on the extended time-delay neural network. First, MFCC, PLP, and Fbank features are used as network inputs for training, and experiments show that the combination of MFCC and PLP performs best. Then, the pooling layer is replaced by a multi-head self-attention mechanism to increase the weight of effective features. In addition, to address the imbalance in the number of samples per class and the difficulty of classifying some samples, a class weight factor and a modulation factor are introduced to improve the loss function used to train the model. Finally, the network is used to extract x-vectors for language discrimination. The x-vectors contain both deep local features and global context features, which effectively improves short-utterance identification results. Experimental results show an equal error rate of 8.15% on the Oriental language dataset, and of 12.2% and 9.96% on the Chinese dialect dataset for utterances of at most 3 s and longer than 3 s, respectively.

(2) To address the low accuracy of dialect identification, SERes-BiGRU is proposed, a dialect identification network composed of SERes blocks and a bidirectional gated recurrent unit (BiGRU). First, a feature extraction module, the SERes block, is designed. It combines dilated convolution with ordinary convolution to enlarge the receptive field, and uses an SE block to recalibrate the weight of each channel in the feature map. Then, the dialect identification network is built from this module and a BiGRU. Finally, Additive Angular Margin Softmax (AAM-Softmax) is used in place of the traditional Softmax to train the network, which further enlarges the distance between classes and reduces the distance within classes. Experimental results show an equal error rate of 10.72% for utterances of at most 3 s and 9.90% for utterances longer than 3 s, an improvement in dialect identification over MHAtt-Dbxvector Net.

(3) Using the methods proposed in this paper, a language identification system is designed and implemented. It supports speech visualization, acoustic feature extraction, and language or dialect identification of speech samples. The system works both with existing local speech files and with newly recorded speech.
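
The two training objectives mentioned in (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the class weight factor and modulation factor act as in a focal-loss-style objective, and that AAM-Softmax adds a fixed margin to the angle between an embedding and its target class weight. The function names and the default values (gamma = 2.0, margin = 0.2, scale = 30.0) are illustrative assumptions, not values taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_modulated_loss(probs, target, class_weights, gamma=2.0):
    """Cross-entropy scaled by a class weight and a modulation factor.

    Assumed form: -w_t * (1 - p_t)^gamma * log(p_t).
    Well-classified samples (p_t near 1) are down-weighted, and
    class_weights can compensate for an unbalanced sample count.
    """
    p_t = probs[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def aam_softmax_loss(cosines, target, margin=0.2, scale=30.0):
    """Additive angular margin loss for one sample.

    cosines[k] is the cosine similarity between the embedding and the
    weight vector of class k. The margin is added to the target angle
    only. This simple version ignores the case theta + margin > pi,
    which practical implementations handle separately.
    """
    logits = []
    for k, cos_k in enumerate(cosines):
        if k == target:
            theta = math.acos(max(-1.0, min(1.0, cos_k)))
            logits.append(scale * math.cos(theta + margin))
        else:
            logits.append(scale * cos_k)
    return -math.log(softmax(logits)[target])


if __name__ == "__main__":
    probs = softmax([2.0, 0.5, -1.0])
    print(weighted_modulated_loss(probs, 0, [1.0, 2.0, 2.0]))
    print(aam_softmax_loss([0.8, 0.1, -0.3], 0))
```

With gamma = 0 and unit class weights the first function reduces to ordinary cross-entropy, and with margin = 0 the second reduces to a scaled cosine Softmax, which makes the effect of each added factor easy to compare against the baseline.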