With the breakthroughs in artificial intelligence in recent years, AI algorithms have been applied to many fields of work. Audio-processing technologies in particular, such as audio state recognition, speech recognition, and speech synthesis, have quietly entered people's daily lives and have begun to relieve practitioners in many fields of heavy, repetitive labor. Research on deep-learning-based AI technology is still being updated and iterated. In the communication domain, language technologies such as automatic speech recognition, combined with outbound calling systems, can help users such as banks, futures companies, and securities companies complete large volumes of repetitive telephone tasks. On the one hand, this spares human agents a large number of ineffective dials and lets them focus on calls that genuinely require manual handling; on the other hand, it can be combined with AI techniques from other fields to analyze and process call data so that the data yields greater value. How to better optimize audio state recognition, speech recognition, and their supporting algorithms therefore remains a valuable research direction. This thesis studies the voice service system of an intelligent outbound call system, and its main work consists of the following tasks:

1. For audio state recognition, a network framework based on lightweight convolutional neural networks is proposed. Combining the characteristics of asymmetric convolution with those of audio signals in state recognition tasks, it optimizes the parameters of the convolutional network and addresses the insufficient feature extraction of traditional convolutional pyramid structures. On this basis, a convolutional feature extraction framework, the Time Frequency Separation Convolutional Framework (SFX), is constructed, and several convolutional attention mechanisms are introduced on top of it to obtain the Sequence Frequency Separation Probability Net (SFP Net), the Sequence Frequency Separated Convolutional Multiscale Attention Network (SFM Net), and the Sequence Frequency Separated Convolutional Attention Channel (SFAC) models, whose performance is compared on multiple datasets (an illustrative sketch of the time-frequency separation idea is given after item 2). On datasets closer to practical scenarios, the SFAC network achieves better recognition performance while maintaining lower complexity.

2. For the speech recognition model, the traditional CNN-CTC end-to-end acoustic model is improved. A temporal convolution with better parallelism is added as the sequence-information extraction module, strengthening the encoder's ability to capture the correlation between the input sequence and its context. The residual connections in the convolutional block are replaced with dense connections so that gradients flow more easily back to the shallow layers, improving feature extraction, and an attention layer is added to further enhance it. On this basis, an end-to-end speech recognition model built from multi-scale convolutional blocks is constructed. Test results on public datasets show that the proposed model achieves better recognition accuracy.
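To make the time-frequency separation in item 1 concrete, the following is a minimal PyTorch sketch of an asymmetric, time-frequency separated convolution block. The kernel sizes, channel split, and fusion by concatenation are assumptions for illustration only and do not reproduce the exact SFX configuration described in the thesis.

```python
# Minimal sketch of a time-frequency separated convolution block,
# assuming input spectrogram features shaped (batch, channels, time, freq).
# Kernel sizes, channel counts, and concatenation-based fusion are
# illustrative assumptions, not the thesis's actual configuration.
import torch
import torch.nn as nn

class TimeFreqSeparatedConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Asymmetric kernel that convolves along the time axis only.
        self.time_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=(k, 1), padding=(k // 2, 0)),
            nn.BatchNorm2d(out_ch // 2),
            nn.ReLU(inplace=True),
        )
        # Asymmetric kernel that convolves along the frequency axis only.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=(1, k), padding=(0, k // 2)),
            nn.BatchNorm2d(out_ch // 2),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Extract temporal and spectral patterns in separate branches,
        # then fuse them along the channel dimension.
        return torch.cat([self.time_conv(x), self.freq_conv(x)], dim=1)

# Example: a batch of 8 log-mel spectrograms, 1 channel, 100 frames, 64 mel bins.
x = torch.randn(8, 1, 100, 64)
y = TimeFreqSeparatedConv(1, 32)(x)  # -> shape (8, 32, 100, 64)
```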
3. The deep-learning-based models for speech recognition, speech synthesis, and audio state recognition are deployed as services. The internal processing flow of these algorithm services and their communication flow based on the WebSocket protocol are designed, forming a complete voice service pipeline in the outbound system: from voice data to text and state, and from text back to voice data. Externally, users can access the services for their work through the FreeSWITCH platform's interface, or call the interface directly through the web server.
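To make the WebSocket-based communication flow in item 3 concrete, the following is a minimal sketch of a client streaming audio to a deployed recognition service and reading back the result. The endpoint URL, message framing, and JSON field names are hypothetical assumptions for illustration, not the system's actual interface.

```python
# Minimal WebSocket client sketch for a voice service, assuming the service
# accepts binary audio frames and replies with a JSON recognition result.
# The endpoint, end-of-stream message, and reply fields are hypothetical.
import asyncio
import json
import websockets

async def recognize(pcm_chunks):
    uri = "ws://localhost:8000/asr"  # hypothetical ASR service endpoint
    async with websockets.connect(uri) as ws:
        for chunk in pcm_chunks:
            await ws.send(chunk)                      # stream raw PCM frames
        await ws.send(json.dumps({"event": "end"}))   # signal end of audio
        reply = await ws.recv()                       # e.g. {"text": ..., "state": ...}
        return json.loads(reply)

if __name__ == "__main__":
    # Example usage with dummy 16-bit PCM silence frames.
    frames = [b"\x00\x00" * 160 for _ in range(10)]
    print(asyncio.run(recognize(frames)))
```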