Speech is at once the most natural way of communicating with others and a vital survival skill. Automatic speech recognition (ASR) is a common speech technology in daily life that can recognize a speaker's intention, facilitating natural and efficient human-computer interaction (HCI). Silent speech recognition (SSR) based on surface electromyography (sEMG) addresses the limitations of ASR in certain scenarios: it records muscular activities related to vocalization by collecting electrophysiological signals from the skin surface with non-invasive electrodes. During vocalization, the sEMG of the articulatory muscles, i.e., the facial and laryngeal muscles, is the direct response to the body's vocal nerve commands, so analyzing sEMG signals of the articulatory muscles helps reveal the speaker's speech intent. However, current SSR technology remains insufficient for describing and decoding the timing information of sEMG-based silent speech. Existing sEMG-based continuous speech recognition methods rely on sEMG data recorded with acoustic feedback and annotated with phoneme/character labels, which limits the development of continuous speech decoding techniques for SSR. Extracting discriminative feature representations from sEMG and investigating classification and decoding methods that do not rely on labeled data are therefore key issues in building robust and practical SSR systems.

In this study, a flexible high-density (HD) electrode array was applied to simultaneously record multichannel sEMG signals from a number of target muscles or muscle groups over relatively large areas, capturing rich spatial information about muscle activity. To achieve high SSR performance, we proposed a series of methods for classifying and decoding sEMG signals based on deep neural networks with adaptive nonlinear learning capabilities. The main work of this dissertation is summarized as follows:

(1) Research on silent speech classification based on spatio-temporal information fusion. Because existing classification methods fail to effectively exploit the spatial information between HD-sEMG channels, a hybrid neural network recognition framework based on spatio-temporal information characterization was proposed, which deeply integrates spatial and temporal information to improve SSR performance. Specifically, multi-channel features were extracted from the time, frequency, and spatial domains, exploiting the abundant muscle activity information provided by HD-sEMG. In addition, a hybrid deep learning network combining convolutional and long short-term memory (LSTM) neural networks was designed to accurately characterize the spatio-temporal information of muscle activity captured by HD-sEMG, achieving phrase-level pattern classification. To cope with interference from anomalous muscle activity patterns, we exploited the ability of an autoencoder (AE) to depict target motion patterns at a fine-grained level and used it to detect various types of anomaly patterns, providing an efficient and robust SSR implementation in the presence of anomaly pattern interference. To verify the effectiveness of the proposed method for target pattern recognition and anomaly pattern detection, experimental data were recorded with HD-sEMG arrays from 11 subjects subvocalizing 33 Chinese phrases and articulating 9 anomaly patterns. The proposed method achieved the highest anomaly detection rate while maintaining a high level of target pattern classification accuracy.
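To make the spatio-temporal fusion idea in (1) concrete, the following is a minimal PyTorch sketch of one possible hybrid convolutional-LSTM classifier over HD-sEMG feature frames. It is an illustration under stated assumptions, not the dissertation's actual architecture: the class name HybridCNNLSTM, the 8 x 8 electrode grid, the layer sizes, and the input format are all assumptions; only the number of target phrases (33) comes from the text above.

```python
# Minimal sketch (not the dissertation's code): per-frame 2-D convolutions capture
# the spatial layout of the HD-sEMG electrode grid, an LSTM then models the
# temporal evolution, and a linear head performs phrase-level classification.
import torch
import torch.nn as nn


class HybridCNNLSTM(nn.Module):
    """Hypothetical CNN+LSTM classifier for HD-sEMG feature frames.

    Expects input of shape (batch, time, rows, cols), where each time step is a
    feature image laid out on the electrode grid (an assumed data format).
    """

    def __init__(self, grid_rows=8, grid_cols=8, n_classes=33, hidden=128):
        super().__init__()
        self.spatial = nn.Sequential(                 # per-frame spatial encoder
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((2, 2)),
        )
        self.temporal = nn.LSTM(32 * 2 * 2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                             # x: (B, T, R, C)
        b, t, r, c = x.shape
        f = self.spatial(x.reshape(b * t, 1, r, c))   # (B*T, 32, 2, 2)
        f = f.reshape(b, t, -1)                       # (B, T, 128)
        _, (h, _) = self.temporal(f)                  # last hidden state over time
        return self.head(h[-1])                       # (B, n_classes)


# Usage with random stand-in data: 4 samples, 50 frames on an 8 x 8 grid.
logits = HybridCNNLSTM()(torch.randn(4, 50, 8, 8))
print(logits.shape)                                   # torch.Size([4, 33])
```

The design point the sketch is meant to convey is the division of labor: the convolutional stage sees only one frame at a time and encodes inter-channel (spatial) structure, while the recurrent stage integrates those per-frame codes across the utterance.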
(2) Research on silent speech decoding based on connectionist temporal classification (CTC). To address the difficulty of annotating character/syllable-level labels and decoding timing information in silent speech sequences, an adaptive label alignment method based on the CTC algorithm was proposed. The method uses a deep learning network that integrates spatio-temporal information to characterize the features. By calibrating the labels of the input sequence at different frames, syllable-level labels are mapped to the corresponding sEMG, establishing an association between sEMG and speech intention; an end-to-end SSR system is then obtained with a CTC decoder and a language model (LM). Because CTC-based decoding builds on a frame-wise characterization of temporal information, it more readily obtains fine-grained representations of the target patterns, and anomaly patterns can be detected when the CTC decoder is combined with the AE module. To verify the effectiveness of the proposed method and compare it with the classification-based SSR method with anomaly pattern detection, the same dataset as in the preceding work was adopted. The experimental results showed that, compared with the classification-based SSR method, the CTC decoding-based method achieved better results in both target pattern classification accuracy and anomaly pattern detection rate.

(3) Research on silent speech decoding based on an encoder-decoder framework. To address the problem of efficiently encoding and decoding syllable/character-level information in continuous sentences, an SSR method based on an encoder-decoder framework was proposed (a minimal illustrative sketch follows this summary). The method explores contextual semantic relevance within the global information and decodes the output step by step. The encoder transforms the input sequence into a feature representation, and the attention mechanism focuses on learning global information; the decoder then generates the target sequence word by word using the global information learned by the attention mechanism, achieving syllable-level SSR and improving the accuracy and practicality of SSR. On the basis of the Transformer, a conv-Transformer model with spatial information characterization was designed and obtained better recognition results. The effectiveness of the proposed method was verified with HD-sEMG data from fifteen subjects subvocalizing 33 Chinese phrases comprising 82 syllables. The proposed method outperformed the benchmark methods, achieving the highest phrase classification accuracy. This study provides a continuous and efficient silent speech decoding approach for sEMG-based SSR.
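As an illustration of the encoder-decoder approach in (3), the sketch below shows one way a conv-Transformer for syllable-level SSR could be organized in PyTorch: a convolutional front-end characterizes the spatial layout of each HD-sEMG frame, and a Transformer encoder-decoder attends over the global context to generate the syllable sequence token by token. The class name, grid size, layer dimensions, special-token handling, and the omission of positional encodings and padding masks are all simplifying assumptions; only the syllable vocabulary size (82) and the general conv-Transformer structure come from the text above.

```python
# Minimal sketch (assumptions throughout): Conv2d front-end for spatial
# characterization of each HD-sEMG frame, Transformer encoder for global
# context, Transformer decoder for token-by-token syllable generation.
import torch
import torch.nn as nn


class ConvTransformerSSR(nn.Module):
    """Hypothetical conv-Transformer encoder-decoder for syllable-level SSR."""

    def __init__(self, grid_rows=8, grid_cols=8, n_syllables=82, d_model=128):
        super().__init__()
        self.frame_conv = nn.Sequential(               # per-frame spatial encoder
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((2, 2)),
            nn.Flatten(),                               # -> 32 * 2 * 2 features
        )
        self.proj = nn.Linear(32 * 2 * 2, d_model)      # map frame code to d_model
        self.embed = nn.Embedding(n_syllables + 2, d_model)  # + BOS/EOS (assumed)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, n_syllables + 2)

    def forward(self, frames, tokens):
        # frames: (B, T, R, C) HD-sEMG feature frames; tokens: (B, L) syllable ids
        b, t, r, c = frames.shape
        mem = self.proj(self.frame_conv(frames.reshape(b * t, 1, r, c)).reshape(b, t, -1))
        tgt = self.embed(tokens)
        mask = self.transformer.generate_square_subsequent_mask(tokens.size(1))
        dec = self.transformer(mem, tgt, tgt_mask=mask)  # attention over global context
        return self.out(dec)                             # (B, L, vocab)


# Usage with random stand-in data: 4 utterances, 50 frames, 6 target syllables.
model = ConvTransformerSSR()
logits = model(torch.randn(4, 50, 8, 8), torch.randint(0, 84, (4, 6)))
print(logits.shape)                                      # torch.Size([4, 6, 84])
```

The causal target mask keeps the decoder autoregressive, matching the word-by-word generation described above, while the encoder's self-attention is what gives the model access to global information across the whole utterance.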