With the development of home intelligence, a variety of smart devices have emerged. Traditional contact-based interaction struggles to meet users' needs for convenience, and non-contact forms of control such as voice have arisen in response, shifting device input from physical contact to spoken language and providing a relatively natural mode of control. However, whether contact or non-contact, most interaction methods rely on a single input modality; when that modality is degraded, the device's input becomes ambiguous and the device cannot respond correctly, harming the interactive experience. To address this problem, this paper applies multi-modal fusion to the human-computer interaction control of home devices, integrating speech and gesture recognition to design a multi-modal, non-contact human-computer interaction method. The natural human-computer interaction of a multi-modal smart home is studied from the following three aspects:

(1) To address the strong interference that complex backgrounds in home environments cause for gesture recognition, a dual-stream fusion framework combining skeleton and depth-map data is proposed for dynamic gesture recognition. First, a BiLSTM network extracts features from 2D skeleton information, while a CNN combined with a BiLSTM extracts features from 2D depth maps. Several fusion strategies are then studied: feature-level concatenation, low-rank multimodal fusion (LMF) via low-rank weight decomposition, and score-level maximum and mean fusion. The fused features carry richer gesture information, and the recognition result is obtained through a classification layer. Experiments show that dual-stream fusion improves recognition accuracy over either single modality.

(2) To address the low speech recognition accuracy caused by strong noise in home environments, an end-to-end speech recognition method based on fine-tuning the Deep Speech 2 model is proposed to recognize streaming speech. First, speech features are extracted with a linear-spectrogram preprocessing method; then a CNN and GRUs serve as the acoustic model, mapping speech features to phonemes. Finally, a pre-trained language model is invoked and the decoding results are optimized with a beam search algorithm. Results show improved recognition accuracy over traditional methods.

(3) To address the inflexibility of single-mode command sensing for household devices, a multi-modal human-computer interaction method is proposed. Speech signals and gesture motion sequences are first collected, and the corresponding trained model recognizes each modality. The gesture and speech recognition results are then combined by late fusion, with the two modalities' recognition scores weighted and fused. This gives the device multiple sensing channels and improves command recognition accuracy, so that it can respond correctly.

Building on the above work, a multi-modal-fusion human-computer interaction system is implemented with the Python web framework Flask; a voice acquisition device captures speech signals, and a depth camera collects skeleton and depth information as input to the models above. The interaction method studied in this paper targets smart-home scenarios, integrating the voice and gesture modes of non-contact interaction and using multiple modalities to compensate for the weaknesses of any single one, so that the device accepts multiple input modes, responds more accurately, and improves human-computer interaction for the smart home.
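The four fusion strategies studied in (1) can be sketched with toy NumPy shapes. All dimensions, random weights, and the per-stream classifier heads below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Assumed toy dimensions: 128-d features per stream, 10 gesture classes.
rng = np.random.default_rng(0)
f_skel = rng.standard_normal(128)   # BiLSTM features from the skeleton stream
f_depth = rng.standard_normal(128)  # CNN+BiLSTM features from the depth-map stream

# 1) Feature-level fusion: concatenate the streams, then classify.
W_cat = rng.standard_normal((10, 256))          # hypothetical classification layer
p_concat = softmax(W_cat @ np.concatenate([f_skel, f_depth]))

# 2) LMF: factorize the fusion weight tensor into r rank-1 factors and combine
#    modality projections by element-wise product (each feature padded with 1).
r = 4
W_s_lmf = rng.standard_normal((r, 10, 129))
W_d_lmf = rng.standard_normal((r, 10, 129))
z_s, z_d = np.append(f_skel, 1.0), np.append(f_depth, 1.0)
p_lmf = softmax(((W_s_lmf @ z_s) * (W_d_lmf @ z_d)).sum(axis=0))

# 3)-4) Score-level fusion: classify each stream with its own (hypothetical)
#    head, then take the element-wise maximum or the mean of the probabilities.
p_skel = softmax(rng.standard_normal((10, 128)) @ f_skel)
p_depth = softmax(rng.standard_normal((10, 128)) @ f_depth)
p_max = np.maximum(p_skel, p_depth)
p_mean = (p_skel + p_depth) / 2
pred = int(np.argmax(p_mean))   # fused class prediction
```

Note that the mean of two probability vectors is itself a valid distribution, whereas the element-wise maximum is not normalized and is typically used only for ranking classes.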
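The language-model rescoring step in (2) can be illustrated in simplified form: candidate transcripts from the acoustic model are rescored by adding a weighted language-model log-probability (and a word-count bonus), and the best-scoring candidate is kept. The candidates, scores, and weights below are made-up toy values; a real decoder would search prefixes incrementally with beam search rather than rescore whole strings:

```python
# Hypothetical decoding candidates with acoustic log-probs from the CTC output.
candidates = {
    "turn on the light": -4.2,
    "turn on the lite":  -4.0,
}
# Toy language-model log-probs; a real system would query a pre-trained LM.
lm_logp = {"turn on the light": -6.0, "turn on the lite": -11.0}

alpha, beta = 0.5, 0.1   # assumed LM weight and word-count bonus

def rescore(text, ac_logp):
    return ac_logp + alpha * lm_logp[text] + beta * len(text.split())

best = max(candidates, key=lambda t: rescore(t, candidates[t]))
# best == "turn on the light": the LM overrides the acoustically favored misspelling
```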
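The weighted late fusion described in (3) can be sketched as follows. The command set and the weight values are illustrative assumptions, not values from the paper; in practice the weights could be tuned on validation data:

```python
import numpy as np

# Hypothetical command set shared by both recognizers.
COMMANDS = ["light_on", "light_off", "tv_on", "tv_off"]

def late_fuse(p_speech, p_gesture, w_speech=0.6, w_gesture=0.4):
    """Weighted late fusion of per-modality class probabilities.

    If one modality is unavailable, its weight can be set to zero so the
    other modality alone drives the decision.
    """
    p = w_speech * np.asarray(p_speech) + w_gesture * np.asarray(p_gesture)
    p /= p.sum()  # renormalize to a probability distribution
    return COMMANDS[int(np.argmax(p))], p

# Speech is fairly sure of "light_on"; gesture mildly agrees.
cmd, p = late_fuse([0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1])
# cmd == "light_on"
```

Letting the fused distribution, rather than either modality alone, pick the command is what allows one degraded input channel to be compensated by the other.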