Through in-depth research on single-channel speech separation algorithms, this paper analyzes the design concepts of existing algorithms and identifies two areas for improvement. First, the separation strategy is not flexible enough: existing algorithms typically fix the number of pure voices that can be separated. This idea solves the speech separation problem to a certain extent, but a model trained this way is too rigid to cope with changing usage scenarios. Second, existing methods over-rely on prior conditions: when solving the separation problem, researchers often assume priors such as a fixed number of speakers or pre-acquired voiceprint features of the target speaker. Such priors, however, limit the scenarios in which the model can be used. This paper studies both problems and gives corresponding solutions.

First, to address the inflexible separation strategy, this paper proposes Cond-Conv-TasNet, a speech separation algorithm that fuses auxiliary information. An auxiliary structure is designed to capture speaker information and obtain the voiceprint of the target speaker, and three fusion methods, Dot, FiLM, and CLN, are used to fuse this voiceprint with the hidden features in the model (a sketch of the three fusion styles is given below). This strengthens the target speaker's components within the hidden features and prompts the model to generate the target speaker's pure voice. In addition, when computing the loss, the multi-output optimization of PIT training is borrowed: besides the loss on the target speaker's voice, a loss on the non-target speaker's voice is also added (see the second sketch below). Experimental results show that Cond-Conv-TasNet outperforms the baseline Conv-TasNet, with improvements of 0.008, 0.285, and 0.448 in STOI, SI-SNR, and SDR, respectively.

Second, to reduce the over-reliance on prior conditions, the characteristics of sequence-to-sequence network structures are analyzed, and a sequence-to-sequence self-adaptive speech separation algorithm, Self-Adaption DPTNet (SA-DPTNet), is proposed. Combining a standard sequence-to-sequence model with the speech separation problem allows the model to recursively predict the output sequence (see the third sketch below). Given sufficient data and a known number of speakers, SA-DPTNet performs essentially on par with the baseline DPTNet, reaching 0.755, 8.073, and 8.736 in STOI, SI-SNR, and SDR, respectively. When the number of speakers is not provided, the model still achieves 0.699, 5.869, and 6.554. The algorithm therefore has practical value for speech separation problems in unknown scenarios.
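To make the three fusion methods concrete, the following is a minimal PyTorch sketch of how a target-speaker embedding might be fused with hidden separator features via Dot, FiLM, or CLN. The module name, dimensions, and placement inside Cond-Conv-TasNet are assumptions for illustration, not the paper's actual code.

```python
import torch
import torch.nn as nn


class SpeakerFusion(nn.Module):
    """Fuses a target-speaker embedding with hidden separator features.

    Hypothetical sketch: layer names, dimensions, and the exact placement
    of the fusion inside Cond-Conv-TasNet are assumptions.
    """

    def __init__(self, feat_dim: int, spk_dim: int, mode: str = "film"):
        super().__init__()
        self.mode = mode
        if mode == "dot":
            # Project the embedding to the feature dimension, then gate
            # the features by element-wise (dot-style) multiplication.
            self.proj = nn.Linear(spk_dim, feat_dim)
        elif mode == "film":
            # FiLM: predict a per-channel scale (gamma) and shift (beta).
            self.to_gamma = nn.Linear(spk_dim, feat_dim)
            self.to_beta = nn.Linear(spk_dim, feat_dim)
        elif mode == "cln":
            # Conditional LayerNorm: normalize first, then apply a
            # conditional scale/shift derived from the speaker embedding.
            self.norm = nn.LayerNorm(feat_dim, elementwise_affine=False)
            self.to_gamma = nn.Linear(spk_dim, feat_dim)
            self.to_beta = nn.Linear(spk_dim, feat_dim)
        else:
            raise ValueError(f"unknown fusion mode: {mode}")

    def forward(self, h: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
        # h: hidden features, shape (batch, time, feat_dim)
        # e: speaker embedding, shape (batch, spk_dim)
        if self.mode == "dot":
            return h * self.proj(e).unsqueeze(1)
        gamma = self.to_gamma(e).unsqueeze(1)
        beta = self.to_beta(e).unsqueeze(1)
        if self.mode == "film":
            return gamma * h + beta
        return gamma * self.norm(h) + beta  # cln
```

All three variants broadcast the speaker embedding over the time axis; they differ only in whether the conditioning acts as a multiplicative gate (Dot), an affine transform (FiLM), or an affine transform applied after normalization (CLN).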
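The second sketch illustrates the two-term training objective described above: a standard negative SI-SNR loss on the target speaker's voice plus a term on the non-target speaker's voice, echoing PIT's multi-output optimization. The weighting factor `alpha` is an assumption; the abstract does not specify how the two terms are combined.

```python
import torch


def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, samples) tensors."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference signal.
    s_target = (est * ref).sum(-1, keepdim=True) * ref \
        / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))


def joint_loss(est_target, ref_target, est_other, ref_other, alpha: float = 0.5):
    """Negative SI-SNR on the target speaker, plus a weighted term on the
    non-target speaker's voice. The weight alpha is an assumption made
    here for illustration; the paper does not state one.
    """
    loss_target = -si_snr(est_target, ref_target).mean()
    loss_other = -si_snr(est_other, ref_other).mean()
    return loss_target + alpha * loss_other
```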
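Finally, one plausible reading of "recursively predict the output sequence" without a known speaker count is to extract sources one at a time until the residual mixture is quiet. The sketch below is purely illustrative: the `model` interface, the energy-based stop criterion, and its threshold are all assumptions, not SA-DPTNet's actual mechanism.

```python
import torch


def recursive_separate(model, mixture: torch.Tensor,
                       max_speakers: int = 5,
                       stop_ratio: float = 0.01):
    """Recursively peel sources off a mixture until the residual is quiet.

    Hypothetical sketch of the self-adaptive idea: `model` is assumed to
    map a (batch, samples) mixture to one extracted source per call; the
    stop criterion and its threshold are illustrative.
    """
    sources = []
    residual = mixture
    mix_energy = mixture.pow(2).mean()
    for _ in range(max_speakers):
        est = model(residual)      # extract one source from the residual
        residual = residual - est  # remove it from the mixture
        sources.append(est)
        # Stop once the residual carries almost no energy relative to the
        # original mixture, i.e. no further speakers remain.
        if residual.pow(2).mean() < stop_ratio * mix_energy:
            break
    return sources
```

The key property this illustrates is that the number of separated outputs is decided at inference time by a stopping condition rather than fixed by the architecture, which is what frees the model from the fixed-speaker-count prior.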