With the development of multimedia technology and the Internet, emerging applications such as music communities, short-video platforms, social software, films, and games often use digital music as a medium of circulation and a profit-making tool. Traditional manual composition has become increasingly unable to meet the personalized composition needs of the music market in a timely manner, so automatic music generation by computer can effectively assist people in composing. Music conveys content and expresses emotional color through melody, rhythm, mode, and other elements. Since emotion is an integral part of the semantic expression of music, automatic music generation should consider not only the structural information of music but also incorporate emotional elements. Like natural language, music consists of a sequence of notes arranged in chronological order, so language models from natural language processing can be borrowed to build emotion-specific music generation models.

Existing models for emotional music generation have several drawbacks. First, most models use a fully supervised approach based on annotated data to generate music with a specific emotion. This approach relies heavily on labeled data, and the lack of large-scale, standardized emotion-labeled datasets in the music domain, together with the subjectivity of emotional labels, limits the accuracy of the emotional judgments made by generative models. Second, existing models lack emotional interpretability and emotional controllability. Interpretability means that the model can correlate basic music-theoretic features with emotion, giving the generated music a more precise emotional meaning; controllability means that users can autonomously change the emotional state of the output music during generation. Finally, most existing methods build generative models on recurrent neural networks such as GRU and LSTM, but music is long-sequence data containing a large amount of event information. These models have limited capacity on such data, tend to ignore structural dependencies between contexts, and suffer from vanishing or exploding gradients.

In response to the above problems, the work of this dissertation mainly includes the following aspects:

(1) A controllable music generation model (Control-VAE) based on the Variational Autoencoder (VAE) is proposed. The model introduces a disentanglement mechanism with encoder constraints and multi-task loss constraints to learn latent-variable representations of rhythmic and modal features separately from the original music data, so that controllable music generation can be achieved by manipulating the corresponding feature representations. In addition, the encoder and decoder of Control-VAE are built on the Transformer-XL network, whose segment-level recurrence mechanism and relative positional encoding allow the model to learn context dependencies across more bars of music and to attend to different features more effectively.
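To make the disentanglement mechanism concrete, the following is a minimal PyTorch sketch of the idea: the latent vector is split into a rhythm part and a mode part, and auxiliary multi-task heads that each see only their own sub-vector tie that sub-vector to the corresponding musical feature. A plain GRU stands in for the Transformer-XL encoder and decoder described above, and all class names, dimensions, and loss weights (ControlVAESketch, beta, aux_weight, and so on) are illustrative assumptions rather than the dissertation's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ControlVAESketch(nn.Module):
    def __init__(self, vocab_size=128, emb=128, hidden=256, z_rhythm=16, z_mode=16,
                 n_rhythm_classes=8, n_mode_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        z_dim = z_rhythm + z_mode
        self.to_mu = nn.Linear(hidden, z_dim)
        self.to_logvar = nn.Linear(hidden, z_dim)
        self.z_rhythm = z_rhythm
        # Auxiliary heads: each sees only its own latent sub-vector (the encoder constraint).
        self.rhythm_head = nn.Linear(z_rhythm, n_rhythm_classes)
        self.mode_head = nn.Linear(z_mode, n_mode_classes)
        self.decoder = nn.GRU(emb + z_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        pooled = h.mean(dim=1)
        mu, logvar = self.to_mu(pooled), self.to_logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        zr, zm = z[:, :self.z_rhythm], z[:, self.z_rhythm:]
        dec_in = tokens[:, :-1]                                    # predict each next token
        z_rep = z.unsqueeze(1).expand(-1, dec_in.size(1), -1)
        d, _ = self.decoder(torch.cat([self.embed(dec_in), z_rep], dim=-1))
        logits = self.out(d)                                       # (B, T-1, vocab)
        return logits, mu, logvar, self.rhythm_head(zr), self.mode_head(zm)

def control_vae_loss(logits, tokens, mu, logvar, rhythm_logits, mode_logits,
                     rhythm_labels, mode_labels, beta=0.1, aux_weight=1.0):
    rec = F.cross_entropy(logits.transpose(1, 2), tokens[:, 1:])
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Multi-task constraints tie each latent sub-vector to its musical feature.
    aux = F.cross_entropy(rhythm_logits, rhythm_labels) + F.cross_entropy(mode_logits, mode_labels)
    return rec + beta * kl + aux_weight * aux

Under this sketch, fixing the mode sub-vector while replacing the rhythm sub-vector with one encoded from another piece would change the rhythmic character of the output while preserving its mode, which is the kind of controlled manipulation described above.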
(2) A semi-supervised generative model (Semg-GMVAE) based on the Gaussian Mixture Variational Autoencoder (GMVAE) is proposed on the basis of Control-VAE. The model associates rhythmic and modal features with different emotions, and the latent-variable representations of these two features are inferred over emotional categories with a semi-supervised method so that they acquire the ability to express emotion. Music generation and emotion transformation targeting happy, excited, sad, and calm emotions can then be achieved by manipulating the feature representations during generation. In addition, an optimization method for the model's objective loss function is proposed, which re-weights the variance regularization term and the mutual-information suppression term in the evidence lower bound of the GMVAE (see the sketch at the end of this section) to improve the separation of the emotional category clusters in the latent space, so that the model can more easily learn and distinguish music data of different emotional categories.

The above work was evaluated experimentally on a symbolic music dataset containing partial emotional labels. The results show that Control-VAE successfully disentangles the rhythmic and modal feature representations of music, and that independently controlling either feature representation determines the corresponding characteristic of the generated music. Because Semg-GMVAE offers efficient semi-supervised learning, a strong connection between features and emotion, and sufficient dispersion among the different emotional category clusters, it reduces the dependence on labeled data; its generated music obtains the highest emotion prediction accuracy, and it also successfully achieves musical emotion transformation by manipulating features through the disentanglement mechanism.
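For contribution (2), the sketch below illustrates, only under stated assumptions, what a semi-supervised GMVAE-style objective with re-weighted terms can look like: each emotion category owns a Gaussian prior component in latent space, labeled pieces use their known category, and unlabeled pieces marginalize over the classifier q(y|x). The hypothetical weights var_penalty and mi_weight stand in for the dissertation's re-weighted variance regularization and mutual-information suppression terms, whose exact form is not specified here.

import math
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dimensions."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)

def semg_gmvae_loss(rec_nll, mu_q, logvar_q, y_logits, prior_mu, prior_logvar,
                    y_labels=None, var_penalty=2.0, mi_weight=0.5):
    """
    rec_nll:          (B,) reconstruction negative log-likelihood per piece
    mu_q, logvar_q:   (B, D) posterior q(z|x)
    y_logits:         (B, K) logits of the emotion classifier q(y|x)
    prior_mu/logvar:  (K, D) learnable parameters of each emotion's Gaussian component
    y_labels:         (B,) emotion indices for labeled pieces, or None for unlabeled batches
    """
    B, K = y_logits.shape
    q_y = F.softmax(y_logits, dim=-1)
    # KL(q(z|x) || p(z|y)) for every candidate emotion component, shape (B, K).
    kl_z = torch.stack(
        [gaussian_kl(mu_q, logvar_q, prior_mu[k], prior_logvar[k]) for k in range(K)], dim=1)
    if y_labels is not None:
        # Supervised branch: use the known component and add a classification loss.
        kl_term = kl_z.gather(1, y_labels.unsqueeze(1)).squeeze(1)
        cat_term = F.cross_entropy(y_logits, y_labels)
    else:
        # Unsupervised branch: expectation under q(y|x) plus KL(q(y|x) || uniform prior).
        kl_term = (q_y * kl_z).sum(dim=1)
        cat_term = (q_y * (q_y.clamp_min(1e-8).log() + math.log(K))).sum(dim=1).mean()
    # Re-weighted terms: var_penalty scales the Gaussian KL that keeps cluster variances in
    # check, and mi_weight scales the categorical term governing latent/label mutual information.
    return (rec_nll + var_penalty * kl_term).mean() + mi_weight * cat_term

Increasing var_penalty keeps each emotion cluster tight around its own prior component, while adjusting mi_weight changes how strongly the latent code is tied to the emotion category; together these are one plausible way to realize the improved cluster separation described above.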