Learning sentence representations is an important research topic in natural language processing (NLP). Sentence embeddings are used for most problems in this field, such as text clustering, text classification, and text generation, and their quality has a direct impact on the performance of downstream tasks. How to learn effective sentence embeddings is therefore a popular research question in NLP. In recent years, contrastive learning has shown outstanding effectiveness and has gradually come to play a central role in sentence representation learning. However, current contrastive-learning-based research on sentence embeddings focuses on improving model architectures and data augmentation techniques, while ignoring the redundant information contained in pre-training datasets, information that negatively affects the performance of downstream tasks. Motivated by this problem, this paper studies how to discard the redundant information in pre-training datasets and how to train models to learn superior sentence representations that perform better on downstream tasks. The main work and contributions of this paper include the following three points.

Research 1. To address the problem that existing contrastive-learning-based unsupervised sentence representation models ignore the redundant information in pre-training datasets, we design a reconstruction task that effectively discards this information. Starting from the information minimization principle of information theory, we derive the formula of the reconstruction task through a rigorous mathematical proof. The same sentence is passed through the sentence encoder twice to obtain a pair of positive instances, while the other sentences in the same mini-batch serve as negative instances. The proposed reconstruction task forces one positive instance to reconstruct the other, discarding the noisy information in the sentence embeddings. In the experimental section, we evaluate the proposed method with a main experiment and ablation studies. The results show that our method effectively improves performance on both unsupervised and supervised tasks, and our models achieve new state-of-the-art (SOTA) performance compared with previous models.
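The following is a minimal PyTorch-style sketch of the kind of objective described in Research 1: dropout-based positive pairs obtained by encoding the same batch twice, in-batch negatives scored with an InfoNCE loss, and an auxiliary reconstruction term between the two views. The encoder interface, the `reconstructor` head, the MSE form of the reconstruction term, and the temperature and weight values are illustrative assumptions, not the formula derived in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveWithReconstruction(nn.Module):
    """Illustrative objective: in-batch contrastive loss over dropout-based
    positive pairs, plus a reconstruction term between the two views.
    The encoder is assumed to return pooled sentence embeddings."""

    def __init__(self, encoder: nn.Module, hidden_dim: int = 768, temperature: float = 0.05):
        super().__init__()
        self.encoder = encoder                      # e.g. a BERT-style sentence encoder (assumed)
        self.temperature = temperature
        # Hypothetical reconstruction head: maps one view onto the other.
        self.reconstructor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, batch_inputs, recon_weight: float = 0.1):
        # Two forward passes with dropout active yield two views of each sentence.
        z1 = self.encoder(batch_inputs)             # (batch, hidden_dim)
        z2 = self.encoder(batch_inputs)             # (batch, hidden_dim)

        # InfoNCE: each z1[i] should match z2[i]; the other sentences in the
        # mini-batch act as negatives.
        sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / self.temperature
        labels = torch.arange(z1.size(0), device=z1.device)
        contrastive = F.cross_entropy(sim, labels)

        # Reconstruction term (assumed form): predict the second view from the
        # first, discouraging the encoder from keeping view-specific noise.
        recon = F.mse_loss(self.reconstructor(z1), z2.detach())
        return contrastive + recon_weight * recon
```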
Research 2. Existing unsupervised sentence representation studies use real data as the training set, but the bias and privacy issues contained in real data, as well as the difficulty of collecting high-quality data, limit the application of such models in practice. To address this, we propose the first solution that uses generative models as the source of training data for sentence representation learning. Specifically, this study first generates a large amount of synthetic data with a language generation model. It then uses continuous-space and discrete-space data augmentation to construct positives and negatives, and compares the proposed discrete-space augmentation with other augmentation methods. The study also systematically compares models trained on real datasets, synthetic datasets, and mixed real-synthetic datasets. The experimental results confirm the effectiveness of using synthetic datasets to train the models.

Research 3. Existing sentence representation models that rely on negative instances may suffer from false negatives and are not easy to implement in practice. To address these shortcomings, this paper studies sentence representation models that do not use negative instances, aiming to improve performance on various downstream tasks while eliminating the drawbacks that negative instances bring to sentence representation learning. In terms of model design, we adopt a Siamese network with two branches whose encoders share parameters. An input sentence is passed to the encoder with a specific dropout rate, and the correlation between the hidden outputs of the two branches is enhanced through a cross-correlation matrix, which prevents the learned embeddings from degenerating into a trivial solution (a minimal sketch of this objective is given after the concluding paragraph). We further perform a prediction task on the encoder outputs to forget the redundant information stored in the embeddings. We also find that adding an MLP layer on top of the encoder during both training and testing brings additional gains on downstream tasks. We carefully analyze the important components of the model and conduct extensive experiments on both unsupervised and supervised tasks. The results show that each of the proposed methods improves the performance of the model.

Sentence representation learning is a critical research topic pursued by many scholars in the NLP field. While it is important to improve the learned embeddings through better model architectures, data augmentation, and pretext tasks, it is equally important to examine the problem from the standpoint of the pre-training dataset. Starting from the viewpoint of information theory, this paper considers what conditions ideal sentence embeddings learned from pre-training datasets should satisfy, and explores solutions within three types of contrastive learning frameworks: reconstruction between positive samples, the use of generative models as a data source, and the removal of negative samples from the model. The studies conducted in this paper encourage a rethinking of the dataset and provide a new perspective on improving the quality of sentence embeddings.
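As referenced in Research 3 above, the following is a minimal sketch of a negative-free, cross-correlation-based objective in the spirit of that design: two dropout views of the same batch come from a shared (Siamese) encoder, and their cross-correlation matrix is pushed toward the identity so the views align without any negative instances and without collapsing to a trivial solution. The function name, the standardization step, and the off-diagonal weight are illustrative assumptions rather than the exact formulation used in the thesis.

```python
import torch

def cross_correlation_loss(z1: torch.Tensor, z2: torch.Tensor, off_diag_weight: float = 5e-3):
    """Cross-correlation objective on two dropout views of the same batch:
    diagonal entries are pushed toward 1 (align the views) and off-diagonal
    entries toward 0 (decorrelate embedding dimensions)."""
    n, _ = z1.shape
    # Standardize each dimension across the batch before correlating.
    z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
    z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)

    c = (z1.T @ z2) / n                                          # (d, d) cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()             # align the two views
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate dimensions
    return on_diag + off_diag_weight * off_diag


# Usage sketch with a shared encoder and an MLP head; both stand in for the
# models used in the thesis and are assumptions of this example.
# z1 = projector(encoder(batch))   # first pass, dropout on
# z2 = projector(encoder(batch))   # second pass, different dropout mask
# loss = cross_correlation_loss(z1, z2)
```

Driving the off-diagonal entries toward zero is what removes the need for explicit negatives in this family of objectives: redundancy between embedding dimensions, rather than similarity to other sentences, is what gets penalized.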