Representation of word meaning has long been a fundamental task in natural language processing. Word embeddings have been used in many natural language processing tasks, for example POS tagging and named entity recognition. Traditional methods treat each word as an atomic symbol, which is not capable of modeling the semantic and syntactic relationships between two words. Distributed representations encode words as low-dimensional real-valued vectors, so the semantic relation between two words can be represented by the distance between their corresponding word embeddings; they have thus become the most popular word representation. Despite extensive work on word embeddings, problems remain with low-frequency words. In this thesis, we discuss the following questions: (1) why word embeddings of low-frequency words are less effective; (2) how to employ the internal information of words to improve low-frequency word embeddings in Chinese; (3) a universal, language-independent method to boost the performance of low-frequency words.

The main content is as follows. We propose an average-similarity-based metric built on distributed word representations; experiments across different training algorithms, corpora, and languages show that the observed relation is stable. We further propose a method to distinguish low-frequency words and apply it to design a similarity metric; experiments on word similarity show a 0.02 to 0.05 performance improvement over cosine similarity. We then use the radicals of Chinese characters to boost the performance of low-frequency words in Chinese: radicals often convey meaning, so we share radical weights between low-frequency and high-frequency words, and experiments show a 0.02 performance increase. Finally, we propose a pseudo-context method that is language independent: we exploit the contexts of other words as contexts for a low-frequency word to augment its training data, thereby boosting its performance.
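The thesis does not spell out the average-similarity metric here, so the following is only a minimal sketch of one plausible reading: for a given word, average its cosine similarity against every vector in the vocabulary. All names and the toy embedding matrix are illustrative assumptions, not the thesis's actual definition.

```python
import numpy as np

def average_similarity(word_vec, vocab_matrix):
    """Average cosine similarity of one word's vector to every row
    (word vector) in the vocabulary matrix. Hypothetical sketch."""
    norms = np.linalg.norm(vocab_matrix, axis=1)
    sims = vocab_matrix @ word_vec / (norms * np.linalg.norm(word_vec))
    return float(sims.mean())

# Toy embeddings: 4 words in 3 dimensions (random, for illustration only).
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 3))
print(average_similarity(E[0], E))  # a value in [-1, 1]
```

Under this reading, one could compare the average-similarity statistic across frequency bands to probe how embedding quality varies with word frequency.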
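The radical-sharing idea can be pictured as composing a word's vector from a word-level vector plus radical vectors that are shared across the whole vocabulary, so a rare word borrows statistics learned from frequent words with the same radicals. The composition below is a hedged sketch under assumed names and an assumed interpolation weight `alpha`; the thesis's actual weight-sharing scheme may differ.

```python
import numpy as np

DIM = 8
rng = np.random.default_rng(1)
word_vecs = {"猫": rng.normal(size=DIM)}      # word-level embedding (assumed)
radical_vecs = {"犭": rng.normal(size=DIM)}   # radical embedding, shared across words
word_radicals = {"猫": ["犭"]}                # radical lookup table (assumed)

def compose(word, alpha=0.5):
    """Interpolate between the word vector and the mean of its shared
    radical vectors. alpha is an illustrative mixing weight."""
    rads = word_radicals.get(word, [])
    if not rads:
        return word_vecs[word]
    rad_mean = np.mean([radical_vecs[r] for r in rads], axis=0)
    return (1 - alpha) * word_vecs[word] + alpha * rad_mean

print(compose("猫").shape)  # (8,)
```

Because `radical_vecs` is shared, updates driven by high-frequency words also move the representations of low-frequency words containing the same radical.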
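The pseudo-context idea, borrowing other words' contexts as extra training data for a rare word, can be sketched as follows. Here the "donor" word, the substitution strategy, and the toy corpus are assumptions for illustration; the thesis's criterion for choosing which contexts to borrow is not specified in this abstract.

```python
def pseudo_contexts(corpus, rare_word, donor_word, max_extra=100):
    """Copy each sentence containing the donor word and substitute the
    rare word in its place, yielding extra training contexts."""
    extra = []
    for sentence in corpus:
        if donor_word in sentence:
            extra.append([rare_word if tok == donor_word else tok
                          for tok in sentence])
            if len(extra) >= max_extra:
                break
    return extra

corpus = [["the", "cat", "sat"], ["a", "cat", "ran"], ["dogs", "bark"]]
print(pseudo_contexts(corpus, "felid", "cat"))
# → [['the', 'felid', 'sat'], ['a', 'felid', 'ran']]
```

The augmented sentences would then be appended to the training corpus before (re)training the embeddings, giving the rare word many more context windows than it has naturally.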