
Mining Of Semantic Similar Items Based On Cross-Language Mapping

Posted on: 2020-01-12
Degree: Master
Type: Thesis
Country: China
Candidate: P Han
Full Text: PDF
GTID: 2428330578467298
Subject: Computer technology
Abstract/Summary:
With the blending of world cultures and the dissemination of information in many forms, text information processing remains one of the most important research and application fields in information technology. In recent years, as languages have been promoted across countries and scholars at home and abroad have published in foreign journals, research on cross-language mapping has become particularly important. Cross-language mapping is the problem of acquiring knowledge across language boundaries, and it belongs to the field of data mining.

To improve the clustering of similar Chinese vocabulary, and after studying how English vocabulary is constructed, we propose a Chinese-to-English cross-language mapping: Chinese words are mapped into English for processing, and the Chinese semantically similar items are then obtained from the result. The work draws on cross-language mapping, lexical semantic similarity calculation, feature value extraction, and clustering algorithms from data mining. A bilingual dictionary is established, and semantic similarity mining is performed on the Chinese vocabulary collected through a web crawler.

The research content of this paper is divided into the following points:

1. Creating a bilingual dictionary for cross-language mapping. A web crawler is written around the established local Chinese vocabulary. Each local Chinese word is used as a keyword to query an online dictionary, and the HTML file returned for each word is collected. Data preprocessing (data cleaning and data integration of the HTML files) then produces key-value pairs based on the online dictionary, and the cross-language mapping dictionary is built from these key-value pairs. The results show that the bilingual dictionary we built contains a larger vocabulary and is better suited to subsequent experiments.

2. Extraction of feature values. Building on a study of word2vec, the CBOW model is adjusted and used to vectorize the values of the dictionary. The resulting vector matrix is processed with the PCA algorithm to obtain the feature values.

3. Clustering. The K-means clustering algorithm computes distances between the feature values and iterates to obtain their clustering. The correspondence between feature values and words then yields the clusters of English vocabulary, and the dictionary keys are looked up from these English values to obtain the sets of semantically similar Chinese vocabulary.

The research in this paper not only obtains sets of semantically similar Chinese vocabulary but also establishes a rich Chinese-English bilingual dictionary, which has practical significance and helps to promote the development of Chinese text processing in the field of natural language processing...
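The third step (cluster the English values, then map each cluster back to the Chinese keys) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis implementation: the four dictionary entries, the 2-D vectors standing in for PCA-reduced CBOW features, and the helper names `kmeans` and `similar_chinese_sets` are all hypothetical, and the K-means here is a plain Lloyd iteration seeded with the first k points.

```python
from collections import defaultdict
from math import dist

# Hypothetical toy bilingual dictionary: Chinese key -> English value.
bilingual = {"高兴": "happy", "快乐": "joyful", "悲伤": "sad", "难过": "sorrowful"}

# Hypothetical 2-D feature vectors for the English values, standing in
# for CBOW word vectors after PCA reduction.
features = {
    "happy": (0.9, 0.8),
    "joyful": (1.0, 0.9),
    "sad": (-0.9, -0.8),
    "sorrowful": (-1.0, -0.7),
}


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def similar_chinese_sets(bilingual, features, k):
    """Cluster the English values, then group Chinese keys by cluster."""
    english = sorted(set(bilingual.values()))
    labels = kmeans([features[word] for word in english], k)
    cluster_of = dict(zip(english, labels))
    groups = defaultdict(set)
    for chinese, word in bilingual.items():
        groups[cluster_of[word]].add(chinese)
    return list(groups.values())


result = similar_chinese_sets(bilingual, features, k=2)
```

On this toy data the two returned sets pair 高兴 with 快乐 and 悲伤 with 难过. With real data, the vectors would come from the trained embedding model, and a library K-means with multiple random restarts would be a more robust choice than first-k seeding.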
Keywords/Search Tags:Cross-language mapping, lexical vectorization, CBOW model, similarity calculation, clustering algorithm