A Method For Text Similarity Measurement With TF-IDF And Word Semantic Information

Posted on:2016-04-25

Degree:Master

Type:Thesis

Country:China

Candidate:Z M Wang

GTID:2298330470950399

Subject:Network information security

Abstract/Summary:

Text similarity measures quantify the degree of semantic similarity between two texts. They are an important task in Natural Language Processing (NLP) and the foundation of many downstream applications. They are widely used in fields such as duplicate text detection, image retrieval, information retrieval, automatic text generation, and text classification.

Traditional text similarity measures fall into two kinds: those based on statistics and those based on semantic analysis. The former generally treat a text as a set of words, count how often each term occurs in the collection and in each text, use this word frequency information to model each text as a vector, and then compute the similarity between texts with the cosine similarity between vectors or the Jaccard coefficient. The latter usually rely on a domain-specific semantic dictionary to construct the semantic relations between words. The more common and complete semantic dictionaries include WordNet, used to study word sense disambiguation; HowNet, used to study sentence and semantic similarity; and Tongyici Cilin (a Chinese synonym thesaurus), used to calculate the similarity between sentences. The disadvantage of statistical methods is that they ignore the meaning of the terms in a text and the semantic relations between words. In addition, for large texts the number of terms makes the vector representation high-dimensional and sparse. Methods based on semantic analysis need a large knowledge base to build the semantic relations between words. Although they can extend statistical methods with some semantic information about terms, they further increase the dimension of the text vector, so they also fail to reflect the similarity between two texts well.

TF-IDF is a traditional statistical method for text similarity measurement. It models each text as a vector of weighted word frequencies and then uses the cosine similarity to calculate the similarity between texts. This thesis builds on the TF-IDF model and additionally analyzes the semantic information of the keywords in a text, giving a new method of text similarity measurement. The method first preprocesses the text with natural language processing techniques and then uses TF-IDF to find the keywords with the highest TF-IDF values in each text. It then analyzes these words with an external dictionary and, combining a weighted tree similarity with a definition of text semantic similarity, computes the degree of similarity between two texts. Finally, text clustering experiments on a benchmark data set compare the methods. The experimental results show that, in terms of accuracy, recall, and the macro-average of the evaluation indexes, the proposed method is better than both the TF-IDF method and another method based on word semantic similarity (called WRSim in the thesis), which further verifies its effectiveness.
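The statistical baseline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: the weighting (tf as relative frequency, idf as log(N/df)), the whitespace tokenizer, and the helper names `tokenize`, `tf_idf_vectors`, `cosine_similarity`, `jaccard`, and `top_keywords` are all assumptions made for the example. The dictionary-based semantic step and the weighted tree similarity are not specified in the abstract and are not reproduced here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    """Jaccard coefficient of the two texts' word sets."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def top_keywords(vector, k):
    """The k terms with the highest TF-IDF weight (ties broken alphabetically)."""
    ranked = sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))
print(jaccard(docs[0], docs[1]))
print(top_keywords(vectors[0], 2))
```

The sketch also makes the abstract's criticism concrete: two texts that share no surface terms receive a cosine similarity of exactly zero, however close their meanings are, which is the gap the keyword-level semantic analysis is meant to close.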

Keywords/Search Tags:

Semantic analysis, Text classification, Text similarity measurement, Term frequency-inverse document frequency

Related items

1	Research On Text Classification Based On Deep Learning
2	Research On Feature Selection Algorithm Based On Segmented Term Frequency In Text Classification
3	Research On Chinese Text Classification Based On Semantic Analysis
4	Research And Application Of Feature Selection Based On Term Frequency Reordering Of Document Level
5	Research On Short Text Similarity Measure Based On Semantic Coupling
6	Big Data Cleaning Algorithm Researches And System Platform Construction
7	Research On Chinese Information Classification Based On Improved Bayesian Algorithms
8	Research And Implementation On Text Classification In Vertical Domain
9	The Research And Application On Text Similarity Measurement Based On Semantic Analysis
10	Research Of Feature Vector Value Weighted Based On Semantic Analysis In Chinese Text Clustering