How to calculate sentence similarity with word2vec

How do you calculate sentence similarity using Gensim's word2vec model with Python?


According to the Gensim Word2Vec documentation, I can use the word2vec model in the Gensim package to calculate the similarity between two words.

e.g.
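For example, with a pretrained model loaded into gensim (a minimal sketch; the file name below assumes the GoogleNews vectors mentioned further down):

    from gensim.models import KeyedVectors

    # load pretrained vectors in word2vec format (path is an assumption)
    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

    # similarity between two words, e.g. around 0.7 for this pair
    print(model.similarity('woman', 'man'))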

However, the word2vec model cannot predict sentence similarity. I found the LSI model with sentence similarity in Gensim, but it doesn't seem to be combinable with the word2vec model. The sentences in my corpus are short (fewer than 10 words each). So is there an easy way to achieve this?





Answers:


This is actually a pretty challenging problem you are asking about. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g., "he walked to the store yesterday" and "yesterday, he walked to the store"), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding statistical co-occurrences / relationships in lots of real textual examples, etc.

The simplest thing you could try - though I don't know how well this would work, and it would certainly not give you optimal results - would be to first remove all "stop" words (words like "the", "a", etc. that don't add much meaning to the sentence), then run word2vec on the words in both sentences, sum up the vectors in the one sentence, sum up the vectors in the other sentence, and then find the difference between the sums. By summing them up rather than doing a word-wise difference, you'll at least not be subject to word order. That being said, this will fail in lots of ways and isn't a good solution by any means (though good solutions to this problem almost always involve some amount of NLP, machine learning, and other cleverness). A rough sketch of this idea follows.
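A minimal sketch of that idea, assuming model is a loaded gensim KeyedVectors instance and using a toy stop-word list:

    import numpy as np

    stop_words = {'the', 'a', 'an', 'that', 'this', 'of', 'to'}  # toy stop list

    def sum_vector(sentence, model):
        # sum the vectors of all non-stop words the model knows
        words = [w for w in sentence.lower().split()
                 if w not in stop_words and w in model]
        return np.sum([model[w] for w in words], axis=0)

    v1 = sum_vector('he walked to the store yesterday', model)
    v2 = sum_vector('yesterday he walked to the store', model)
    print(np.linalg.norm(v1 - v2))  # difference of the sums; 0.0 here, same words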

So the short answer is no, there is no easy way to do this (at least not to do it well).





Since you are using gensim, you should probably use its doc2vec implementation. doc2vec is an extension of word2vec to the phrase, sentence, and document level. It's a fairly straightforward extension, described here:

http://cs.stanford.edu/~quocle/paragraph_vector.pdf

Gensim is nice because it's intuitive, fast, and flexible. What's great is that you can grab the pretrained word embeddings from the official word2vec page, and the syn0 layer of gensim's Doc2Vec model is exposed, so you can seed the word embeddings with these high-quality vectors!

GoogleNews-vectors-negative300.bin.gz (as linked in Google Code)

I think Gensim is definitely the easiest (and for me the best so far) tool for embedding a sentence in vector space.

There are sentence-to-vector techniques other than the one suggested in Le & Mikolov's paper above. Socher and Manning from Stanford are certainly two of the most famous researchers working in this area. Their work is based on the principle of compositionality - the semantics of the sentence come from the meanings of its words and the rules by which those words are combined.

They have proposed several such models (of increasing complexity) that use compositionality to build sentence-level representations.

2011 - Unfolding Recursive Autoencoder (comparatively simple; start here if you're interested)

2012 - Matrix-Vector Neural Network

2013 - Neural Tensor Network

2015 - Tree LSTM

His papers are all available at socher.org. Some of these models are available as code, but I would still recommend gensim's doc2vec. For one thing, the 2011 URAE isn't particularly powerful. In addition, it comes pretrained with weights suited to paraphrasing news-y data. The code he provides does not allow you to retrain the network, and you also can't swap in different word vectors, so you're stuck with the pre-word2vec embeddings from Turian (2011). These vectors are certainly not on the level of word2vec or GloVe.

I haven't worked with the Tree LSTM yet, but it looks very promising!

tl;dr: Yes, use gensim's doc2vec. But other methods do exist!



When using word2vec, you need to calculate the average vector for all the words in each sentence/document and use cosine similarity between the vectors:
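One common way to write this, as a sketch (num_features must match the model's vector size, and index2word_set is the model's vocabulary as a set):

    import numpy as np

    def avg_feature_vector(sentence, model, num_features, index2word_set):
        # average the vectors of all in-vocabulary words in the sentence
        feature_vec = np.zeros((num_features,), dtype='float32')
        n_words = 0
        for word in sentence.split():
            if word in index2word_set:
                n_words += 1
                feature_vec = np.add(feature_vec, model[word])
        if n_words > 0:
            feature_vec = np.divide(feature_vec, n_words)
        return feature_vec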

Calculate similarity:
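For example (assuming 300-dimensional vectors and the avg_feature_vector helper above; the vocabulary attribute is index_to_key in gensim 4.x, index2word in 3.x):

    from scipy import spatial

    index2word_set = set(model.index_to_key)

    s1_afv = avg_feature_vector('this is a sentence', model, 300, index2word_set)
    s2_afv = avg_feature_vector('this is also a sentence', model, 300, index2word_set)

    # cosine similarity between the two average vectors
    sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
    print(sim)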






You can use the Word Mover's Distance algorithm. Here is an easy description of WMD.
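A minimal sketch using gensim's built-in wmdistance (assuming a loaded word2vec model; lower distance means more similar):

    sentence_1 = 'Obama speaks to the media in Illinois'.lower().split()
    sentence_2 = 'the president greets the press in Chicago'.lower().split()

    # Word Mover's Distance between the two token lists
    distance = model.wmdistance(sentence_1, sentence_2)
    print(distance)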

PS: If you face an error importing the pyemd library, you can install it with the following command:
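pyemd is on PyPI, so the usual install is:

    pip install pyemd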




Once you have the sum of the two sets of word vectors, you should take the cosine between the vectors, not the diff. The cosine can be computed by taking the dot product of the two vectors after normalizing them. Thus, the word count is not a factor.
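A tiny numpy sketch of that:

    import numpy as np

    def cosine(u, v):
        # dot product of the two L2-normalized vectors
        return np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))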



There is a function from the documentation that takes a list of words and compares their similarities.
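For example (assuming a trained gensim Word2Vec model; in gensim 4.x the method lives on model.wv):

    s1 = 'This room is dirty'
    s2 = 'dirty and disgusting room'

    # n_similarity compares two lists of words
    similarity = model.wv.n_similarity(s1.lower().split(), s2.lower().split())
    print(similarity)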


I want to update the existing solutions to help people who are going to calculate the semantic similarity of sentences.

Step 1:

Load the appropriate model using gensim and calculate the word vectors for the words in the sentence, storing them as a word list.
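A sketch of this step (the model file is an assumption; out-of-vocabulary words are simply skipped):

    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

    sentence = 'this is a test sentence'
    # word vectors for all in-vocabulary words, kept as a list
    word_vecs = [model[w] for w in sentence.lower().split() if w in model]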

Step 2: Calculate the sentence vector

Calculating semantic similarity between sentences used to be difficult, but recently a paper called "A SIMPLE BUT TOUGH-TO-BEAT BASELINE FOR SENTENCE EMBEDDINGS" was proposed, which suggests a simple approach: compute the weighted average of the word vectors in the sentence, then remove the projections of those average vectors onto their first principal component. Here the weight of a word w is a / (a + p(w)), where a is a parameter and p(w) is the (estimated) word frequency; this is called the smooth inverse frequency. This method performs significantly better.

Simple code to calculate the sentence vector using SIF (smooth inverse frequency), the method proposed in the paper, has been given here.
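A rough sketch of the idea (not the authors' released code; word_freq is assumed to map each word to its estimated corpus frequency p(w)):

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    def sif_sentence_vectors(tokenized_sentences, model, word_freq, a=1e-3):
        # weighted average: each word vector is weighted by a / (a + p(w))
        vecs = []
        for words in tokenized_sentences:
            weighted = [(a / (a + word_freq.get(w, 0.0))) * model[w]
                        for w in words if w in model]
            vecs.append(np.mean(weighted, axis=0))
        X = np.array(vecs)

        # remove the projection onto the first principal component
        svd = TruncatedSVD(n_components=1)
        svd.fit(X)
        pc = svd.components_
        return X - X.dot(pc.T) * pc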

Step 3: Load the two sentence vectors and calculate the similarity using sklearn's cosine_similarity.
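For example (v1 and v2 being the two sentence vectors from Step 2):

    from sklearn.metrics.pairwise import cosine_similarity

    sim = cosine_similarity(v1.reshape(1, -1), v2.reshape(1, -1))[0][0]
    print(sim)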

This is the easiest and most efficient way to calculate sentence similarity.



I use the following method and it works well. You first need to run a POS tagger and then filter your sentence to remove the stop words (determiners, conjunctions, ...). I recommend TextBlob's APTagger. Then you build a word2vec representation by taking the mean of each word vector in the sentence. The n_similarity method in Gensim word2vec does exactly that, by allowing you to pass two sets of words to compare. A sketch of this pipeline follows.
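A sketch of that pipeline (TextBlob's .tags yields (word, POS) pairs; which tag prefixes count as content words is my assumption here):

    from textblob import TextBlob

    def content_words(sentence):
        # keep nouns, verbs, adjectives, adverbs; drop determiners, conjunctions, ...
        return [w.lower() for w, tag in TextBlob(sentence).tags
                if tag[:2] in ('NN', 'VB', 'JJ', 'RB')]

    sim = model.n_similarity(content_words('This room is dirty'),
                             content_words('dirty and disgusting room'))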







Gensim implements a model called Doc2Vec for embedding paragraphs.
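A minimal train-and-infer sketch (hyperparameters are arbitrary placeholders):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    sentences = ['this is a test', 'this is another test']
    docs = [TaggedDocument(words=s.split(), tags=[i])
            for i, s in enumerate(sentences)]

    model = Doc2Vec(docs, vector_size=100, min_count=1, epochs=40)

    # infer vectors for unseen sentences and compare them as usual
    v1 = model.infer_vector('this is a test'.split())
    v2 = model.infer_vector('completely different words here'.split())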

There are several tutorials presented as IPython notebooks:

Another method based on Word2Vec and Word Mover's Distance (WMD) as shown in this tutorial:

An alternative solution would be to rely on average vectors:

If you can run Tensorflow, you can try: https://tfhub.dev/google/universal-sentence-encoder/2


I have tried the methods provided by the previous answers. They work, but their main drawback is that the longer the sentences, the larger the similarity (to calculate the similarity I use the cosine score of the two mean embeddings of any two sentences), since the more words there are, the more positive semantic effects get added to the sentence.

I thought I should change my mind and use sentence embeddings instead, as studied in this paper and this one.


The Facebook Research group has released a new solution called InferSent. Results and code are published on GitHub; check out their repo. It's pretty awesome. I'm planning to use it. https://github.com/facebookresearch/InferSent

Their paper: https://arxiv.org/abs/1705.02364 Abstract: Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference datasets can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features that can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks. Our encoder is publicly available.


If you are not using Word2Vec, you can use BERT to obtain the embeddings instead. Below is the reference link: https://github.com/UKPLab/sentence-transformers

Another link to follow: https://github.com/hanxiao/bert-as-service
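A minimal sketch with the sentence-transformers package (the model name is just one common choice):

    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(['This room is dirty',
                               'dirty and disgusting room'])

    sim = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
    print(sim)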
