Why Vectors? And the Importance of Cosine Similarity

Jonathan Schein
Aug 30, 2020

Before I begin, I would like to define a few technical/math terms that will be used throughout this article. The purpose of this blog is to bring some clarity to why we use vectors to describe data in data science problems.

  • Scalar: A scalar is a single number (positive, negative, or zero). It represents a magnitude, which is the size of something. One example of a scalar quantity is temperature. The key thing a scalar is missing is a direction.
  • Magnitude: Size
  • Vector: A vector has a magnitude, but it also has a direction. Velocity and force are familiar examples of vector quantities, since each one pairs a size with a direction.

Example:

  • If a car drove at 30 miles per hour, then that speed would be considered a scalar.
  • If a car drove at 30 miles per hour heading north, then that velocity would be considered a vector.

Example:

  1. If we’re trying to figure out which two players on an NBA team are most similar in physical characteristics (i.e. height and weight) then we can do this using vectors.
  2. If we’re trying to figure out which two countries are most similar using their GDP and their life expectancy, we can do the same thing with vectors.

GDP and life expectancy are each quantities without direction, so on their own they are scalars. But when you describe a country by both numbers at once, you are placing it as a point in a two-dimensional space, and the arrow from the origin to that point is a vector. Each country therefore has both a magnitude (how far it sits from the origin) and a direction (where it points relative to the other countries).

This is why every observation in a data set can be treated as a vector. All these observations live in a “vector space,” and each vector occupies its own position in this theoretical space. This has a lot of benefits, but the main one is that we are now able to find patterns between the different data observations/vectors in the vector space.
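For instance, here is a minimal sketch of the country example above, using NumPy and made-up numbers purely for illustration:

import numpy as np

# Hypothetical, made-up values purely for illustration:
# [GDP per capita in thousands of USD, life expectancy in years]
country_a = np.array([65.0, 78.5])
country_b = np.array([40.0, 82.0])

# Both countries now live in the same two-dimensional vector space,
# so we can compare them geometrically (angles, distances, etc.).
print(country_a, country_b)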

What does this do?

Looking at these vectors in the vector space and analyzing them can help one find patterns and reach conclusions. This is where cosine similarity comes into play. The main way to identify similar data points is by the angle between them: the smaller the angle between two vectors, the more similar they are (see the short sketch after the list below).

  • 0 degrees: If two vectors point in the same direction, the angle between them is close to 0 degrees, which means they are very similar.
  • 180 degrees: If two vectors point in completely opposite directions, the angle between them is close to 180 degrees, which shows that these two data observations are complete opposites.
  • 90 degrees (orthogonal): If two vectors have a 90 degree angle between them, there is no relationship between them.
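Here is a minimal sketch of those three cases, using scikit-learn's cosine_similarity with toy 2-D vectors chosen purely for illustration:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy 2-D vectors chosen only to illustrate the three cases above
a = np.array([[1.0, 2.0]])
same_direction = np.array([[2.0, 4.0]])   # angle of 0 degrees
orthogonal = np.array([[-2.0, 1.0]])      # angle of 90 degrees
opposite = np.array([[-1.0, -2.0]])       # angle of 180 degrees

print(cosine_similarity(a, same_direction))  # [[1.]]
print(cosine_similarity(a, orthogonal))      # [[0.]]
print(cosine_similarity(a, opposite))        # [[-1.]]

A value of 1 corresponds to an angle of 0 degrees, 0 to 90 degrees, and -1 to 180 degrees.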

Cosine Similarity

The reason we use cosine similarity is that this formula gives the cosine of the angle between two vectors (or two data observations). The formula for calculating the cosine similarity is below:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

It is the dot product of the two vectors divided by the product of the lengths (magnitudes) of the two vectors.
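As a rough sketch of that formula, assuming two arbitrary example vectors and using NumPy for the dot product and norms:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Dot product of the two vectors divided by the product of their magnitudes (lengths)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine)  # roughly 0.97, i.e. a small angle between a and b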

Example using Scikit Learn and NLP:

Step #1: “Import” the documents or write out the strings

documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun, the bright sun")

Step #2: Import the TF-IDF vectorizer and then transform the document into the TF-IDF matrix

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

Step #3: Measure the cosine similarity between the first string/document and all the others.

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)

Result: array([[1. , 0.36651513, 0.52305744, 0.13448867]])
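The first value is 1.0 because the first document is being compared with itself; among the others, the third document ("The sun in the sky is bright") scores highest. As a small follow-up sketch (continuing from the code above), one way to pull out the most similar document programmatically while ignoring the self-match:

import numpy as np

similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix).flatten()
similarities[0] = -1  # skip the first document's comparison with itself
most_similar_index = int(np.argmax(similarities))
print(documents[most_similar_index])  # "The sun in the sky is bright"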

Conclusion

  • Cosine similarity is a very useful way to measure the similarity between two vectors.
  • In this context, vectors are simply a way of representing data observations numerically.
  • There are many applications of this idea, but NLP and recommendation systems are two very important ones.
