# Find the similarity metric between two strings

How do I get the probability of a string being similar to another string in Python?

I want to get a decimal value like 0.9 (meaning 90%) etc. Preferably using the standard library.

e.g.

```
similar("Apple","Appel")   # would have a high prob.
similar("Apple","Mango")   # would have a lower prob.
```

69

There is a built-in, `difflib.SequenceMatcher`.

```
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
```

Using it:

```
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
```
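Note that `SequenceMatcher.ratio()` is case-sensitive. A common tweak (my addition, not part of the answer above) is to lower-case both strings first; the snippet repeats `similar` so it is self-contained:

```python
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar("Apple", "apple"))                   # 0.8: penalised for the differing case
print(similar("Apple".lower(), "apple".lower()))   # 1.0: identical after lower-casing
```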
Tuesday, June 1, 2021

42

The best package I've seen for this is Gensim, found at the Gensim homepage. I've used it many times, and overall I've been very happy with its ease of use; it is written in Python and has an easy-to-follow tutorial to get you started, which compares nine strings. It can be installed via pip, so you shouldn't have much hassle getting it installed.

Which scoring algorithm you use depends heavily on the context of your problem, but I'd suggest starting off with the LSI functionality if you want something basic. (That's what the tutorial walks you through.)

If you go through the tutorial for gensim, it will walk you through comparing two strings using the Similarities function. This will allow you to see how your strings compare to each other, or to some other string, on the basis of the text they contain.

If you're interested in the science behind how it works, check out this paper.
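Gensim's API has changed across versions, so rather than pin an example to a particular release, here is a dependency-free sketch of the underlying idea (bag-of-words vectors compared by cosine similarity); the function name is my own, purely illustrative:

```python
import math
from collections import Counter

def bow_cosine(text_a, text_b):
    """Cosine similarity of simple bag-of-words word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(bow_cosine("the cat sat", "the cat stood"))  # 2 of 3 words shared: ~0.667
print(bow_cosine("apple", "mango"))                # no words shared: 0.0
```

Gensim's LSI pipeline goes further (it maps these sparse vectors into a lower-dimensional topic space before comparing), but the comparison step is the same cosine measure.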

Saturday, July 10, 2021

90
```
max(abs(x - y) for (x, y) in zip(values[1:], values[:-1]))
```
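For context: assuming `values` is a list of numbers, this one-liner computes the largest absolute difference between consecutive elements (the example data is my own):

```python
# Largest gap between consecutive elements: zip pairs each element
# with its predecessor, then max takes the biggest absolute difference.
values = [1, 4, 9, 16, 25]

largest_gap = max(abs(x - y) for (x, y) in zip(values[1:], values[:-1]))
print(largest_gap)  # 25 - 16 = 9
```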
Monday, August 2, 2021

54

The shortest distance between two skew lines (lines which don't intersect) is the distance of the line which is perpendicular to both of them.

If we have a line l1 with known points p1 and p2, and a line l2 with known points p3 and p4:

```
The direction vector of l1 is p2-p1, or d1.
The direction vector of l2 is p4-p3, or d2.
```

We therefore know that the vector we are looking for, v, is perpendicular to both of these direction vectors:

```
d1.v = 0 & d2.v = 0
```

Or, if you prefer:

```
d1x*vx + d1y*vy + d1z*vz = 0
```

And the same for d2.

Let's take the point on the lines l1, l2 where v is actually perpendicular to the direction. We'll call these two points i1 and i2 respectively.

```
Since i1 lies on l1, we can say that i1 = p1 + m*d1, where m is some number.
Similarly, i2 = p3 + n*d2, where n is another number.
```

Since v is the vector between i1 and i2 (by definition) we get that v = i2 - i1.

This gives the substitutions for the x,y,z vectors of v:

```
vx = i2x - i1x = (p3x + n*d2x) - (p1x + m*d1x)
```

and so on.

Which you can now substitute back into your dot product equation:

```
d1x * ( (p3x + n*d2x) - (p1x + m*d1x) ) + ... = 0
```

This has reduced our number of equations to 2 (the two dot product equations) with two unknowns (m and n), so you can now solve them!

Once you have m and n, you can find the coordinates by going back to the original calculation of i1 and i2.

If you only wanted the shortest distance for points on the segment between p1-p2 and p3-p4, you can clamp i1 and i2 between these ranges of coordinates, since the shortest distance will always be as close to the perpendicular as possible.
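The derivation above can be sketched in plain Python (names are illustrative). Writing r = p3 - p1, the two dot-product equations become a 2x2 linear system in m and n, which Cramer's rule solves directly:

```python
# Closest points between two skew lines, following the derivation above.
# l1 passes through p1 and p2; l2 passes through p3 and p4.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def scale(u, s):
    return tuple(a * s for a in u)

def closest_points(p1, p2, p3, p4):
    d1 = sub(p2, p1)            # direction vector of l1
    d2 = sub(p4, p3)            # direction vector of l2
    r = sub(p3, p1)

    # d1.v = 0 and d2.v = 0 with v = r + n*d2 - m*d1 give:
    #   a*m - b*n = e
    #   b*m - c*n = f
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    e, f = dot(d1, r), dot(d2, r)

    denom = a * c - b * b       # zero only when the lines are parallel
    m = (c * e - b * f) / denom
    n = (b * e - a * f) / denom

    i1 = add(p1, scale(d1, m))  # point on l1 closest to l2
    i2 = add(p3, scale(d2, n))  # point on l2 closest to l1
    return i1, i2

# The x-axis vs a line parallel to the y-axis, offset by 3 in z:
i1, i2 = closest_points((0, 0, 0), (1, 0, 0), (1, 2, 3), (1, 3, 3))
print(i1, i2)  # i1 = (1, 0, 0), i2 = (1, 0, 3): shortest distance is 3
```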

Sunday, August 15, 2021

32

Iterating in Python can be quite slow. It's always best to "vectorise" and use numpy operations on arrays as much as possible; that passes the work to numpy's fast low-level implementation.

`cosine_similarity` is already vectorised. An ideal solution would therefore simply be `cosine_similarity(A, B)` where A and B are your first and second arrays. Unfortunately the resulting matrix is 500,000 by 160,000, which is too large to hold in memory (it throws an error).

The next best solution then is to split A (by rows) into large blocks (instead of individual rows) so that the result still fits in memory, and iterate over them. I find for your data that using 100 rows in each block fits in memory; much more and it doesn't work. Then we simply use `.max` and get our 100 maxes for each iteration, which we can collect together at the end.

This way strongly suggests we do an additional time save, though. The formula for the cosine similarity of two vectors is u.v / |u||v|, and it is the cosine of the angle between the two. Because we're iterating, we keep recalculating the lengths of the rows of B each time and throwing the result away. A nice way around this is to use the fact that cosine similarity does not vary if you scale the vectors (the angle is the same). So we can calculate all the row lengths only once and divide by them to make the rows unit vectors. And then we calculate the cosine similarity simply as u.v, which can be done for arrays via matrix multiplication. I did a quick test of this and it was about 3 times faster.
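The scale-invariance claim is easy to check numerically (a small sketch; `u` and `v` are just arbitrary example vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100)
v = rng.random(100)

# Direct cosine similarity: u.v / (|u||v|)
direct = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Normalise to unit vectors first; a plain dot product then gives the same number.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
via_unit = u_hat.dot(v_hat)

print(np.isclose(direct, via_unit))  # True
```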

Putting it all together:

```
import numpy as np

# Example data
A = np.random.random([500000, 100])
B = np.random.random([160000, 100])

# There may be a dedicated numpy routine for this, but it won't be much faster.
def normalise(A):
    lengths = (A**2).sum(axis=1, keepdims=True)**.5
    return A / lengths

A = normalise(A)
B = normalise(B)

results = []

rows_in_slice = 100

slice_start = 0
slice_end = slice_start + rows_in_slice

# Compare against the number of rows, A.shape[0], not the shape tuple.
# Slicing past the end is safe in numpy, so a final partial block is handled too.
while slice_start < A.shape[0]:

    results.append(A[slice_start:slice_end].dot(B.T).max(axis=1))

    slice_start += rows_in_slice
    slice_end = slice_start + rows_in_slice

result = np.concatenate(results)
```

This takes me about 2 seconds per 1,000 rows of A to run. So it should be about 1,000 seconds for your data.

Tuesday, November 2, 2021