Asked  6 Months ago    Answers:  5   Viewed   19 times

In the pyplot document for scatter plot:

matplotlib.pyplot.scatter(x, y, s=20, c='b', marker='o', cmap=None, norm=None,
                          vmin=None, vmax=None, alpha=None, linewidths=None,
                          faceted=True, verts=None, hold=None, **kwargs)

The marker size

s: size in points^2. It is a scalar or an array of the same length as x and y.

What kind of unit is points^2? What does it mean? Does s=100 mean 10 pixel x 10 pixel?

Basically I'm trying to make scatter plots with different marker sizes, and I want to figure out what the s number means.

 Answers

33

This can be a somewhat confusing way of defining the size, but you are basically specifying the area of the marker. This means that to double the width (or height) of the marker you need to increase s by a factor of 4 [because A = W*H, so (2W)(2H) = 4A].

There is a reason, however, that the size of markers is defined in this way. Because of the scaling of area as the square of width, doubling the width actually appears to increase the size by more than a factor 2 (in fact it increases it by a factor of 4). To see this consider the following two examples and the output they produce.

import matplotlib.pyplot as plt

# doubling the width of markers
x = [0,2,4,6,8,10]
y = [0]*len(x)
s = [20*4**n for n in range(len(x))]
plt.scatter(x,y,s=s)
plt.show()

This gives a row of markers, each twice as wide as the previous one.

Notice how the size increases very quickly. Compare this with doubling the area at each step:

# doubling the area of markers
x = [0,2,4,6,8,10]
y = [0]*len(x)
s = [20*2**n for n in range(len(x))]
plt.scatter(x,y,s=s)
plt.show()

This gives a row of markers, each with twice the area of the previous one.

Now the apparent size of the markers increases roughly linearly in an intuitive fashion.

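As for what a 'point' is: it is the typographic point, 1/72 of an inch, so the marker size in pixels depends on the figure's dpi (s=100 is roughly 10 pixels wide only at 72 dpi). For plotting purposes you rarely need the exact value; you can just scale all of your sizes by a constant until they look reasonable.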
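To make the units concrete: since a point is 1/72 inch, a marker of area s points^2 has a nominal width of sqrt(s) points, and its width in pixels follows from the figure dpi. A minimal sketch of that arithmetic; marker_width_px is a hypothetical helper name for this example, not part of matplotlib:

```python
import math

def marker_width_px(s, dpi=100.0):
    """Nominal marker width in pixels for a scatter size s given in points^2.

    Hypothetical helper for illustration; assumes 1 point = 1/72 inch.
    """
    width_pt = math.sqrt(s)        # width in points
    return width_pt * dpi / 72.0   # points -> inches -> pixels

print(marker_width_px(100, dpi=72))   # 10.0
print(marker_width_px(100, dpi=144))  # 20.0
print(marker_width_px(400, dpi=72))   # 20.0
```

So s=100 corresponds to about 10 pixels across only when the figure is rendered at 72 dpi, and quadrupling s doubles the width.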

Hope this helps!

Edit: (In response to comment from @Emma)

It's probably confusing wording on my part. The question asked about doubling the width of a circle, so in the first picture each circle (moving from left to right) has double the width of the previous one; for the area this is an exponential with base 4. Similarly, in the second example each circle has double the area of the previous one, which gives an exponential with base 2.

However, it is in the second example (where we are scaling area) that each step appears to make the circle twice as big to the eye. Thus, if we want a circle to appear a factor of n bigger, we should increase the area by a factor of n, not the radius, so that the apparent size scales linearly with the area.
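One way to apply this, sketched under the assumption that you want marker area proportional to a positive data value: scale s linearly with the value, not with its square. The names sizes_from_values and base_size are made up for this example.

```python
import numpy as np

def sizes_from_values(values, base_size=20.0):
    """Return scatter sizes (points^2) whose area is proportional to values.

    Hypothetical helper: the smallest value maps to base_size.
    Assumes all values are positive.
    """
    values = np.asarray(values, dtype=float)
    return base_size * values / values.min()

s = sizes_from_values([1, 2, 4, 8])
print(s.tolist())  # [20.0, 40.0, 80.0, 160.0]
```

The result can be passed directly as plt.scatter(x, y, s=s), so a value twice as large gets a marker with twice the area.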

Edit to visualize the comment by @TomaszGandor:

This is what it looks like for different functions of the marker size:

Exponential, Square, or Linear size

import matplotlib.pyplot as plt

x = [0,2,4,6,8,10,12,14,16,18]
s_exp = [20*2**n for n in range(len(x))]
s_square = [20*n**2 for n in range(len(x))]
s_linear = [20*n for n in range(len(x))]
plt.scatter(x,[1]*len(x),s=s_exp, label='$s=2^n$', lw=1)
plt.scatter(x,[0]*len(x),s=s_square, label='$s=n^2$')
plt.scatter(x,[-1]*len(x),s=s_linear, label='$s=n$')
plt.ylim(-1.5,1.5)
plt.legend(loc='center left', bbox_to_anchor=(1.1, 0.5), labelspacing=3)
plt.show()
Tuesday, June 1, 2021
 
aslum
answered 6 Months ago
67

Basically, you want a density estimate of some sort. There are multiple ways to do this:

  1. Use a 2D histogram of some sort (e.g. matplotlib.pyplot.hist2d or matplotlib.pyplot.hexbin). You could also display the results as contours: just use numpy.histogram2d and then contour the resulting array.

  2. Make a kernel-density estimate (KDE) and contour the results. A KDE is essentially a smoothed histogram. Instead of a point falling into a particular bin, it adds a weight to surrounding bins (usually in the shape of a gaussian "bell curve").

Using a 2D histogram is simple and easy to understand, but fundamentally gives "blocky" results.

There are some wrinkles to doing the second one "correctly" (i.e. there's no one correct way). I won't go into the details here, but if you want to interpret the results statistically, you need to read up on it (particularly the bandwidth selection).

At any rate, here's an example of the differences. I'm going to plot each one similarly, so I won't use contours, but you could just as easily plot the 2D histogram or gaussian KDE using a contour plot:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

np.random.seed(1977)

# Generate 200 correlated x,y points
data = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)
x, y = data.T

nbins = 20

fig, axes = plt.subplots(ncols=2, nrows=2, sharex=True, sharey=True)

axes[0, 0].set_title('Scatterplot')
axes[0, 0].plot(x, y, 'ko')

axes[0, 1].set_title('Hexbin plot')
axes[0, 1].hexbin(x, y, gridsize=nbins)

axes[1, 0].set_title('2D Histogram')
axes[1, 0].hist2d(x, y, bins=nbins)

# Evaluate a gaussian kde on a regular grid of nbins x nbins over data extents
k = gaussian_kde(data.T)
xi, yi = np.mgrid[x.min():x.max():nbins*1j, y.min():y.max():nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))

axes[1, 1].set_title('Gaussian KDE')
axes[1, 1].pcolormesh(xi, yi, zi.reshape(xi.shape))

fig.tight_layout()
plt.show()

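The contour variant mentioned for the 2D histogram could look roughly like this. It is a sketch only: it regenerates a similar random sample with NumPy's newer Generator API, and the variable names are chosen for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1977)
# Same kind of correlated sample as in the example above
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 200)
x, y = data.T

nbins = 20
counts, xedges, yedges = np.histogram2d(x, y, bins=nbins)

# contour expects coordinates of the cell centers, not the bin edges
xcenters = 0.5 * (xedges[:-1] + xedges[1:])
ycenters = 0.5 * (yedges[:-1] + yedges[1:])

fig, ax = plt.subplots()
# histogram2d returns counts indexed [x, y]; contour wants [y, x], hence .T
ax.contour(xcenters, ycenters, counts.T)
ax.set_title('Contoured 2D histogram')
```

Call plt.show() afterwards to display the figure. With only 200 points the contours will look ragged; fewer bins or more data smooths them out.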

One caveat: With very large numbers of points, scipy.stats.gaussian_kde will become very slow. It's fairly easy to speed it up by making an approximation: just take the 2D histogram and blur it with a Gaussian filter of the right radius and covariance. I can give an example if you'd like.
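A minimal sketch of that approximation, assuming scipy.ndimage.gaussian_filter as the blur. The function name fast_density and the value of sigma are illustrative choices, not a tuned bandwidth, and a single sigma in bin units is a simplification of "the right radius and covariance":

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fast_density(x, y, nbins=100, sigma=2.0):
    """Approximate a KDE by blurring a normalized 2D histogram.

    sigma is the blur width in units of bins; it is an assumed, hand-picked
    value here and plays the role of the KDE bandwidth.
    """
    counts, xedges, yedges = np.histogram2d(x, y, bins=nbins, density=True)
    smoothed = gaussian_filter(counts, sigma=sigma)
    return smoothed, xedges, yedges

rng = np.random.default_rng(1977)
data = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 3]], 100_000)
density, xedges, yedges = fast_density(data[:, 0], data[:, 1])
```

The result can be drawn with plt.pcolormesh(xedges, yedges, density.T). The cost is dominated by the histogram, so it grows only linearly with the number of points.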

One other caveat: If you're doing this in a non-cartesian coordinate system, none of these methods apply! Getting density estimates on a spherical shell is a bit more complicated.

Tuesday, July 27, 2021
 
relyt
answered 4 Months ago
73

Where would this bx be passed into?

You ought to repeat the second call to plot, not the first, so there is no need for bx.

In detail: DataFrame.plot takes an optional ax argument. This is the axes it draws into. If the argument is not provided, the function creates a new plot and axes. In addition, the axes is returned by the function so it can be reused for further drawing operations. The idea is to omit the ax argument in the first call to plot and pass the returned axes to all subsequent calls.

You can verify that each call to plot returns the same axes that it got passed:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 6), columns=['a', 'b', 'c', 'd', 'e', 'f'])


ax1 = df.plot(kind='scatter', x='a', y='b', color='r')    
ax2 = df.plot(kind='scatter', x='c', y='d', color='g', ax=ax1)    
ax3 = df.plot(kind='scatter', x='e', y='f', color='b', ax=ax1)

print(ax1 == ax2 == ax3)  # True


Also, if the plot is the same graph, shouldn't the x-axis be consistently either 'a' or 'c'?

Not necessarily. Whether it makes sense to put different columns on the same axes depends on what data they represent. For example, if a were income and c were expenditures, it would make sense to put both on the same 'money' axis. In contrast, if a were number of peas and c were voltage, they should probably not be on the same axis.

Saturday, July 31, 2021
 
Ryan Stewart
answered 4 Months ago
42

I found a way to do it for anyone who stumbles on this anyways.

We need to replace the following line from the OP:

plt.axhline(y=0.002, xmin=0, xmax=1, hold=None)

We replace it with:

ax1.axhline(y=0.002,xmin=0,xmax=3,c="blue",linewidth=0.5,zorder=0)
ax2.axhline(y=0.002,xmin=0,xmax=3,c="blue",linewidth=0.5,zorder=0)
ax3.axhline(y=0.002,xmin=0,xmax=3,c="blue",linewidth=0.5,zorder=0)

This draws the same horizontal line at y=0.002 on each of the three axes.
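If the axes come from plt.subplots, the repeated calls can be written as a loop. This is a sketch under the assumption of three side-by-side axes standing in for ax1, ax2 and ax3; it also leaves xmin and xmax at their defaults, since axhline interprets them as fractions of the axes width (0 to 1).

```python
import matplotlib.pyplot as plt

# Assumed layout: three side-by-side axes standing in for ax1, ax2, ax3
fig, axes = plt.subplots(ncols=3, sharey=True)

for ax in axes:
    # Default xmin=0 and xmax=1 make the line span the full width of each panel
    ax.axhline(y=0.002, c="blue", linewidth=0.5, zorder=0)
```

Call plt.show() afterwards to display the figure.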

Sunday, August 1, 2021
 
Gilko
answered 4 Months ago
14

Found it!

import matplotlib.pyplot as plt
from matplotlib.legend_handler import HandlerTuple
import numpy as np
group1 = np.array([[1,4,6],[3,2,5]])
group2 = np.array([[1,5,9],[2,2,5]])
group3 = np.array([[1,4,2],[11,2,7]])
# The format string already sets color and marker, so no separate marker= is needed
a, = plt.plot(group1[0,:], group1[1,:], 'r^')
b, = plt.plot(group2[0,:], group2[1,:], 'bo')
c, = plt.plot(group3[0,:], group3[1,:], 'gs')
plt.legend([(a,b,c)], ['groupdata'], numpoints=1, handler_map={tuple: HandlerTuple(ndivide=None)})
plt.show()


Thanks very much to anyone that at least tried to help!

Update: Something that I found useful: if you want to add more than one entry, pass one tuple of handles per legend entry:

plt.legend([(a,b),(c,)], ['groupdata1', 'groupdata2'], numpoints=1, handler_map={tuple: HandlerTuple(ndivide=None)})
Friday, August 6, 2021
 
Dance Party2
answered 4 Months ago