Asked  6 Months ago    Answers:  5   Viewed   59 times

The contents of this post were originally meant to be a part of Pandas Merging 101, but due to the nature and size of the content required to fully do justice to this topic, it has been moved to its own QnA.

Given two simple DataFrames:

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

left

  col1  col2
0    A     1
1    B     2
2    C     3

right

  col1  col2
0    X    20
1    Y    30
2    Z    50

The cross product of these frames can be computed, and will look something like:

A       1      X      20
A       1      Y      30
A       1      Z      50
B       2      X      20
B       2      Y      30
B       2      Z      50
C       3      X      20
C       3      Y      30
C       3      Z      50

What is the most performant method of computing this result?

 Answers

26

Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:

# pandas <= 1.1.X
def cartesian_product_basic(left, right):
    # Join on a constant temporary column, then discard it.
    return (
       left.assign(key=1).merge(right.assign(key=1), on='key').drop(columns='key'))

cartesian_product_basic(left, right)

# pandas >= 1.2
left.merge(right, how="cross")

  col1_x  col2_x col1_y  col2_y
0      A       1      X      20
1      A       1      Y      30
2      A       1      Z      50
3      B       2      X      20
4      B       2      Y      30
5      B       2      Z      50
6      C       3      X      20
7      C       3      Y      30
8      C       3      Z      50

How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
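To make the mechanics concrete, here is a minimal sketch that checks the key-column trick against the built-in cross merge. It assumes pandas 1.2 or later (for how="cross"); the names via_key and via_cross are only illustrative.

```python
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

# Every row carries key == 1, so each left row matches every right row.
via_key = (left.assign(key=1)
               .merge(right.assign(key=1), on='key')
               .drop(columns='key'))

# Built-in equivalent (assumes pandas >= 1.2).
via_cross = left.merge(right, how='cross')

# The result has len(left) * len(right) rows.
print(len(via_key), len(left) * len(right))
# Both approaches should produce the same values in the same order.
print((via_key.values == via_cross.values).all())
```

Overlapping column names pick up the default _x and _y suffixes in both cases.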

While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.

A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is @senderle's first implementation.

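import numpy as np

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    # One axis per input array, plus a trailing axis of length la for the tuple.
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    # np.ix_ reshapes each input so that it broadcasts along its own axis.
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)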
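If the loop over np.ix_ looks opaque, the following sketch unrolls it by hand for two small inputs. The arrays a and b and the intermediate names are made up for illustration; the steps mirror the function above.

```python
import numpy as np

a, b = np.arange(2), np.arange(3)

# np.ix_ gives one "open mesh" array per input: a varies along axis 0,
# b varies along axis 1.
ga, gb = np.ix_(a, b)
print(ga.shape, gb.shape)  # (2, 1) (1, 3)

# Result buffer: one cell per (a, b) pair, with a trailing axis of length 2.
arr = np.empty((len(a), len(b), 2), dtype=np.result_type(a, b))
arr[..., 0] = ga  # broadcasts a across the b axis
arr[..., 1] = gb  # broadcasts b across the a axis

pairs = arr.reshape(-1, 2)
print(pairs.tolist())
# [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]]
```

The last input varies fastest, which is the same row order the merge-based solution produces.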

Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames

Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your own risk!

This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and column-stack the reindexed values.

def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True
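One consequence of the disclaimer: because .values on a frame that mixes strings and integers is an object array, the stacked result has object dtype in every column and loses the column names. If that matters, a possible workaround is sketched below. Note that it swaps in np.repeat/np.tile and positional .iloc selection instead of cartesian_product, and the function name and the _x/_y suffixes are my own choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

def cartesian_product_keep_dtypes(left, right):  # hypothetical helper name
    la, lb = len(left), len(right)
    ia = np.repeat(np.arange(la), lb)  # 0,0,0,1,1,1,... (left row positions)
    ib = np.tile(np.arange(lb), la)    # 0,1,2,0,1,2,... (right row positions)
    # Positional selection keeps each column's dtype; suffixes avoid name clashes.
    return pd.concat(
        [left.iloc[ia].reset_index(drop=True).add_suffix('_x'),
         right.iloc[ib].reset_index(drop=True).add_suffix('_y')],
        axis=1)

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

out = cartesian_product_keep_dtypes(left, right)
print(out.dtypes)
```

This is likely slower than the pure NumPy version, so treat it as a trade-off rather than a replacement.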

And, along similar lines,

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']
    

left2
   col1  col2
s1    A     1
s2    B     2
s1    C     3

right2
  col1  col2
x    X    20
y    Y    30
y    Z    50

np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True

This solution can generalise to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

   0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  Y  30  A  1
4  A  1  Y  30  B  2

Further Simplification

A simpler solution not involving @senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left2, right2),
               cartesian_product_basic(left2, right2))
True
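For anyone unsure what np.ogrid and np.broadcast_arrays contribute here, this small sketch prints the row positions they generate. The sizes 2 and 3 are arbitrary.

```python
import numpy as np

la, lb = 2, 3
ia, ib = np.ogrid[:la, :lb]             # shapes (2, 1) and (1, 3)
ia2, ib2 = np.broadcast_arrays(ia, ib)  # both now have shape (2, 3)

# Flattening gives the row positions to take from each DataFrame.
print(ia2.ravel().tolist())  # [0, 0, 0, 1, 1, 1]
print(ib2.ravel().tolist())  # [0, 1, 2, 0, 1, 2]
```

Each left position is repeated once per right row, while the right positions cycle, which is exactly the pairing a cross join needs.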

Performance Comparison

Benchmarking these solutions on some contrived DataFrames with unique indices, we have

[Figure: log-log plot of relative running time versus N for each function]

Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.

Performance Benchmarking Code
This is the timing script. All functions called here are defined above.

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['cartesian_product_basic', 'cartesian_product_generalized', 
              'cartesian_product_multi', 'cartesian_product_simplified'],
       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        # print(f,c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()


Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

  • Merging basics - basic types of joins

  • Index-based joins

  • Generalizing to multiple DataFrames

  • Cross join *

* you are here

Tuesday, June 1, 2021
 
huhushow
answered 6 Months ago
95

You can do this pretty straightforwardly with an implicit class and a for-comprehension in Scala 2.10:

implicit class Crossable[X](xs: Traversable[X]) {
  def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
}

val xs = Seq(1, 2)
val ys = List("hello", "world", "bye")

And now:

scala> xs cross ys
res0: Traversable[(Int, String)] = List((1,hello), (1,world), ...

This is possible before 2.10—just not quite as concise, since you'd need to define both the class and an implicit conversion method.

You can also write this:

scala> xs cross ys cross List('a, 'b)
res2: Traversable[((Int, String), Symbol)] = List(((1,hello),'a), ...

If you want xs cross ys cross zs to return a Tuple3, however, you'll need either a lot of boilerplate or a library like Shapeless.

Wednesday, June 23, 2021
 
cbcp
answered 6 Months ago
42

I'm not sure it has a name, but you can use a Where clause to filter out those matching values.

string[][] arrayOfArrays =
    array1.SelectMany(left => array1, (left, right) => new string[] { left, right })
          .Where(x => x[0] != x[1])
          .ToArray();
Thursday, August 26, 2021
 
TheFrack
answered 3 Months ago
82

Demo:

In [280]: A
Out[280]:
time
2017-09-01 01:00:00    0.5
2017-09-01 02:00:00    0.4
Name: val, dtype: float64

In [281]: B
Out[281]:
time
2017-09-01 00:00:00         NaN
2017-09-01 00:03:00   -0.000350
2017-09-01 00:06:00    0.000401
Name: val, dtype: float64

In [282]: B.to_frame('B').join(A.to_frame('A').set_index(A.index.shift(-1, freq='H')).resample('3T').ffill())
Out[282]:
                            B    A
time
2017-09-01 00:00:00       NaN  0.5
2017-09-01 00:03:00 -0.000350  0.5
2017-09-01 00:06:00  0.000401  0.5
Saturday, August 28, 2021
 
hohner
answered 3 Months ago
71

You need to divide this problem into parts:

  1. Find the corresponding close indices
  2. Join the DataFrames on those indices
  3. Do your extra calculations

Find the indices

Using np.isclose, this simple generator function yields, for each row of df1, a DataFrame pairing that row's index with the indices of the df2 rows whose values are close to it.

def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    for index, value in df1[df1_col].items():
        indices = df2.index[np.isclose(df2[df2_col].values, value, atol=tolerance)]
        s = pd.DataFrame(data={'idx1': index, 'idx2': indices.values})
        yield s

Then we can easily concatenate these to get a helper DataFrame containing the matching index pairs.

df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)

To test this I added a 2nd record to df1

df1_str = '''Index, Col0, Col1
0, 1008.5155, n01
1, 510, n03'''
  idx1    idx2
0 0   1
1 0   2
2 1   0

Join the DataFrames

Using pd.merge:

df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)
  Col0_x  Col1_x  Col0_y  Col1_y  Col2    Col3    Col4    Col5    Col6    ...
0 1008.5155   n01 0   0   1007.6176   k13 0   k15 k16 ...
1 1008.5155   n01 0   0   1008.6248   k123    0   k25 k26 ...
2 510.0   n03 0   0   510.0103    k03 0   k05 k06 ...

Do the extra calculations

You'll need to rename a few columns and assign the diff between them, but that should be trivial.
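As a rough end-to-end sketch of the three steps, the snippet below runs them on two small made-up frames. The data is hypothetical (only the roles of Col0 and Col2 follow the answer), and the column names are chosen so they do not clash, which means no renaming is needed here. The diff column name is also my own.

```python
import numpy as np
import pandas as pd

# Hypothetical frames; only the roles of Col0 and Col2 follow the answer.
df1 = pd.DataFrame({'Col0': [1008.5155, 510.0], 'Col1': ['n01', 'n03']})
df2 = pd.DataFrame({'Col2': [510.0103, 1007.6176, 1008.6248],
                    'Col3': ['k03', 'k13', 'k123']})

def find_close(df1, df1_col, df2, df2_col, tolerance=1):
    # For each df1 row, yield the (idx1, idx2) pairs whose values are close.
    for index, value in df1[df1_col].items():
        indices = df2.index[np.isclose(df2[df2_col].values, value, atol=tolerance)]
        yield pd.DataFrame({'idx1': index, 'idx2': indices.values})

# Step 1: helper frame of matching index pairs.
df_idx = pd.concat(find_close(df1, 'Col0', df2, 'Col2'), ignore_index=True)

# Step 2: pull the matching rows from each frame and join them side by side.
df1_close = pd.merge(df_idx, df1, left_on='idx1', right_index=True).reindex(columns=df1.columns)
df2_close = pd.merge(df_idx, df2, left_on='idx2', right_index=True).reindex(columns=df2.columns)
df_merged = pd.merge(df1_close, df2_close, left_index=True, right_index=True)

# Step 3: the extra calculation, here the absolute difference of the matched values.
df_merged['diff'] = (df_merged['Col0'] - df_merged['Col2']).abs()
print(df_merged.sort_values('diff'))
```

Keep in mind that np.isclose also applies its default relative tolerance on top of atol, so matches slightly beyond the absolute tolerance can slip through.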

Saturday, October 23, 2021
 
Nayan
answered 1 Month ago