Asked  7 Months ago    Answers:  5   Viewed   43 times

I wanted to compare reading lines of string input from stdin using Python and C++ and was shocked to see my C++ code run an order of magnitude slower than the equivalent Python code. Since my C++ is rusty and I'm not yet an expert Pythonista, please tell me if I'm doing something wrong or if I'm misunderstanding something.


(TLDR answer: include the statement: cin.sync_with_stdio(false) or just use fgets instead.

TLDR results: scroll all the way down to the bottom of my question and look at the table.)


C++ code:

#include <iostream>
#include <time.h>

using namespace std;

int main() {
    string input_line;
    long line_count = 0;
    time_t start = time(NULL);
    int sec;
    int lps;

    while (cin) {
        getline(cin, input_line);
        if (!cin.eof())
            line_count++;
    };

    sec = (int) time(NULL) - start;
    cerr << "Read " << line_count << " lines in " << sec << " seconds.";
    if (sec > 0) {
        lps = line_count / sec;
        cerr << " LPS: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

// Compiled with:
// g++ -O3 -o readline_test_cpp foo.cpp

Python Equivalent:

#!/usr/bin/env python
import time
import sys

count = 0
start = time.time()

for line in  sys.stdin:
    count += 1

delta_sec = int(time.time() - start_time)
if delta_sec >= 0:
    lines_per_sec = int(round(count/delta_sec))
    print("Read {0} lines in {1} seconds. LPS: {2}".format(count, delta_sec,
       lines_per_sec))

Here are my results:

$ cat test_lines | ./readline_test_cpp
Read 5570000 lines in 9 seconds. LPS: 618889

$ cat test_lines | ./readline_test.py
Read 5570000 lines in 1 seconds. LPS: 5570000

I should note that I tried this both under Mac OS X v10.6.8 (Snow Leopard) and Linux 2.6.32 (Red Hat Linux 6.2). The former is a MacBook Pro, and the latter is a very beefy server, not that this is too pertinent.

$ for i in {1..5}; do echo "Test run $i at `date`"; echo -n "CPP:"; cat test_lines | ./readline_test_cpp ; echo -n "Python:"; cat test_lines | ./readline_test.py ; done
Test run 1 at Mon Feb 20 21:29:28 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 2 at Mon Feb 20 21:29:39 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 3 at Mon Feb 20 21:29:50 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 4 at Mon Feb 20 21:30:01 EST 2012
CPP:   Read 5570001 lines in 9 seconds. LPS: 618889
Python:Read 5570000 lines in 1 seconds. LPS: 5570000
Test run 5 at Mon Feb 20 21:30:11 EST 2012
CPP:   Read 5570001 lines in 10 seconds. LPS: 557000
Python:Read 5570000 lines in  1 seconds. LPS: 5570000

Tiny benchmark addendum and recap

For completeness, I thought I'd update the read speed for the same file on the same box with the original (synced) C++ code. Again, this is for a 100M line file on a fast disk. Here's the comparison, with several solutions/approaches:

Implementation Lines per second
python (default) 3,571,428
cin (default/naive) 819,672
cin (no sync) 12,500,000
fgets 14,285,714
wc (not fair comparison) 54,644,808

 Answers

46

tl;dr: Because of different default settings in C++ requiring more system calls.

By default, cin is synchronized with stdio, which causes it to avoid any input buffering. If you add this to the top of your main, you should see much better performance:

std::ios_base::sync_with_stdio(false);

Normally, when an input stream is buffered, instead of reading one character at a time, the stream will be read in larger chunks. This reduces the number of system calls, which are typically relatively expensive. However, since the FILE* based stdio and iostreams often have separate implementations and therefore separate buffers, this could lead to a problem if both were used together. For example:

int myvalue1;
cin >> myvalue1;
int myvalue2;
scanf("%d",&myvalue2);

If more input was read by cin than it actually needed, then the second integer value wouldn't be available for the scanf function, which has its own independent buffer. This would lead to unexpected results.

To avoid this, by default, streams are synchronized with stdio. One common way to achieve this is to have cin read each character one at a time as needed using stdio functions. Unfortunately, this introduces a lot of overhead. For small amounts of input, this isn't a big problem, but when you are reading millions of lines, the performance penalty is significant.

Fortunately, the library designers decided that you should also be able to disable this feature to get improved performance if you knew what you were doing, so they provided the sync_with_stdio method.

Tuesday, June 1, 2021
 
muffe
answered 7 Months ago
40

I've run your benchmark. There is no difference between C++ and numpy on my machine:

woltan's benchmark

Do you think my approach is fair, or are there some unnecessary overheads I can avoid?

It seems fair due to there is no difference in results.

Would you expect that the result would show such a huge discrepancy between the c++ and python approach? Both are using shared objects for their calculations.

No.

Since I would rather use python for my program, what could I do to increase the performance when calling BLAS or LAPACK routines?

Make sure that numpy uses optimized version of BLAS/LAPACK libraries on your system.

Friday, June 11, 2021
 
LukeP
answered 6 Months ago
90

Why?

why is the C version so much faster?

A PostgreSQL array is its self a pretty inefficient data structure. It can contain any data type and it's capable of being multi-dimensional, so lots of optimisations are just not possible. However, as you've seen it's possible to work with the same array much faster in C.

That's because array access in C can avoid a lot of the repeated work involved in PL/PgSQL array access. Just take a look at src/backend/utils/adt/arrayfuncs.c, array_ref. Now look at how it's invoked from src/backend/executor/execQual.c in ExecEvalArrayRef. Which runs for each individual array access from PL/PgSQL, as you can see by attaching gdb to the pid found from select pg_backend_pid(), setting a breakpoint at ExecEvalArrayRef, continuing, and running your function.

More importantly, in PL/PgSQL every statement you execute is run through the query executor machinery. This makes small, cheap statements fairly slow even allowing for the fact that they're pre-prepared. Something like:

a := b + c

is actually executed by PL/PgSQL more like:

SELECT b + c INTO a;

You can observe this if you turn debug levels high enough, attach a debugger and break at a suitable point, or use the auto_explain module with nested statement analysis. To give you an idea of how much overhead this imposes when you're running lots of tiny simple statements (like array accesses), take a look at this example backtrace and my notes on it.

There is also a significant start-up overhead to each PL/PgSQL function invocation. It isn't huge, but it's enough to add up when it's being used as an aggregate.

A faster approach in C

In your case I would probably do it in C, as you have done, but I'd avoid copying the array when called as an aggregate. You can check for whether it's being invoked in aggregate context:

if (AggCheckCallContext(fcinfo, NULL))

and if so, use the original value as a mutable placeholder, modifying it then returning it instead of allocating a new one. I'll write a demo to verify that this is possible with arrays shortly... (update) or not-so-shortly, I forgot how absolute horrible working with PostgreSQL arrays in C is. Here we go:

// append to contrib/intarray/_int_op.c

PG_FUNCTION_INFO_V1(add_intarray_cols);
Datum           add_intarray_cols(PG_FUNCTION_ARGS);

Datum
add_intarray_cols(PG_FUNCTION_ARGS)
{
    ArrayType  *a,
           *b;

    int i, n;

    int *da,
        *db;

    if (PG_ARGISNULL(1))
        ereport(ERROR, (errmsg("Second operand must be non-null")));
    b = PG_GETARG_ARRAYTYPE_P(1);
    CHECKARRVALID(b);

    if (AggCheckCallContext(fcinfo, NULL))
    {
        // Called in aggregate context...
        if (PG_ARGISNULL(0))
            // ... for the first time in a run, so the state in the 1st
            // argument is null. Create a state-holder array by copying the
            // second input array and return it.
            PG_RETURN_POINTER(copy_intArrayType(b));
        else
            // ... for a later invocation in the same run, so we'll modify
            // the state array directly.
            a = PG_GETARG_ARRAYTYPE_P(0);
    }
    else 
    {
        // Not in aggregate context
        if (PG_ARGISNULL(0))
            ereport(ERROR, (errmsg("First operand must be non-null")));
        // Copy 'a' for our result. We'll then add 'b' to it.
        a = PG_GETARG_ARRAYTYPE_P_COPY(0);
        CHECKARRVALID(a);
    }

    // This requirement could probably be lifted pretty easily:
    if (ARR_NDIM(a) != 1 || ARR_NDIM(b) != 1)
        ereport(ERROR, (errmsg("One-dimesional arrays are required")));

    // ... as could this by assuming the un-even ends are zero, but it'd be a
    // little ickier.
    n = (ARR_DIMS(a))[0];
    if (n != (ARR_DIMS(b))[0])
        ereport(ERROR, (errmsg("Arrays are of different lengths")));

    da = ARRPTR(a);
    db = ARRPTR(b);
    for (i = 0; i < n; i++)
    {
            // Fails to check for integer overflow. You should add that.
        *da = *da + *db;
        da++;
        db++;
    }

    PG_RETURN_POINTER(a);
}

and append this to contrib/intarray/intarray--1.0.sql:

CREATE FUNCTION add_intarray_cols(_int4, _int4) RETURNS _int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE;

CREATE AGGREGATE sum_intarray_cols(_int4) (sfunc = add_intarray_cols, stype=_int4);

(more correctly you'd create intarray--1.1.sql and intarray--1.0--1.1.sql and update intarray.control. This is just a quick hack.)

Use:

make USE_PGXS=1
make USE_PGXS=1 install

to compile and install.

Now DROP EXTENSION intarray; (if you already have it) and CREATE EXTENSION intarray;.

You'll now have the aggregate function sum_intarray_cols available to you (like your sum(int4[]), as well as the two-operand add_intarray_cols (like your array_add).

By specializing in integer arrays a whole bunch of complexity goes away. A bunch of copying is avoided in the aggregate case, since we can safely modify the "state" array (the first argument) in-place. To keep things consistent, in the case of non-aggregate invocation we get a copy of the first argument so we can still work with it in-place and return it.

This approach could be generalised to support any data type by using the fmgr cache to look up the add function for the type(s) of interest, etc. I'm not particularly interested in doing that, so if you need it (say, to sum columns of NUMERIC arrays) then ... have fun.

Similarly, if you need to handle dissimilar array lengths, you can probably work out what to do from the above.

Wednesday, July 28, 2021
 
Zach
answered 4 Months ago
30

Looks like you're I/O bound:

In computer science, I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed. This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.

You probably need to have your main thread do the reading and add the data to the pool when a subprocess becomes available. This will be different to using map.

As you are processing a line at a time, and the inputs are split, you can use fileinput to iterate over lines of multiple files, and map to a function processing lines instead of files:

Passing one line at a time might be too slow, so we can ask map to pass chunks, and can adjust until we find a sweet-spot. Our function parses chunks of lines:

def _parse_coreset_points(lines):
    return Points([_parse_coreset_point(line) for line in lines])

def _parse_coreset_point(line):
    s = line.split()
    x, y = [int(v) for v in s]
    return CoresetPoint(x, y)

And our main function:

import fileinput

def getParsedFiles(directory):
    pool = Pool(2)

    txts = [filename for filename in os.listdir(directory):
            if filename.endswith(".txt")]

    return pool.imap(_parse_coreset_points, fileinput.input(txts), chunksize=100)
Friday, July 30, 2021
 
neon29
answered 4 Months ago
45

First off, if the bulk of your "work" comes from processing huge text files, that often means that your only meaningful speed bottleneck is your disk I/O speed, regardless of programming language.


As to the core question, it's probably too opinion-rich to "answer", but I can at least give you my own experience. I've been writing Python to do big data processing (weather and environmental data) for years. I have never once encountered significant performance problems due to the language.

Something that developers (myself included) tend to forget is that once the process runs fast enough, it's a waste of company resources to spend time making it run any faster. Python (using mature tools like pandas/scipy) runs fast enough to meet the requirements, and it's fast to develop, so for my money, it's a perfectly acceptable language for "big data" processing.

Saturday, October 9, 2021
 
Krell
answered 2 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :
 
Share