# arrayfun can be significantly slower than an explicit loop in matlab. Why?

Consider the following simple speed test for `arrayfun`:

``````T = 4000;
N = 500;
x = randn(T, N);
Func1 = @(a) (3*a^2 + 2*a - 1);

tic
Soln1 = ones(T, N);
for t = 1:T
for n = 1:N
Soln1(t, n) = Func1(x(t, n));
end
end
toc

tic
Soln2 = arrayfun(Func1, x);
toc
``````

On my machine (Matlab 2011b on Linux Mint 12), the output of this test is:

``````Elapsed time is 1.020689 seconds.
Elapsed time is 9.248388 seconds.
``````

What the?!? `arrayfun`, while admittedly a cleaner looking solution, is an order of magnitude slower. What is going on here?

Further, I did a similar style of test for `cellfun` and found it to be about 3 times slower than an explicit loop. Again, this result is the opposite of what I expected.

My question is: Why are `arrayfun` and `cellfun` so much slower? And given this, are there any good reasons to use them (other than to make the code look good)?

Note: I'm talking about the standard version of `arrayfun` here, NOT the GPU version from the parallel processing toolbox.

EDIT: Just to be clear, I'm aware that `Func1` above can be vectorized as pointed out by Oli. I only chose it because it yields a simple speed test for the purposes of the actual question.

EDIT: Following the suggestion of grungetta, I re-did the test with `feature accel off`. The results are:

``````Elapsed time is 28.183422 seconds.
Elapsed time is 23.525251 seconds.
``````

In other words, it would appear that a big part of the difference is that the JIT accelerator does a much better job of speeding up the explicit `for` loop than it does `arrayfun`. This seems odd to me, since `arrayfun` actually provides more information, ie, its use reveals that the order of the calls to `Func1` do not matter. Also, I noted that whether the JIT accelerator is switched on or off, my system only ever uses one CPU...

34

You can get the idea by running other versions of your code. Consider explicitly writing out the computations, instead of using a function in your loop

``````tic
Soln3 = ones(T, N);
for t = 1:T
for n = 1:N
Soln3(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
end
end
toc
``````

Time to compute on my computer:

``````Soln1  1.158446 seconds.
Soln2  10.392475 seconds.
Soln3  0.239023 seconds.
Oli    0.010672 seconds.
``````

Now, while the fully 'vectorized' solution is clearly the fastest, you can see that defining a function to be called for every x entry is a huge overhead. Just explicitly writing out the computation got us factor 5 speedup. I guess this shows that MATLABs JIT compiler does not support inline functions. According to the answer by gnovice there, it is actually better to write a normal function rather than an anonymous one. Try it.

Next step - remove (vectorize) the inner loop:

``````tic
Soln4 = ones(T, N);
for t = 1:T
Soln4(t, :) = 3*x(t, :).^2 + 2*x(t, :) - 1;
end
toc

Soln4  0.053926 seconds.
``````

Another factor 5 speedup: there is something in those statements saying you should avoid loops in MATLAB... Or is there really? Have a look at this then

``````tic
Soln5 = ones(T, N);
for n = 1:N
Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
end
toc

Soln5   0.013875 seconds.
``````

Much closer to the 'fully' vectorized version. Matlab stores matrices column-wise. You should always (when possible) structure your computations to be vectorized 'column-wise'.

We can go back to Soln3 now. The loop order there is 'row-wise'. Lets change it

``````tic
Soln6 = ones(T, N);
for n = 1:N
for t = 1:T
Soln6(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
end
end
toc

Soln6  0.201661 seconds.
``````

Better, but still very bad. Single loop - good. Double loop - bad. I guess MATLAB did some decent work on improving the performance of loops, but still the loop overhead is there. If you would have some heavier work inside, you would not notice. But since this computation is memory bandwidth bounded, you do see the loop overhead. And you will even more clearly see the overhead of calling Func1 there.

So what's up with arrayfun? No function inlinig there either, so a lot of overhead. But why so much worse than a double nested loop? Actually, the topic of using cellfun/arrayfun has been extensively discussed many times (e.g. here, here, here and here). These functions are simply slow, you can not use them for such fine-grain computations. You can use them for code brevity and fancy conversions between cells and arrays. But the function needs to be heavier than what you wrote:

``````tic
Soln7 = arrayfun(@(a)(3*x(:,a).^2 + 2*x(:,a) - 1), 1:N, 'UniformOutput', false);
toc

Soln7  0.016786 seconds.
``````

Note that Soln7 is a cell now.. sometimes that is useful. Code performance is quite good now, and if you need cell as output, you do not need to convert your matrix after you have used the fully vectorized solution.

So why is arrayfun slower than a simple loop structure? Unfortunately, it is impossible for us to say for sure, since there is no source code available. You can only guess that since arrayfun is a general purpose function, which handles all kinds of different data structures and arguments, it is not necessarily very fast in simple cases, which you can directly express as loop nests. Where does the overhead come from we can not know. Could the overhead be avoided by a better implementation? Maybe not. But unfortunately the only thing we can do is study the performance to identify the cases, in which it works well, and those, where it doesn't.

Update Since the execution time of this test is short, to get reliable results I added now a loop around the tests:

``````for i=1:1000
% compute
end
``````

Some times given below:

``````Soln5   8.192912 seconds.
Soln7  13.419675 seconds.
Oli     8.089113 seconds.
``````

You see that the arrayfun is still bad, but at least not three orders of magnitude worse than the vectorized solution. On the other hand, a single loop with column-wise computations is as fast as the fully vectorized version... That was all done on a single CPU. Results for Soln5 and Soln7 do not change if I switch to 2 cores - In Soln5 I would have to use a parfor to get it parallelized. Forget about speedup... Soln7 does not run in parallel because arrayfun does not run in parallel. Olis vectorized version on the other hand:

``````Oli  5.508085 seconds.
``````
Tuesday, June 1, 2021

64

As Chase said: Use the power of vectorization. You're comparing two bad solutions here.

To clarify why your apply solution is slower:

Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no conversion of type going on. I'm going a bit rough over it here, but basically the internal calculation kind of ignores the dimensions. They're just kept as an attribute and returned with the vector representing the matrix. To illustrate :

``````> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
 TRUE
``````

Now, when you use the apply, the matrix is split up internally in 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the result is combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.

Also the sapply function first uses `as.vector(unlist(...))` to convert anything to a vector, and in the end tries to simplify the answer into a suitable form. Also this takes time, hence also the sapply might be slower here. Yet, it's not on my machine.

IF apply would be a solution here (and it isn't), you could compare :

``````> system.time(loop_million <- mash(million))
user  system elapsed
0.75    0.00    0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
user  system elapsed
0.25    0.00    0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user  system elapsed
0.34    0.00    0.34
> all.equal(loop_million,sapply_million)
 TRUE
> all.equal(loop_million,sapply2_million)
 TRUE
``````
Thursday, June 3, 2021

79

No. You're mixing two important concepts:

• MATLAB is designed to perform vector operations really quickly. MATLAB is an interpreted language, which is why loops are so slow in it. MATLAB sidesteps this issue by providing extremely fast (usually written in C, and optimized for the specific architecture) and well tested functions to operate on vectors. There is really no magic here, it is just a bunch of hard work and many years of constant small improvements.

Consider for example a trivial case such as the following:

``````s=0;
for i=1:length(v),
s = s+v(i);
end
``````

and

``````sum(v)
``````

you should probably use tic and toc to time these two functions to convince yourself of the difference in runtime. There are about 10 similar commonly used functions that operate on vectors, examples are: `bsxfun`, `repmat`, `length`, `find`. Vectorization is a standard part of using MATLAB effectively. Until you can vectorize code effectively you're just a tourist in the world of MATLAB not a citizen.

• Recent versions of MATLAB provide parfor. parfor is not a silver bullet, it is a tool that can be used and misused (try parfor on the sum example above). Not all fors can be parfored. parfor is designed for task-parallel types of problems where each iteration of the loop is independent of each other iteration. This is a key requirement for using a parfor-loop.

While in many cases parfor can help a lot the type of loops that can be parfored for very large gains occur seldomly.

Thursday, July 29, 2021

27

I have always assumed preallocation is faster for any array size and never actually tested it. So, I did a simple test timing the population of various array sizes from 1x1x3 up to 20x20x3 using 1000 iterations by both appending and preallocation methods. Here's the code:

``````arraySize = 1:20;
numIteration = 1000;

timeAppend = zeros(length(arraySize), 1);
timePreAllocate = zeros(length(arraySize), 1);

for ii = 1:length(arraySize);
w = [];
tic;
for jj = 1:numIteration
w = [w; rand(arraySize(ii), arraySize(ii), 3)];
end
timeAppend(ii) = toc;
end;

for ii = 1:length(arraySize);
w = zeros(arraySize(ii) * numIteration, arraySize(ii), 3);
tic;
for jj = 1:numIteration
indexStart = (jj - 1) * arraySize(ii) + 1;
indexStop = indexStart + arraySize(ii) - 1;
w(indexStart:indexStop,:,:) = rand(arraySize(ii), arraySize(ii), 3);
end
timePreAllocate(ii) = toc;
end;

figure;
axes;
plot(timeAppend);
hold on;
plot(timePreAllocate, 'r');
legend('Append', 'Preallocate');
``````

And here are the (as expected) results: Friday, July 30, 2021

58

Approach #1

This first approach in a broad sense has two parts. The first one gets us one big long string that has all the characters from the cell array with underscores separating two cells, which is essentially implemented with `cumsum`. The second one based on `bsxfun` gets us the desired numeric output in the desired format.

The code would look something like this -

``````function out = solution_cumsum_bsxfun(list_of_words)

N = 1 + sum(list_of_words{1}=='_'); %// Number of "words" per cell
lens = cellfun('length',list_of_words); %// No. of chars [lengths] in each cell
tlens = sum(lens); %// Total number of characters [total of lengths]
cumlens = cumsum(lens); %// Cumulative lengths

%// Create an array of ones and zeros.
%// The length of this array would be equal to sum of characters in all cells.
%// The ones would be at the positions where new cells start, zeros otherwise
startpos_newcell(1,tlens)=0;
startpos_newcell(cumlens(1:end-1)+1)=1;

%// Calculate new positions for all characters in the cell array with
%// places/positions left between two cells for putting underscores
newpos = (1:tlens) + cumsum(startpos_newcell);

%// Create one long string that has all the characters from the cell arary
%// with underscores separating cells
one_str(newpos) = [list_of_words{:}];
one_str(cumlens + [1:numel(cumlens)]') = '_'; %//'#

pos_us = find(one_str=='_'); %// positions of underscores
wordend_idx = pos_us - [1:numel(pos_us)]; %// word end indices
wordlens = [wordend_idx(1) diff(wordend_idx)]; %// word lengths
max_wordlens = max(wordlens); %// maximum word length

%// Create mask where digit characters are to be placed

%// Create a numeric array with columns for each word and each row holding a digit.
%// The most significant digits to least significant going from top to bottom

%// Finally get the desired output converting each column to a single
%// number, reshaping to desired size and scaling down by a factor of 1000
out = reshape(10.^(max_wordlens-4:-1:-3)*num1,N,[]).'; %//'#

return;
``````

Approach #2

This a slight variation of approach #1 in that it does the job of first part with `sprintf`. Now, the idea came to me after seeing it in action in @chappjc's solution. I appreciate him on letting me use it in this modified version.

Here's the code -

``````function out = solution_sprintf_bsxfun(list_of_words)

N = 1 + sum(list_of_words{1}=='_'); %// Number of "words" per cell

%// Create one long string that has all the characters from the cell arary
%// with underscores separating cells
one_str = sprintf('%s_',list_of_words{:});

pos_us = find(one_str=='_'); %// positions of underscores
wordend_idx = pos_us - [1:numel(pos_us)]; %// word end indices
wordlens = [wordend_idx(1) diff(wordend_idx)]; %// word lengths
max_wordlens = max(wordlens); %// maximum word length

%// Create mask where digit characters are to be placed

%// Create a numeric array with columns for each word and each row holding a digit.
%// The most significant digits to least significant going from top to bottom

%// Finally get the desired output converting each column to a single
%// number, reshaping to desired size and scaling down by a factor of 1000
out = reshape(10.^(max_wordlens-4:-1:-3)*num1,N,[]).'; %//'#

return;
``````

Approach #3

This is an altogether new approach based on `getByteStreamFromArray` -

``````function out = solution_bytstrm_bsxfun(list_of_words)

allchars = [list_of_words{:}];
digits_array = getByteStreamFromArray(allchars);

%// At my 32-bit system getByteStreamFromArray gets the significant digits
%// from index 65 onwards. Thus, we need to crop/index it accordingly
digits_array = digits_array(65:65+numel(allchars)-1) - 48;
% --------------------------------------------------------------------
N = 1 + sum(list_of_words{1}=='_'); %// Number of "words" per cell
lens = cellfun('length',list_of_words); %// No. of chars [lengths] in each cell
cumlens = cumsum(lens)'; %//'# Cumulative lengths
% ----------------------------------------------------------
pos_us = find(digits_array==47);
starts = [[1 cumlens(1:end-1)+1];reshape(pos_us+1,N-1,[])];
ends = [reshape(pos_us-1,N-1,[]) ; cumlens];
gaps = ends(:) - starts(:) + 1;

maxg = max(gaps);

num_array(maxg,numel(lens)*N)=0;
out = reshape(10.^(maxg-4:-1:-3)*num_array,N,[]).';

return;
``````

GPU versions of the above approaches are listed next.

Approach #4

``````function out = soln_sprintf_bsxfun_gpu(list_of_words)

N = 1 + sum(list_of_words{1}=='_'); %// Number of "words" per cell

%// Create one long string that has all the characters from the cell arary
%// with underscores separating cells. Now this one uses sprintf as
%// firstly proposed in @chappjc's solution and he has agreed to let me use
%// it, so appreciating his help on this.
digits_array = gpuArray(single(sprintf('%s_',list_of_words{:})));
digits_array = digits_array - 48;

wordend_idx = pos_us - gpuArray.colon(1,numel(pos_us)); %// word end indices
wordlens = [wordend_idx(1) diff(wordend_idx)]; %// word lengths
max_wordlens = max(wordlens); %// maximum word length

%// Create a numeric array with columns for each word and each row holding a digit.
%// The most significant digits to least significant going from top to bottom
num1 =  single(zeros(max_wordlens,numel(pos_us),'gpuArray')); %//'
num1(bsxfun(@ge,gpuArray.colon(1,max_wordlens)',...

%// Finally get the desired output converting each column to a single
%// number, reshaping to desired size and scaling down by a factor of 1000
outg = reshape(10.^(max_wordlens-4:-1:-3)*num1,N,[]).'; %//'#
out = gather(outg);
``````

return;

Approach #5

``````function out = soln_bytstrm_bsxfun_gpu(list_of_words)

us_ascii_num = 95;
allchars = [list_of_words{:}];

%// At my 32-bit system getByteStreamFromArray gets the significant digits
%// from index 65 onwards. Thus, we need to crop/index it accordingly
alldigits = getByteStreamFromArray(allchars);
digits_array1 = gpuArray(alldigits(65:65+numel(allchars)-1));
% --------------------------------------------------------------------
lens = cellfun('length',list_of_words); %// No. of chars [lengths] in each cell
N = sum(digits_array1(1:lens(1))==us_ascii_num)+1; %// Number of "words" per cell
lens = gpuArray(lens);
cumlens = cumsum(lens)'; %//'# Cumulative lengths
% ----------------------------------------------------------
starts = [[1 cumlens(1:end-1)+1];reshape(pos_us+1,N-1,[])];
ends = [reshape(pos_us-1,N-1,[]) ; cumlens];
gaps = ends(:) - starts(:) + 1;

maxg = max(gaps);

num_array = zeros(maxg,numel(lens)*N,'gpuArray'); %//'
out = reshape(10.^(maxg-4:-1:-3)*num_array,N,[]).'; %//'
out = gather(out);

return;
``````
Saturday, September 11, 2021