I've read in different places that it is done for "performance reasons", but I still wonder what are the particular cases where performance get improved by this 16-byte alignment. Or, in any case, what were the reasons why this was chosen.
edit: I'm thinking I wrote the question in a misleading way. I wasn't asking about why the processor does things faster with 16-byte aligned memory, this is explained everywhere in the docs. What I wanted to know instead, is how the enforced 16-byte alignment is better than just letting the programmers align the stack themselves when needed. I'm asking this because from my experience with assembly, the stack enforcement has two problems: it is only useful by less 1% percent of the code that is executed (so in the other 99% is actually overhead); and it is also a very common source of bugs. So I wonder how it really pays off in the end. While I'm still in doubt about this, I'm accepting peter's answer as it contains the most detailed answer to my original question.