Asked  7 Months ago    Answers:  5   Viewed   30 times

Why does CPython (no clue about other Python implementations) have the following behavior?

tuple1 = ()
tuple2 = ()                                                                                                   
dict1 = {}
dict2 = {}
list1 = []
list2 = []
# makes sense, tuples are immutable
assert(id(tuple1) == id(tuple2))
# also makes sense dicts are mutable
assert(id(dict1) != id(dict2))
# lists are mutable too
assert(id(list1) != id(list2))
assert(id(()) == id(()))
# why no assertion error on this?
assert(id({}) == id({}))
# or this?
assert(id([]) == id([]))

I have a few ideas why it may, but can't find a concrete reason why.

EDIT

To further prove Glenn's and Thomas' point:

[1] id([])
4330909912
[2] x = []
[3] id(x)
4330909912
[4] id([])
4334243440

 Answers

98

CPython is garbage collecting objects as soon as they go out of scope, so the second [] is created after the first [] is collected. So, most of the time it ends up in the same memory location.

This shows what's happening very clearly (the output is likely to be different in other implementations of Python):

class A:
    def __init__(self): print("a")
    def __del__(self): print("b")

# a a b b False
print(A() is A())
# a b a b True
print(id(A()) == id(A()))
Tuesday, June 1, 2021
 
Dail
answered 7 Months ago
39

This is a quirk of how the CPython implementation chooses to cache string literals. String literals with the same contents may refer to the same string object, but they don't have to. 'string' happens to be automatically interned when 'string ' isn't because 'string' contains only characters allowed in a Python identifier. I have no idea why that's the criterion they chose, but it is. The behavior may be different in different Python versions or implementations.

From the CPython 2.7 source code, stringobject.h, line 28:

Interning strings (ob_sstate) tries to ensure that only one string object with a given value exists, so equality tests can be one pointer comparison. This is generally restricted to strings that "look like" Python identifiers, although the intern() builtin can be used to force interning of any string.

You can see the code that does this in Objects/codeobject.c:

/* Intern selected string constants */
for (i = PyTuple_Size(consts); --i >= 0; ) {
    PyObject *v = PyTuple_GetItem(consts, i);
    if (!PyString_Check(v))
        continue;
    if (!all_name_chars((unsigned char *)PyString_AS_STRING(v)))
        continue;
    PyString_InternInPlace(&PyTuple_GET_ITEM(consts, i));
}

Also, note that interning is a separate process from the merging of string literals by the Python bytecode compiler. If you let the compiler compile the a and b assignments together, e.g. by placing them in a module or an if True:, you would find that a and b would be the same string.

Wednesday, June 9, 2021
 
jerrygarciuh
answered 6 Months ago
17

This behaviour was chosen because otherwise jQuery would regularly throw NullReference Exceptions

Almost all jQuery functions return a jQuery object as a wrapper around the Dom elements in question, so you can use dot notation.

$("#balloon").css({"color":"red"});

Now imagine $("#balloon") returned null. That means that $("#balloon").css({"color":"red"}); would throw an error, rather than silently doing nothing as you would expect.

Hence, you just gotta use .length or .size().

Friday, June 11, 2021
 
penpen
answered 6 Months ago
59

The optimization you are trying to trigger is an implementation detail of CPython and is a quite subtle thing: there are many details (e.f. one you are experiencing) which can be preventing it.

For a detailed explanation, one needs to dive into the CPython's implementation, so first I will try to give a hand-waving explanation, which should give at least the gist of what is going on. The gory details will be in the second part which highlights the important code-parts.


Let's take a look at this function, which exhibits the desired/optimized behavior

def add_str(str1, str2, n):
    for i in range(n):
        str1+=str2
        print(id(str1))
    return str1

Calling it, leads to the following output:

>>> add_str("1","2",100)
2660336425032
... 4 times
2660336425032
2660336418608
... 6 times
2660336418608
2660336361520
... 6 times
2660336361520
2660336281800
 and so on

I.e. a new string is created only every 8 addition, otherwise the old string (or as we will see the memory) is reused. The first id is printed only 6 times because it starts printing when the size of the unicode-object is 2 modulo 8 (and not 0 as in the later cases).

The first question is, if a string is immutable in CPython, how (or better when) can it be changed? Obviously, we can't change the string if it is bound to different variables - but we could change it, if the current variable is the only one reference - which can be checked pretty easily due to reference counting of CPython (and it is the reason why this optimization isn't available for other implementation which don't use reference counting).

Let's change the function above by adding a additional reference:

def add_str2(str1, str2, n):
    for i in range(n):
        ref = str1
        str1+=str2
        print(id(str1))
    return str1

Calling it leads to:

>>> add_str2("1","2",20)
2660336437656
2660337149168
2660337149296
2660337149168
2660337149296
... every time a different string - there is copying!

This actually explains your observation:

import sys
s = 'ab'
print(sys.getrefcount(s))
# 9
print(id(s))
# 2660273077752
s+='a'
print(id(s))
# 2660337158664  Different

Your string s is interned (see for example this SO-answer for more information about string interning and integer pool), and thus s is not only one "using" this string and thus this string cannot be changed.

If we avoid the interning, we can see, that the string is reused:

import sys
s = 'ab'*21  # will not be interned
print(sys.getrefcount(s))
# 2, that means really not interned
print(id(s))
# 2660336107312
s+='a'
print(id(s))
# 2660336107312  the same id!

But how does this optimization works?

CPython uses its own memory management - the pymalloc allocator, which is optimized for small objects with short lifetimes. The used memory-blocks are multiple of 8 bytes, that means if allocator is asked for only 1 byte, still 8 bytes are marked as used (more precise because of the 8-byte aligment of the returned pointers the the remaining 7 bytes cannot be used for other objects).

There is however the function PyMem_Realloc: if the allocator is asked to reallocate a 1byte-block as a 2byte-block, there is nothing to do - there were some reserved bytes anyway.

This way, if there is only one reference to the string, CPython can ask the allocator to reallocate the string and require a byte more. In 7 cases of 8 there is nothing to do for allocator and the additional byte becomes available almost free.

However, if the size of the string changes by more than 7 bytes, the copying becomes mandatory:

>>> add_str("1", "1"*8, 20)  # size change of 8
2660337148912
2660336695312
2660336517728
... every time another id

Furthermore, pymalloc falls back to PyMem_RawMalloc, which is usually the memory manager of the C-runtime, and the above optimization for strings is no longer possible:

>>> add_str("1"*512, "1", 20) #  str1 is larger as 512 bytes
2660318800256
2660318791040
2660318788736
2660318807744
2660318800256
2660318796224
... every time another id

Actually, whether the addresses are different after each reallocation depends on the memory allocator of the C-runtime and its state. If memory isn't defragmented, the chances are high, that realloc manages to extend memory without copying (but it was not the case on my machine as I did these experiments), see also this SO-post.


For the curious, here is the whole traceback of the str1+=str2 operation, which can be easily followed in a debugger:

That is what going on:

The += is compiled to BINARY_ADD-optcode and when evaluated in ceval.c, there is a hook/special handling for unicode objects (see PyUnicode_CheckExact):

case TARGET(BINARY_ADD): {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    ...
    if (PyUnicode_CheckExact(left) &&
             PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    ...

unicode_concatenate ends up calling PyUnicode_Append, which checks whether the left-operand is modifiable (which basically checks that there is only one reference, string isn't interned and some further stuff) and resizes it or creates a new unicode-object otherwise:

if (unicode_modifiable(left)
    && ...)
{
    /* append inplace */
    if (unicode_resize(p_left, new_len) != 0)
        goto error;

    /* copy 'right' into the newly allocated area of 'left' */
    _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
}
else {
    ...
    /* Concat the two Unicode strings */
    res = PyUnicode_New(new_len, maxchar);
    if (res == NULL)
        goto error;
    _PyUnicode_FastCopyCharacters(res, 0, left, 0, left_len);
    _PyUnicode_FastCopyCharacters(res, left_len, right, 0, right_len);
    Py_DECREF(left);
    ...
}

unicode_resize ends up calling resize_compact (mostly because in our case we have only ascii-characters), which ends up calling PyObject_REALLOC:

...
new_unicode = (PyObject *)PyObject_REALLOC(unicode, new_size);
...

which basically will be calling pymalloc_realloc:

static int
pymalloc_realloc(void *ctx, void **newptr_p, void *p, size_t nbytes)
{
    ...
    /* pymalloc is in charge of this block */
    size = INDEX2SIZE(pool->szidx);
    if (nbytes <= size) {
        /* The block is staying the same or shrinking.
          ....
            *newptr_p = p;
            return 1; // 1 means success!
          ...
    }
    ...
}

Where INDEX2SIZE just rounds up to the nearest multiple of 8:

#define ALIGNMENT               8               /* must be 2^N */
#define ALIGNMENT_SHIFT         3

/* Return the number of bytes in size class I, as a uint. */
#define INDEX2SIZE(I) (((uint)(I) + 1) << ALIGNMENT_SHIFT)

qed.

Friday, August 6, 2021
 
derp
answered 4 Months ago
41

Having another name pointing to the same object kills the optimisation. The optimisation basically works by resizing the string object and appending in place. If you have more than one references to that object, you can't resize without affecting the other reference. With strings being immutable, allowing this would be a serious flaw of the implementation.

temp = result

increased the reference count for the string object named by result thereby prohibiting the optimisation.

The full list of checks performed in the case of += (which eventually translates to PyUnicode_Append) can be seen in the unicode_modifiable function. Among other things, it checks that the reference count of the object is equal to one, that it isn't interned and that it isn't a string subclass.

There's a couple more checks in the if statement guarding this optimisation, if you want a more thorough list.


Though not the basic issue of your question, future readers might be curious about how to efficiently perform string concatenations. Besides similar questions on S.O, the Python FAQ also has an entry on this.

Thursday, August 12, 2021
 
akohout
answered 4 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :  
Share