A walk on memory

Python memory walk


Today, for many programmers, memory is simply the place where we store our variables. In most very high level languages memory management is automatic: the language runtime handles it through an internal process called garbage collection. Memory management sounds like an advanced, fancy concept. Honestly, it is. A lot of computer science literature is dedicated to the topic. While the field is extensive, and complex once we delve into how memory is implemented at its lowest level, the idea is straightforward. Memory management is simply a set of techniques to reserve and dispose of memory in a computer system.

Automatic memory management simply means that the programmer only needs to reserve memory when needed, and the language is responsible for freeing it when it is no longer in use.

For this reason many programmers look at memory as a big bag in which they can store new data as they need.

As an example of a very high level language we can consider Python. Take the following example

class Point:  
    _x=0
    _y=0
    def __init__(self,x,y):
        self._x=x
        self._y=y

    def __str__(self):
        return "{x: "+str(self._x)+",y:"+str(self._y)+"}"

p1=Point(1,1)  
print(p1)  

This produces the following output

python memory.py  
{x: 1,y:1}

You may ask: OK, that makes sense, but what is your point? Glad you asked. In the context of memory management this is an example of memory allocation without the respective memory cleanup. This does not mean, however, that the cleanup process is not taking place. It just means that you, as a programmer, don't need to care.

Now let's add the following method to the body of the Point class and run the program again

    def __del__(self):
        print("You didn't need to call me python will do it for you")

Now you'll notice the following

python memory.py  
{x: 1,y:1}
You didn't need to call me python will do it for you  

As expected Python, being a very nice guy, will pray for your sins and manage the memory for you.
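To see when that cleanup actually happens, here is a minimal sketch. It assumes CPython, where an object is destroyed as soon as its last reference disappears; other interpreters may run __del__ later. The events list and the alias name are illustrative and not part of the original program.

```python
events = []  # records when the finalizer runs (illustrative helper)


class Point:
    def __init__(self, x, y):
        self._x = x
        self._y = y

    def __del__(self):
        # In CPython this runs when the reference count drops to zero.
        events.append("deleted")


p1 = Point(1, 1)
alias = p1                      # a second reference to the same object

del p1                          # one reference is still alive
after_first_del = list(events)

del alias                       # last reference gone, __del__ runs now
after_second_del = list(events)

print(after_first_del)          # []
print(after_second_del)         # ['deleted']
```

Removing a name is not the same as freeing the object: the memory is only released once nothing refers to it any more.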

Interesting. If you have had contact with operating system (OS) design you'll know that the process is a core concept. A process is, in a nutshell, the representation of our program in memory, and it is created by the kernel. Also from OS design we know that memory is far from being a simple black box where we put our data. It is a complex machine. In particular, the memory of a process is divided into segments, and those segments define the address ranges where data is stored. For simplicity we usually think about two main categories: the heap and the stack of the process.

From computer science textbook drawings we know the memory layout we should expect: code and static data at the lowest addresses, the heap above them growing upward, and the stack near the top of the address space growing downward.



So an obvious question arises. Where are the variables being stored?!

Let's create some data in memory

a=12  
b=24  
c=1231231  
hello="Hello"  
world="World"  
p1=Point(1,32)  

Now we just need to see where they are being stored. But how? Well, it turns out that Python has a built-in function, id, which in CPython returns the memory address of the object a variable refers to.

We create a utility function to help us visualize the memory addresses of Python variables

def p_mem(l, v):  
    print(l + ":\t" + hex(id(v))+"\t"+str(id(v)))

The first argument is a label and the second is the variable we want to inspect. The function prints the address to stdout in hexadecimal and in decimal.

Next we need to inspect the memory layout of our process. For that we need to ask the kernel. On Linux there is an easy way to do it: the /proc/PID/maps file. You need to find the process id of your program and use it in place of PID.

For that we just need to

ps aux | grep memory.py  
balhau    20199  0.0  0.0  15936  7308 pts/10   S+   11:27   0:00 python memory.py  

So in my particular case PID=20199. The following command gets the memory segments of the process and filters out those that pertain to the executable code and the libraries loaded into memory, which we are not interested in for now.

cat /proc/20199/maps | grep -v lib | grep -v bin
56361173f000-563611763000 rw-p 00000000 00:00 0  
563612362000-563612449000 rw-p 00000000 00:00 0                          [heap]  
7f8eeaedc000-7f8eeaf1c000 rw-p 00000000 00:00 0  
7f8eeb48d000-7f8eeb603000 rw-p 00000000 00:00 0  
7f8eeb798000-7f8eeb79c000 rw-p 00000000 00:00 0  
7f8eeb98a000-7f8eeb990000 rw-p 00000000 00:00 0  
7f8eeb9e8000-7f8eeb9e9000 rw-p 00000000 00:00 0  
7ffc9a4a4000-7ffc9a4c6000 rw-p 00000000 00:00 0                          [stack]  
7ffc9a50d000-7ffc9a510000 r--p 00000000 00:00 0                          [vvar]  
7ffc9a510000-7ffc9a511000 r-xp 00000000 00:00 0                          [vdso]  
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]  

To find out where our data is being stored we will create some variables

# Some small integers
a = 1  
b = 2  
c = 3  
d = 5  
e = 1  
# Some not small integers
x = 1231231  
y = 1231232  
z = 1231233  
t = 1231233  

Now we can use the function created previously to inspect the memory address of every variable

p_mem("a", a)  
p_mem("b", b)  
p_mem("c", c)  
p_mem("d", d)  
p_mem("e", e)  
p_mem("x", x)  
p_mem("y", y)  
p_mem("z", z)  
p_mem("t", t)  

This gives us the following output

a:      0x56361237bcd8  94790233865432  
b:      0x56361237bcc0  94790233865408  
c:      0x56361237bca8  94790233865384  
d:      0x56361237bc78  94790233865336  
e:      0x56361237bcd8  94790233865432  
x:      0x5636123d2c90  94790234221712  
y:      0x5636123d2c78  94790234221688  
z:      0x5636123d2c60  94790234221664  
t:      0x5636123d2c60  94790234221664  

So the answer is, heap.
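This conclusion can be checked mechanically instead of by eye. The sketch below parses text in the /proc/PID/maps format and reports which segment contains a given address. The function names parse_maps and find_segment are illustrative, and the sample lines and addresses are the ones printed above.

```python
def parse_maps(text):
    """Parse /proc/PID/maps text into (start, end, name) tuples."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        # The sixth column, when present, names the mapping.
        name = fields[5] if len(fields) > 5 else "(anonymous)"
        segments.append((start, end, name))
    return segments


def find_segment(address, segments):
    """Return the name of the segment containing address, or None."""
    for start, end, name in segments:
        if start <= address < end:
            return name
    return None


SAMPLE = """\
56361173f000-563611763000 rw-p 00000000 00:00 0
563612362000-563612449000 rw-p 00000000 00:00 0                          [heap]
7ffc9a4a4000-7ffc9a4c6000 rw-p 00000000 00:00 0                          [stack]
"""

segments = parse_maps(SAMPLE)
print(find_segment(0x56361237bcd8, segments))   # address of a: [heap]
print(find_segment(0x5636123d2c90, segments))   # address of x: [heap]
```

On a live Linux process the same functions could be fed the contents of /proc/self/maps together with id(v) for any variable v.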

If we look closely we'll find some interesting patterns in the addresses of the variables. From a to e all the references fall in the range 0x56361237bcXX. But there is another set of variables in a different range, 0x5636123d2cXX. And it gets even more interesting. If you create two new functions

def one_stack_frame():  
    a = 1
    b = 2
    c = 3
    d = 5
    e = 1

    print("Stack Frame One")
    p_mem("a", a)
    p_mem("b", b)
    p_mem("c", c)
    p_mem("d", d)
    p_mem("e", e)
    print("Stack Frame One ended\n")

def two_stack_frame():  
    x = 1231231
    y = 1231232
    z = 1231233
    t = 1231233

    print("Stack Frame Two")
    p_mem("x", x)
    p_mem("y", y)
    p_mem("z", z)
    p_mem("t", t)
    print("Stack Frame Two ended\n")

If you call them after the previous code you'll end up with

Main Stack Frame - Small numbers  
a:    0x55e7081becd8  94450761854168  
b:    0x55e7081becc0  94450761854144  
c:    0x55e7081beca8  94450761854120  
d:    0x55e7081bec78  94450761854072  
e:    0x55e7081becd8  94450761854168  
Main Stack Frame - Big numbers  
x:    0x55e708215c30  94450762210352  
y:    0x55e708215cf0  94450762210544  
z:    0x55e708215c18  94450762210328  
t:    0x55e708215c18  94450762210328


Stack Frame One  
a:    0x55e7081becd8  94450761854168  
b:    0x55e7081becc0  94450761854144  
c:    0x55e7081beca8  94450761854120  
d:    0x55e7081bec78  94450761854072  
e:    0x55e7081becd8  94450761854168  
Stack Frame One ended

Stack Frame Two  
x:    0x55e708215c90  94450762210448  
y:    0x55e708215c78  94450762210424  
z:    0x55e708215c60  94450762210400  
t:    0x55e708215c60  94450762210400  
Stack Frame Two ended  

If you look closely you'll notice that the small numbers have exactly the same references as the ones defined outside the functions. The bigger numbers, however, have an entirely different set of references. If we check the Python source code we find this

#define PYLONG_FROM_UINT(INT_TYPE, ival) \
    do { \
        if (IS_SMALL_UINT(ival)) { \
            return get_small_int((sdigit)(ival)); \
        } \
        /* Count the number of Python digits. */ \
        Py_ssize_t ndigits = 0; \
        INT_TYPE t = (ival); \
        while (t) { \
            ++ndigits; \
            t >>= PyLong_SHIFT; \
        } \
        PyLongObject *v = _PyLong_New(ndigits); \
        if (v == NULL) { \
            return NULL; \
        } \
        digit *p = v->ob_digit; \
        while ((ival)) { \
            *p++ = (digit)((ival) & PyLong_MASK); \
            (ival) >>= PyLong_SHIFT; \
        } \
        return (PyObject *)v; \
    } while(0)

So it seems that when ival satisfies IS_SMALL_UINT the code uses the get_small_int function to construct the object. Let's keep digging

#define IS_SMALL_UINT(ival) ((ival) < NSMALLPOSINTS)
...
#define NSMALLPOSINTS           _PY_NSMALLPOSINTS
...
#define _PY_NSMALLPOSINTS           257

Interesting. So it appears that when the number is less than 257 (so at most 256) the object representing the number is given by the get_small_int function. So let's dig into it

static PyObject *  
get_small_int(sdigit ival)  
{
    assert(IS_SMALL_INT(ival));
    PyObject *v = __PyLong_GetSmallInt_internal(ival);
    Py_INCREF(v);
    return v;
}
...
static inline PyObject* __PyLong_GetSmallInt_internal(int value)  
{
    PyInterpreterState *interp = _PyInterpreterState_GET();
    assert(-_PY_NSMALLNEGINTS <= value && value < _PY_NSMALLPOSINTS);
    size_t index = _PY_NSMALLNEGINTS + value;
    PyObject *obj = (PyObject*)interp->small_ints[index];
    // _PyLong_GetZero(), _PyLong_GetOne() and get_small_int() must not be
    // called before _PyLong_Init() nor after _PyLong_Fini().
    assert(obj != NULL);
    return obj;
}

And this explains a lot. If you pay attention you'll notice that get_small_int is in the end returning an already available object from an internal cache.

PyObject *obj = (PyObject*)interp->small_ints[index];  
...
PyLongObject* small_ints[_PY_NSMALLNEGINTS + _PY_NSMALLPOSINTS];  
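The cache can also be observed from Python itself. The sketch below builds each integer twice at runtime (via int(text), so the compiler cannot merge the constants) and checks whether both results are the same object. It relies on a CPython implementation detail: the cached range is -5 to 256 in the versions I am aware of, and other interpreters may behave differently.

```python
def same_object(text):
    """Build the same integer twice and compare object identity."""
    a = int(text)
    b = int(text)
    return a is b


for text in ("-6", "-5", "0", "256", "257"):
    # Values inside the small int cache come back as the very same object.
    print(text, same_object(text))
```

In CPython this prints True for -5, 0 and 256 and False for -6 and 257, which matches the bounds checked by the assert in the C code above.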

Now a question remains. Is small_ints stored on the stack or on the heap?

To answer that let's build a small program in C.

#include <stdio.h>

void p_mem(const char *label, void *pointer)
{
    printf("%s:\t%p\t%ld\n", label, pointer, (long)pointer);
}

int main(void)
{
    int *small_ints[265]; /* local array of pointers, as in the test */
    p_mem("small_ints", small_ints);
    return 0;
}

The output

small_ints:    0x7ffff5ebb030  140737319252016  

belongs to the stack segment. So it appears that an array declared this way is stored on the stack. We also found

int  
_PyLong_Init(PyInterpreterState *interp)  
{
    for (Py_ssize_t i=0; i < NSMALLNEGINTS + NSMALLPOSINTS; i++) {
        sdigit ival = (sdigit)i - NSMALLNEGINTS;
        int size = (ival < 0) ? -1 : ((ival == 0) ? 0 : 1);

        PyLongObject *v = _PyLong_New(1);
        if (!v) {
            return -1;
        }

        Py_SET_SIZE(v, size);
        v->ob_digit[0] = (digit)abs(ival);

        interp->small_ints[i] = v;
    }
    return 0;
}

internally the _PyLong_New is using

result = PyObject_Malloc(offsetof(PyLongObject, ob_digit) +  
                             size*sizeof(digit));

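malloc. So in short, whatever segment holds the small_ints array itself (in CPython it is a field of the interpreter state, so it lives wherever that structure was allocated), the elements it points to are heap allocated. That explains why numbers in Python are stored in a process segment different from the stack.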
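The size formula used by _PyLong_New, a fixed header plus one digit per chunk of the number, can be seen from Python with sys.getsizeof. This is a small sketch assuming CPython, where sys.int_info exposes the digit layout. The variable names are illustrative.

```python
import sys

bits = sys.int_info.bits_per_digit      # bits stored in each internal digit
digit_size = sys.int_info.sizeof_digit  # bytes used by each internal digit

one_digit = 1                     # fits in a single digit
two_digits = 1 << bits            # needs a second digit
three_digits = 1 << (2 * bits)    # needs a third digit

sizes = [sys.getsizeof(n) for n in (one_digit, two_digits, three_digits)]
print(sizes)

# Each extra digit adds exactly sizeof(digit) bytes to the object.
print(sizes[1] - sizes[0], sizes[2] - sizes[1], digit_size)
```

The three differences printed on the last line should be equal, which is the size*sizeof(digit) term of the allocation shown above.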

But let's go back a bit. Something interesting happens when we call functions inside other functions. It seems that stack addresses shrink as calls nest deeper, and that the stack memory is freed when we exit a function.

Let's find out what happens on the stack when we define a function like this

import sys

def my_rec_one(level):
    sf = sys._getframe(1)  # frame object of the caller
    if level > 0:
        my_rec_one(level - 1)
    p_mem("rec_" + str(level), sf)

Let's call my_rec_one(10)

python memory.py  
rec_0:  0x7fe70190e3f0  140630140445680  
rec_1:  0x7fe70190e220  140630140445216  
rec_2:  0x7fe70190e050  140630140444752  
rec_3:  0x7fe7018f2d00  140630140333312  
rec_4:  0x7fe7018f2b30  140630140332848  
rec_5:  0x7fe7018f2960  140630140332384  
rec_6:  0x7fe7018f2790  140630140331920  
rec_7:  0x7fe7018f25c0  140630140331456  
rec_8:  0x7fe7018f23f0  140630140330992  
rec_9:  0x7fe7018f2220  140630140330528  
rec_10: 0x7fe7018eb210  140630140301840  

By using sf = sys._getframe(1) we get the frame object of the caller (sys._getframe(0) would return the current one). Because p_mem runs after the recursive call returns, the deepest level (rec_0) is printed first. Each level of the recursion has its own frame at a distinct address, so the deeper we go the more frames are alive at the same time.

What if, instead of 10, we call this function with 1000 as an argument?

...
my_rec_one(level - 1)  
RuntimeError: maximum recursion depth exceeded  

So it appears that we can't keep recursing forever. Python enforces a recursion limit (sys.getrecursionlimit(), 1000 by default) precisely because interpreter-level calls also consume the stack of the process, which is finite just as in lower level languages like C/C++.
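A short sketch of that limit in action, assuming Python 3.5 or later, where the error is reported as RecursionError (a subclass of RuntimeError). The count_depth function is illustrative.

```python
import sys


def count_depth(n):
    """Plain (non tail) recursion: one new frame per level."""
    return 1 if n == 0 else 1 + count_depth(n - 1)


limit = sys.getrecursionlimit()
print("recursion limit:", limit)

shallow = count_depth(10)          # well within the limit
print("shallow call returned", shallow)

try:
    count_depth(limit + 10)        # deeper than the interpreter allows
    overflowed = False
except RecursionError:
    overflowed = True

print("overflowed:", overflowed)
```

The limit can be raised with sys.setrecursionlimit, but that only moves the wall: a high enough value lets the process run out of real stack instead.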

So in the end recursion sucks, right? Well, not so fast. While general recursive algorithms do suffer from stack overflow problems as the stack grows, there is hope.

In a very nice post, Chris Penner shows an elegant solution for a particular case of recursion: tail recursive functions. If we adapt his code to inspect the stack

@tail_recursive
def factorial(n, accumulator=1):  
    sf = sys._getframe(1)
    p_mem("rec_"+str(n),sf)
    if n == 0:
        return accumulator
    recurse(n-1, accumulator=accumulator*n)

We will get

python memory.py  
rec_10: 0x7f98b910d620  140293916644896  
rec_9:  0x7f98b910d620  140293916644896  
rec_8:  0x7f98b910d620  140293916644896  
rec_7:  0x7f98b910d620  140293916644896  
rec_6:  0x7f98b910d620  140293916644896  
rec_5:  0x7f98b910d620  140293916644896  
rec_4:  0x7f98b910d620  140293916644896  
rec_3:  0x7f98b910d620  140293916644896  
rec_2:  0x7f98b910d620  140293916644896  
rec_1:  0x7f98b910d620  140293916644896  
rec_0:  0x7f98b910d620  140293916644896  

As we can see, not all recursive algorithms are born equal. Tail recursive ones in particular can run in constant stack space when properly optimized: every call above reports the same frame address.
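The decorator itself is not shown in this article. The following is a minimal sketch of one way tail_recursive and recurse could be written, using an exception to unwind to a loop. It is an assumption for illustration, not necessarily the code from the original post, and the _Recurse class is a hypothetical helper.

```python
import functools


class _Recurse(Exception):
    """Carries the arguments of the next call (hypothetical helper)."""

    def __init__(self, *args, **kwargs):
        self.call_args = args
        self.call_kwargs = kwargs


def recurse(*args, **kwargs):
    # Instead of calling the function again, unwind to the wrapper.
    raise _Recurse(*args, **kwargs)


def tail_recursive(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Loop in a single frame: each "recursive call" only replaces
        # the arguments and runs func again.
        while True:
            try:
                return func(*args, **kwargs)
            except _Recurse as request:
                args, kwargs = request.call_args, request.call_kwargs
    return wrapper


@tail_recursive
def factorial(n, accumulator=1):
    if n == 0:
        return accumulator
    recurse(n - 1, accumulator=accumulator * n)


print(factorial(5))                 # 120
print(factorial(3000) > 0)          # True, no recursion error
```

Because the wrapper stays in one frame and only loops, sys._getframe(1) inside factorial keeps returning that same wrapper frame, which is consistent with the identical addresses printed above.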

Walking lower in C


After a nice walk in the higher level world of Python let's jump into another realm: C. In C we have a bit more control over how we store our data in memory.

Consider the following piece of code (p_mem is the C helper shown earlier, and p_diff prints two pointers and the distance in bytes between them)

int main(int argc, char **argv)  
{
    int a = 12; p_mem("a", &a);
    int b = 24; p_mem("b", &b);
    int c = 48; p_mem("c", &c);

    p_diff("b-a",&a,&b);
    p_diff("c-b",&b,&c);
    p_diff("c-a",&a,&c);

    int x = 21; p_mem("x", &x);
    int y = 42; p_mem("y", &y);
    int z = 84; p_mem("z", &z);

    p_diff("y-x",&x,&y);
    p_diff("z-y",&y,&z); 
    p_diff("z-x",&x,&z);

}

This will print the following

a:      0x7fffb05fb860  140736152451168  
b:      0x7fffb05fb864  140736152451172  
c:      0x7fffb05fb868  140736152451176  
b-a:    p1:0x7fffb05fb860       p2:0x7fffb05fb864       diff:   4  
c-b:    p1:0x7fffb05fb864       p2:0x7fffb05fb868       diff:   4  
c-a:    p1:0x7fffb05fb860       p2:0x7fffb05fb868       diff:   8  
x:      0x7fffb05fb86c  140736152451180  
y:      0x7fffb05fb870  140736152451184  
z:      0x7fffb05fb874  140736152451188  
y-x:    p1:0x7fffb05fb86c       p2:0x7fffb05fb870       diff:   4  
z-y:    p1:0x7fffb05fb870       p2:0x7fffb05fb874       diff:   4  
z-x:    p1:0x7fffb05fb86c       p2:0x7fffb05fb874       diff:   8  

Wait a moment. Didn't we say that stack addresses shrink instead of growing? If we look closely we got the opposite. At first sight we might think that we are allocating variables on the heap. But by checking the memory segments of the process we can verify that we are indeed operating on the stack. So what is going on?

The devil is in the details. Stack addresses do shrink, but they do so at stack frame level. What exactly does this mean? Good question, lad. It means that the space for all the locals is reserved at once when the main stack frame is built, and the compiler then decides where each variable sits inside that frame. Let's change the previous code slightly.

    {
        int a = 12; p_mem("a", &a);
        int b = 24; p_mem("b", &b);
        int c = 48; p_mem("c", &c);

        p_diff("b-a",&a,&b);
        p_diff("c-b",&b,&c);
        p_diff("c-a",&a,&c);
    }

    {
        int x = 21; p_mem("x", &x);
        int y = 42; p_mem("y", &y);
        int z = 84; p_mem("z", &z);

        p_diff("y-x",&x,&y);
        p_diff("z-y",&y,&z); 
        p_diff("z-x",&x,&z);
    }

Albeit odd, this is valid C code. And it is an interesting trick. Let's look again at the output

a:      0x7fffc751b13c  140736537407804  
b:      0x7fffc751b140  140736537407808  
c:      0x7fffc751b144  140736537407812  
b-a:    p1:0x7fffc751b13c       p2:0x7fffc751b140       diff:   4  
c-b:    p1:0x7fffc751b140       p2:0x7fffc751b144       diff:   4  
c-a:    p1:0x7fffc751b13c       p2:0x7fffc751b144       diff:   8  
x:      0x7fffc751b13c  140736537407804  
y:      0x7fffc751b140  140736537407808  
z:      0x7fffc751b144  140736537407812  
y-x:    p1:0x7fffc751b13c       p2:0x7fffc751b140       diff:   4  
z-y:    p1:0x7fffc751b140       p2:0x7fffc751b144       diff:   4  
z-x:    p1:0x7fffc751b13c       p2:0x7fffc751b144       diff:   8  

If you look closely, the memory used by the a, b, c variables is the same as that used by x, y, z. This is because each pair of braces defines a block scope: the variables of the first block are dead once the block ends, so the compiler is free to reuse their stack slots for the second block. This is equivalent to allocating the first set of variables on the stack and then cleaning the stack, with the process repeated for the second block.

But wait a minute. We haven't yet proved that stack addresses shrink. Fair enough. Let's try this then

void stack_frame2(void)  
{
    int b = 2;
    p_mem("b", &b);
}

void stack_frame1(void)  
{
    int a = 1;
    p_mem("a", &a);
    stack_frame2();
}

stack_frame1();  

This will output the following

a:      0x7fff8b570904  140735531124996  
b:      0x7fff8b5708e4  140735531124964  

And now we see it. The second allocation has a lower address than the first one, as expected. This exercise shows us several interesting behaviors.

  • Creating a stack frame decreases the stack offset
  • Inside a stack frame the compiler is free to choose the position of the variables.
    • This can give the false impression that the stack does not shrink or that it behaves non deterministically.

Heap/Stack speed


There is an old saying: stack is faster than heap. But is it? Well, let's see.

First let's build a small benchmark utility

#define MEASURE(label, function)                                     \
    {                                                             \
        printf("----Start [%s]----\n", label);       \
        clock_t start, end;                                       \
        double cpu_time_used;                                     \
        start = clock();                                          \
        function();                                                  \
        end = clock();                                            \
        cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC; \
        printf("----End [%s]----\n", label);                      \
        printf("----Elapsed %f\n\n", cpu_time_used);              \
    }

Here we used the C preprocessor to group all that code under the macro called MEASURE.

Let's do a simple test. In this test we will

  • Repeat MAX_ITER_LOOP_ARRAY times
  • Create a heap array of size SIZE_ARRAY
  • Iterate all entries of the array
  • Create once a stack based array
  • Iterate all entries of the array

#define MAX_ITER_LOOP_ARRAY 1000
#define SIZE_ARRAY 10000

int arrayOnHeap()  
{
    int i, j = 0;
    int *array, *p;
    int sum = 0, aux = 0; /* sum must be initialized before it is read */
    for (i = 0; i < MAX_ITER_LOOP_ARRAY; i++)
    {
        array = (int *)malloc(sizeof(int) * SIZE_ARRAY);
        aux = sum - 12;
        sum = (sum * i) ^ sum + aux;
        for (j = 0; j < SIZE_ARRAY; j++)
        {
            *(array + j) = sum + j;
        }
        free(array);
    }
    return sum;
}

int arrayOnStack()  
{
    int array[SIZE_ARRAY];
    int i, j = 0;
    int sum=0;
    int aux = 0;
    for (i = 0; i < MAX_ITER_LOOP_ARRAY; i++)
    {
        aux = sum - 12;
        sum = (sum * i) ^ sum + aux;
        for (j = 0; j < SIZE_ARRAY; j++)
        {
            array[j] = sum + j;
        }
    }
    return sum;
}

MEASURE("Array On Stack", arrayOnStack);  
MEASURE("Array On Heap", arrayOnHeap);  

If the old saying is true we should expect the stack approach to be faster, right?

----Start [Array On Stack]----
----End [Array On Stack]----
----Elapsed 0.036032

----Start [Array On Heap]----
----End [Array On Heap]----
----Elapsed 0.034358

Well, this is unexpected, right? If heap allocation is a more complex process, how the heck is it faster? Let's probe those heap based arrays. For that we just add the following line

...
p_mem("Heap Array: ",array);  
free(array);  
...

The output
Heap Array::    0x7f2dc2d93010  139834519269392  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  
Heap Array::    0x55d2b51df6b0  94363470132912  

Oh, I see. So malloc is clever enough to notice that the block we just freed fits the new allocation request, and hands the same block back. Well well, sneaky malloc function (for more details check the source code).

This explains why the performance is comparable, or even better. Let's make a small tweak. Let's create a huge memory leak by removing the free call. If we remove an instruction the code should be even faster, right? Well, let's see

----Start [Array On Stack]----
----End [Array On Stack]----
----Elapsed 0.361855

----Start [Array On Heap]----
----End [Array On Heap]----
----Elapsed 0.523437

Wow. By removing code we made it slower? What the heck? Let's probe

Heap Array: :   0x7f3f62cbc010  139910217187344  
Heap Array: :   0x7f3f628eb010  139910213185552  
Heap Array: :   0x7f3f6251a010  139910209183760  
Heap Array: :   0x7f3f62149010  139910205181968  
Heap Array: :   0x7f3f61d78010  139910201180176  
Heap Array: :   0x7f3f619a7010  139910197178384  
Heap Array: :   0x7f3f615d6010  139910193176592  
Heap Array: :   0x7f3f61205010  139910189174800  
Heap Array: :   0x7f3f60e34010  139910185173008  
Heap Array: :   0x7f3f60a63010  139910181171216  

So, by removing the free we force malloc to actually allocate a new chunk of memory each time, instead of sneakily reusing the old one. That's why, counterintuitively, removing code makes it slower, not faster.

The key lesson here is that, due to the complexity of computer systems and the mechanisms underneath, results can vary significantly with the way we build our tests. This sensitivity is valuable, since it gives us insight into the underlying mechanisms. We should, however, stay conservative in the face of benchmark results, since they can be as deceitful as they are insightful.

Lifting lower tools


It is common knowledge that lower level languages like C don't have the modern memory management mechanisms that very high level languages like Python or Java give us.

This common knowledge is true. It is, however, sometimes confused with the stronger statement that automatic memory management is out of the equation when using lower level languages. This second proposition, while appealingly similar to the first, is not true. Automatic memory management can be done in lower as well as higher level languages. The question is not if, but who will be responsible for doing it.

Let's consider the C language. In terms of memory allocation we know that we can use the stack or the heap to hold our data. If we want to implement an automatic way of cleaning up memory we only need to consider the heap, since the stack has an automatic cleanup mechanism that runs when the stack frame ends.

For this task we can use the help of a friend: the compiler. If we take some time to read through the compiler documentation we notice some advanced features. One feature worth considering is the variable attribute.

If you check the GNU compiler documentation you'll find this

__attribute__((cleanup(callback_function)))  

While a bit cryptic this is relatively easy to explain. The previous code means

  • When a variable goes out of scope, callback_function is called with a pointer to that variable

Why is this helpful?

Let's consider the following program

typedef struct Point  
{
    int x;
    int y;
} Point;


int main(int argc, char **argv)  
{
    Point *p1 = (Point *)malloc(sizeof(Point));
    p_mem("P1",p1);
    Point *p2 = (Point *)malloc(sizeof(Point));
    p_mem("P2",p2);
}

This program has an obvious problem: both allocations end up as memory leaks. If we run the previous program under valgrind this will be the output

==273121== LEAK SUMMARY:
==273121==    definitely lost: 16 bytes in 2 blocks
==273121==    indirectly lost: 0 bytes in 0 blocks
==273121==      possibly lost: 0 bytes in 0 blocks
==273121==    still reachable: 0 bytes in 0 blocks
==273121==         suppressed: 0 bytes in 0 blocks
==273121== Rerun with --leak-check=full to see details of leaked memory

We are leaking memory because we created two Point objects and did not clean up the memory. To sort out this issue we need to adapt the code into

    Point *p1 = (Point *)malloc(sizeof(Point));
    p_mem("P1",p1);
    Point *p2 = (Point *)malloc(sizeof(Point));
    p_mem("P2",p2);

    free(p1);
    free(p2);

Now valgrind will happily report

==273337== HEAP SUMMARY:
==273337==     in use at exit: 0 bytes in 0 blocks
==273337==   total heap usage: 3 allocs, 3 frees, 1,040 bytes allocated

But yes, this work is being done by us. This is what most programmers, understandably, dislike.

This is the part where the compiler comes to help. The previous attribute can be used to create something like this macro

void scoped_free(void *pointer)  
{
    printf("Scoped free invoked %p\n",pointer);
    void **pp = (void**)pointer;
    free(*pp);
}

#define AUTO_FREE __attribute__((cleanup(scoped_free)))

...

AUTO_FREE Point *p1 = (Point *)malloc(sizeof(Point));  
p_mem("P1",p1);  
AUTO_FREE Point *p2 = (Point *)malloc(sizeof(Point));  
p_mem("P2",p2);  

If we run the previous program we will end up with

P1:     0x565280e472a0  94912349762208  
P2:     0x565280e476d0  94912349763280  
Scoped free invoked 0x7ffcf11f0210  
Scoped free invoked 0x7ffcf11f0208  

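Note that the addresses printed by scoped_free are stack addresses: the callback receives a pointer to the pointer variable, not the heap block itself, which is why it dereferences it before calling free. And Valgrind will be happy with us as well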

==274435== HEAP SUMMARY:
==274435==     in use at exit: 0 bytes in 0 blocks
==274435==   total heap usage: 3 allocs, 3 frees, 1,040 bytes allocated

We just implemented the most basic automatic memory collection possible. This has several drawbacks (hence basic). The first is that we are forced to clean up the variables when they go out of scope (here you can see a simple GC implemented in C).