Public domain cross-platform lock-free thread-caching 16-byte-aligned memory allocator implemented in C

Overview

rpmalloc - General Purpose Memory Allocator

This library provides a public domain cross-platform lock-free thread-caching 16-byte-aligned memory allocator implemented in C. The latest source code is always available at https://github.com/mjansson/rpmalloc

Created by Mattias Jansson (@maniccoder) - Support development through my GitHub Sponsors page

Platforms currently supported:

  • Windows
  • macOS
  • iOS
  • Linux
  • Android
  • Haiku

The code should be easily portable to any platform with atomic operations and an mmap-style virtual memory management API. The API used to map/unmap memory pages can be configured at runtime to use a custom implementation and mapping granularity/size.

This library is put in the public domain; you can redistribute it and/or modify it without any restrictions. Or, if you choose, you can use it under the MIT license.

Performance

We believe rpmalloc is faster than most popular memory allocators like tcmalloc, hoard, ptmalloc3 and others without causing extra allocated memory overhead in the thread caches compared to these allocators. We also believe the implementation to be easier to read and modify compared to these allocators, as it is a single source file of ~3000 lines of C code. All allocations have a natural 16-byte alignment.

Contained in a parallel repository is a benchmark utility that performs interleaved unaligned allocations and deallocations (both in-thread and cross-thread) in multiple threads. It measures the number of memory operations performed per CPU second, as well as memory overhead by comparing the virtual memory mapped with the number of bytes requested in allocation calls. The number of threads, the cross-thread deallocation rate and the allocation size limits are configured by command line arguments.

https://github.com/mjansson/rpmalloc-benchmark

Below is an example performance comparison chart of rpmalloc and other popular allocator implementations, with default configurations used.

Chart: Ubuntu 16.10, random [16, 8000] bytes, 8 cores

The benchmarks producing these numbers were run on an Ubuntu 16.10 machine with 8 logical cores (4 physical, HT). The actual numbers are not to be interpreted as absolute performance figures, but rather as relative comparisons between the different allocators. For additional benchmark results, see the BENCHMARKS file.

Configuration of the thread and global caches can be important depending on your use pattern. See CACHE for a case study and some comments/guidelines.

Required functions

Before calling any other function in the API, you MUST call an initialization function, either rpmalloc_initialize or rpmalloc_initialize_config, or you will get undefined behaviour when calling other rpmalloc entry points.

Before terminating your use of the allocator, you SHOULD call rpmalloc_finalize in order to release caches and unmap virtual memory, as well as prepare the allocator for global scope cleanup at process exit or dynamic library unload depending on your use case.

Using

The easiest way to use the library is to simply add rpmalloc.[h|c] to your project and compile them along with your sources. This contains only the rpmalloc specific entry points and does not provide internal hooks into process and/or thread creation at the moment. You are required to call these functions from your own code in order to initialize and finalize the allocator in your process and threads:

rpmalloc_initialize : Call at process start to initialize the allocator

rpmalloc_initialize_config : Optional entry point to call at process start to initialize the allocator with a custom memory mapping backend, memory page size and mapping granularity.

rpmalloc_finalize: Call at process exit to finalize the allocator

rpmalloc_thread_initialize: Call at each thread start to initialize the thread local data for the allocator

rpmalloc_thread_finalize: Call at each thread exit to finalize and release thread cache back to global cache

rpmalloc_config: Get the current runtime configuration of the allocator

Then simply use rpmalloc/rpfree and the other malloc style replacement functions. Remember that all allocations are 16-byte aligned, so there is no need to call the explicit rpmemalign/rpaligned_alloc/rpposix_memalign functions unless you need greater alignment; they are simply wrappers to make it easier to replace allocations in existing code.
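As a minimal sketch of the lifecycle (assuming a current version of the API, where rpmalloc_thread_finalize takes a flag controlling whether the thread cache is released back to the global cache):

    #include "rpmalloc.h"

    static void* worker(void* arg) {
    	rpmalloc_thread_initialize();            // per-thread setup
    	void* block = rpmalloc(128);             // naturally 16-byte aligned
    	void* wide = rpaligned_alloc(64, 4096);  // only needed for alignment > 16
    	rpfree(wide);
    	rpfree(block);
    	rpmalloc_thread_finalize(1);             // release thread cache to global cache
    	return arg;
    }

    int main(void) {
    	rpmalloc_initialize();  // also sets up thread data for the calling thread
    	// Spawn threads running worker() with your threading API of choice;
    	// here it is called inline for illustration.
    	worker(0);
    	rpmalloc_finalize();    // release caches and unmap virtual memory
    	return 0;
    }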

If you wish to override the standard library malloc family of functions and have automatic initialization/finalization of process and threads, define ENABLE_OVERRIDE to non-zero, which will include the malloc.c file in the compilation of rpmalloc.c. The list of libc entry points replaced may not be complete; use libc replacement only as a convenience for testing the library on an existing code base, not as a final solution.

For explicit first class heaps, see the rpmalloc_heap_* API under the first class heaps section, which requires RPMALLOC_FIRST_CLASS_HEAPS to be defined to 1.

Building

To compile as a static library, run the configure.py Python script, which generates a Ninja build script, then build using ninja. The ninja build produces two static libraries, one named rpmalloc and one named rpmallocwrap, where the latter includes the libc entry point overrides.

The configure + ninja build also produces two shared object/dynamic libraries. The rpmallocwrap shared library can be used with LD_PRELOAD/DYLD_INSERT_LIBRARIES to inject in a preexisting binary, replacing any malloc/free family of function calls. This is only implemented for Linux and macOS targets. The list of libc entry points replaced may not be complete, use preloading as a convenience for testing the library on an existing binary, not a final solution. The dynamic library also provides automatic init/fini of process and threads for all platforms.

The latest stable release is available in the master branch. For latest development code, use the develop branch.

Cache configuration options

Free memory pages are cached both per thread and in a global cache for all threads. The size of the thread caches is determined by an adaptive scheme where each cache is limited by a percentage of the maximum allocation count of the corresponding size class. The size of the global caches is determined by a multiple of the maximum of all thread caches. The factors controlling the cache sizes can be set by editing the individual defines in the rpmalloc.c source file for fine tuned control.

ENABLE_UNLIMITED_CACHE: By default defined to 0, set to 1 to make all caches infinite, i.e. never release spans to the global cache unless the thread finishes, and never unmap memory pages back to the OS. Highest performance but largest memory overhead.

ENABLE_UNLIMITED_GLOBAL_CACHE: By default defined to 0, set to 1 to make the global caches infinite, i.e. never unmap memory pages back to the OS.

ENABLE_UNLIMITED_THREAD_CACHE: By default defined to 0, set to 1 to make the thread caches infinite, i.e. never release spans to the global cache unless the thread finishes.

ENABLE_GLOBAL_CACHE: By default defined to 1, enables the global cache shared between all threads. Set to 0 to disable the global cache and directly unmap pages evicted from the thread cache.

ENABLE_THREAD_CACHE: By default defined to 1, enables the per-thread cache. Set to 0 to disable the thread cache and directly unmap pages no longer in use (also disables the global cache).

ENABLE_ADAPTIVE_THREAD_CACHE: Introduces a simple heuristic in the thread cache size, keeping 25% of the high water mark for each span count class.

Other configuration options

Detailed statistics are available if ENABLE_STATISTICS is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in rpmalloc.c. This will cause a slight overhead in runtime to collect statistics for each memory operation, and will also add 4 bytes overhead per allocation to track sizes.

Integer safety checks on all calls are enabled if ENABLE_VALIDATE_ARGS is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in rpmalloc.c. If enabled, size arguments to the global entry points are verified not to cause integer overflows in calculations.

Asserts are enabled if ENABLE_ASSERTS is defined to 1 (default is 0, or disabled), either on compile command line or by setting the value in rpmalloc.c.

To include malloc.c in compilation and provide overrides of the standard library malloc entry points, define ENABLE_OVERRIDE to 1. To enable automatic initialization and finalization of process and threads in order to preload the library into executables using the standard library malloc, define ENABLE_PRELOAD to 1.

To enable the runtime configurable memory page and span sizes, define RPMALLOC_CONFIGURABLE to 1. By default, memory page size is determined by system APIs and memory span size is set to 64KiB.

To enable support for first class heaps, define RPMALLOC_FIRST_CLASS_HEAPS to 1. By default, the first class heap API is disabled.

Huge pages

The allocator has support for huge/large pages on Windows, Linux and macOS. To enable it, pass a non-zero value in the config value enable_huge_pages when initializing the allocator with rpmalloc_initialize_config. If the system does not support huge pages it will be automatically disabled. You can query the status by looking at enable_huge_pages in the config returned from a call to rpmalloc_config after initialization is done.
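A minimal sketch, assuming the rpmalloc_config_t fields declared in rpmalloc.h:

    #include <string.h>
    #include "rpmalloc.h"

    int main(void) {
    	rpmalloc_config_t config;
    	memset(&config, 0, sizeof(config));  // zeroed fields keep their defaults
    	config.enable_huge_pages = 1;        // request huge/large pages
    	rpmalloc_initialize_config(&config);

    	// Huge pages are silently disabled if unsupported; query the actual state
    	int huge_pages_active = rpmalloc_config()->enable_huge_pages;
    	(void)huge_pages_active;

    	rpmalloc_finalize();
    	return 0;
    }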

Quick overview

The allocator is similar in spirit to tcmalloc from the Google Performance Toolkit. It uses separate heaps for each thread and partitions memory blocks according to a preconfigured set of size classes, up to 2MiB. Larger blocks are mapped and unmapped directly. Allocations for different size classes will be served from different sets of memory pages, each "span" of pages being dedicated to one size class. Spans of pages can flow between threads when the thread cache overflows and are released to a global cache, or when the thread ends. Unlike tcmalloc, single blocks do not flow between threads, only entire spans of pages.

Implementation details

The allocator is based on a fixed but configurable page alignment (defaults to 64KiB) and 16-byte block alignment, where all runs of memory pages (spans) are mapped to this alignment boundary. On Windows this is automatically guaranteed up to 64KiB by the VirtualAlloc granularity, and on mmap systems it is achieved by oversizing the mapping and aligning the returned virtual memory address to the required boundaries. By aligning to a fixed size, the free operation can locate the header of the memory span without having to do a table lookup (as tcmalloc does) by simply masking out the low bits of the address (for 64KiB this would be the low 16 bits).
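For illustration, the technique in miniature (a sketch with a hardcoded 64KiB span size, not the actual rpmalloc internals):

    #include <stdint.h>

    #define SPAN_SIZE ((uintptr_t)65536)  // default 64KiB span size/alignment

    // Locate the span header for any block by masking out the low 16 bits
    static void* span_from_block(void* block) {
    	return (void*)((uintptr_t)block & ~(SPAN_SIZE - 1));
    }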

Memory blocks are divided into three categories. For a 64KiB span size/alignment the small blocks are [16, 1024] bytes, medium blocks (1024, 32256] bytes, and large blocks (32256, 2097120] bytes. The three categories are further divided into size classes. If the span size is changed, the small block classes remain but medium blocks go from (1024, span size] bytes.

Small blocks have a size class granularity of 16 bytes, in 64 buckets. Medium blocks have a granularity of 512 bytes, in 61 buckets (by default). Large blocks have the same granularity as the configured span size (default 64KiB). All allocations are fitted to these size class boundaries (an allocation of 36 bytes will allocate a block of 48 bytes). Each small and medium size class has an associated span (meaning a contiguous set of memory pages) configuration describing how many pages the size class will allocate each time the cache is empty and a new allocation is requested.
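As an illustration of the fitting (a sketch assuming the default granularities; the real size class tables live in rpmalloc.c):

    #include <stddef.h>

    #define SMALL_GRANULARITY  16    // small size classes step by 16 bytes
    #define SMALL_SIZE_LIMIT   1024  // largest small block
    #define MEDIUM_GRANULARITY 512   // medium size classes step by 512 bytes

    // Round a request up to its size class boundary, e.g. 36 -> 48 bytes
    static size_t size_class_bound(size_t size) {
    	size_t granularity = (size <= SMALL_SIZE_LIMIT) ? SMALL_GRANULARITY : MEDIUM_GRANULARITY;
    	return (size + (granularity - 1)) & ~(granularity - 1);
    }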

Spans for small and medium blocks are cached in four levels to avoid calls to map/unmap memory pages. The first level is a per thread single active span for each size class. The second level is a per thread list of partially free spans for each size class. The third level is a per thread list of free spans. The fourth level is a global list of free spans.

Each span for a small and medium size class keeps track of how many blocks are allocated/free, as well as a list of which blocks are free for allocation. To avoid locks, each span is completely owned by the allocating thread, and all cross-thread deallocations will be deferred to the owner thread through a separate free list per span.
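The deferral can be pictured as a lock-free push onto a per-span list, as in this simplified sketch using C11 atomics (the names and layout are illustrative, not the actual rpmalloc implementation):

    #include <stdatomic.h>

    typedef struct span_s {
    	_Atomic(void*) free_list_deferred;  // blocks freed by non-owner threads
    	// ... block metadata, owner heap, etc.
    } span_t;

    // Non-owner thread: push a freed block onto the span's deferred list.
    // The owner thread later swaps the whole list out and reuses the blocks.
    static void deallocate_defer(span_t* span, void* block) {
    	void* head = atomic_load_explicit(&span->free_list_deferred, memory_order_relaxed);
    	do {
    		*(void**)block = head;  // link block to the current list head
    	} while (!atomic_compare_exchange_weak_explicit(
    	    &span->free_list_deferred, &head, block,
    	    memory_order_release, memory_order_relaxed));
    }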

Large blocks, or super spans, are cached in two levels. The first level is a per thread list of free super spans. The second level is a global list of free super spans.

Memory mapping

By default the allocator uses OS APIs to map virtual memory pages as needed, either VirtualAlloc on Windows or mmap on POSIX systems. If you want to use your own custom memory mapping provider you can use rpmalloc_initialize_config and pass function pointers to map and unmap virtual memory. These functions should reserve and free the requested number of bytes.

The returned memory address from the memory map function MUST be aligned to the memory page size and the memory span size (whichever is larger), both of which are configurable. Either provide the page and span sizes during initialization using rpmalloc_initialize_config, or use rpmalloc_config to find the required alignment, which is equal to the maximum of the page and span sizes. The span size MUST be a power of two in the [4096, 262144] range, and be a multiple or divisor of the memory page size.

Memory mapping requests are always done in multiples of the memory page size. You can specify a custom page size when initializing rpmalloc with rpmalloc_initialize_config, or pass 0 to let rpmalloc determine the system memory page size using OS APIs. The page size MUST be a power of two.
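Below is a minimal POSIX sketch of a custom mapping backend. It assumes the memory_map/memory_unmap callback signatures declared in rpmalloc.h, a default 64KiB span size, and that release equals the originally mapped size on final release; error handling is omitted:

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include "rpmalloc.h"

    #define SPAN_ALIGN 65536  // default 64KiB span size/alignment

    // Oversize the mapping so a span-aligned address can be returned; report
    // the alignment padding through *offset, which rpmalloc hands back on unmap.
    static void* my_memory_map(size_t size, size_t* offset) {
    	char* ptr = mmap(0, size + SPAN_ALIGN, PROT_READ | PROT_WRITE,
    	                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    	if (ptr == MAP_FAILED)
    		return 0;
    	uintptr_t aligned = ((uintptr_t)ptr + (SPAN_ALIGN - 1)) & ~(uintptr_t)(SPAN_ALIGN - 1);
    	*offset = aligned - (uintptr_t)ptr;
    	return (void*)aligned;
    }

    static void my_memory_unmap(void* address, size_t size, size_t offset, size_t release) {
    	if (release) {
    		// Final release: unmap the original oversized range
    		munmap((char*)address - offset, release + SPAN_ALIGN);
    	} else {
    		// Decommit only; keep the address space reserved
    		madvise(address, size, MADV_DONTNEED);
    	}
    }

    int main(void) {
    	rpmalloc_config_t config;
    	memset(&config, 0, sizeof(config));
    	config.memory_map = my_memory_map;
    	config.memory_unmap = my_memory_unmap;
    	rpmalloc_initialize_config(&config);
    	rpfree(rpmalloc(100));
    	rpmalloc_finalize();
    	return 0;
    }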

To reduce system call overhead, memory spans are mapped in batches controlled by the span_map_count configuration variable (which defaults to the DEFAULT_SPAN_MAP_COUNT value if 0, which in turn is sized according to the cache configuration define, defaulting to 64). If the memory page size is larger than the span size, the number of spans to map in a single call will be adjusted to guarantee a multiple of the page size, and the spans will be kept mapped until the entire span range can be unmapped in one call (to avoid trying to unmap partial pages).

On macOS and iOS mmap requests are tagged with tag 240 for easy identification with the vmmap tool.

Span breaking

Super spans (spans a multiple > 1 of the span size) can be subdivided into smaller spans to fulfill a need to map a new span of memory. By default the allocator will greedily grab and break any larger span from the available caches before mapping new virtual memory. However, spans can currently not be glued together to form larger super spans again. Subspans can traverse the cache and be used by different threads individually.

A span that is a subspan of a larger super span can be individually decommitted to reduce physical memory pressure when the span is evicted from caches and scheduled to be unmapped. The entire original super span keeps track of the subspans it is broken up into, and when the entire range is decommitted the super span will be unmapped. This supports platforms like Windows, which require the entire virtual memory range that was mapped in a call to VirtualAlloc to be unmapped in one call to VirtualFree, while still allowing decommit of individual pages in subspans (if the page size is smaller than the span size).

If you use a custom memory map/unmap function you need to take this into account by looking at the release parameter given to the memory_unmap function. It is set to 0 when decommitting individual pages, and to the total super span byte size when finally releasing the entire super span memory range.

Memory fragmentation

There is no memory fragmentation by the allocator in the sense that it will not leave unallocated and unusable "holes" in the memory pages by calls to allocate and free blocks of different sizes. This is due to the fact that the memory pages allocated for each size class are split up in perfectly aligned blocks which are not reused for a request of a different size. The block freed by a call to rpfree will always be immediately available for an allocation request within the same size class.

However, there is memory fragmentation in the sense that a request for x bytes followed by a request for y bytes, where x and y are at least one size class apart, will return blocks that are at least one memory page apart in virtual address space. Only blocks of the same size class will potentially be within the same memory page span.

rpmalloc keeps an "active span" and free list for each size class. As a result, back-to-back allocations will most likely be served from within the same span of memory pages (unless the span runs out of free blocks). The rpmalloc implementation will also use any "holes" in memory pages in semi-filled spans before using a completely free span.

First class heaps

rpmalloc provides a first class heap type with an explicit heap control API. Heaps are maintained with calls to rpmalloc_heap_acquire and rpmalloc_heap_release, and allocations/frees are done with rpmalloc_heap_alloc and rpmalloc_heap_free. See the rpmalloc.h documentation for the full list of functions in the heap API. The main use case of explicit heap control is to scope allocations in a heap and release everything with a single call to rpmalloc_heap_free_all, without having to maintain ownership of memory blocks. Note that the heap API is not thread-safe; the caller must make sure that each heap is only used in a single thread at any given time.
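A minimal sketch, assuming compilation with RPMALLOC_FIRST_CLASS_HEAPS defined to 1 and the rpmalloc_heap_* functions declared in rpmalloc.h:

    #include "rpmalloc.h"

    // Scope a group of allocations to one heap and release them in a single call
    static void scoped_work(void) {
    	rpmalloc_heap_t* heap = rpmalloc_heap_acquire();

    	void* a = rpmalloc_heap_alloc(heap, 128);
    	void* b = rpmalloc_heap_alloc(heap, 4096);
    	(void)a;
    	(void)b;

    	rpmalloc_heap_free_all(heap);  // frees every block allocated from this heap
    	rpmalloc_heap_release(heap);
    }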

Producer-consumer scenario

Compared to some other allocators, rpmalloc does not suffer as much from a producer-consumer thread scenario where one thread allocates memory blocks and another thread frees the blocks. In some allocators the free blocks need to traverse both the thread cache of the thread doing the free operation as well as the global cache before being reused in the allocating thread. In rpmalloc the freed blocks will be reused as soon as the allocating thread needs to get new spans from the thread cache. This enables faster release of completely freed memory pages, as blocks in a memory page will not be aliased between different owning threads.

Best case scenarios

Threads that keep ownership of allocated memory blocks within the thread and free the blocks from the same thread will have optimal performance.

Threads that have allocation patterns where the difference between the high and low water marks of memory usage fits within the thread cache thresholds in the allocator will never touch the global cache except during thread init/fini, and have optimal performance. Tweaking the cache limits can be done on a per-size-class basis.

Worst case scenarios

Since each thread cache maps spans of memory pages per size class, a thread that allocates just a few blocks of each size class (16, 32, ...) for many size classes will never fill each bucket, and thus map a lot of memory pages while only using a small fraction of the mapped memory. However, the wasted memory will always be less than 4KiB (or the configured memory page size) per size class as each span is initialized one memory page at a time. The cache for free spans will be reused by all size classes.

Threads that perform a lot of allocations and deallocations in a pattern that has a large difference between the high and low water marks, where that difference is larger than the thread cache size, will put a lot of contention on the global cache. What will happen is that the thread cache will overflow on each low water mark, causing pages to be released to the global cache, then underflow on the high water mark, causing pages to be re-acquired from the global cache. This can be mitigated by changing the MAX_SPAN_CACHE_DIVISOR define in the source code (at the cost of higher average memory overhead).

Caveats

VirtualAlloc has an internal granularity of 64KiB. However, mmap lacks this granularity control, and the implementation instead oversizes the memory mapping with the configured span size in order to always be able to return a memory area with the required alignment. Since the extra memory pages are never touched this will not result in extra committed physical memory pages, but only increases the virtual memory address space used.

All entry points assume the passed values are valid; for example, passing an invalid pointer to free would most likely result in a segmentation fault. The library does not try to guard against errors!

To support global scope data doing dynamic allocation/deallocation such as C++ objects with custom constructors and destructors, the call to rpmalloc_finalize will not completely terminate the allocator but rather empty all caches and put the allocator in finalization mode. Once this call has been made, the allocator is no longer thread safe and expects all remaining calls to originate from global data destruction on main thread. Any spans or heaps becoming free during this phase will be immediately unmapped to allow correct teardown of the process or dynamic library without any leaks.

Other languages

Johan Andersson at Embark has created a Rust wrapper available at rpmalloc-rs

Stas Denisov has created a C# wrapper available at Rpmalloc-CSharp

License

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or distribute this software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means.

In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to http://unlicense.org

You can also use this software under the MIT license if public domain is not recognized in your country.

The MIT License (MIT)

Copyright (c) 2017 Mattias Jansson

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Comments
  • Significantly increased memory usage following #102

    So, I'm testing the new changes to the cross-thread frees (#102) and I noticed that memory usage is significantly increased following that. E.g. WebPositive (our WebKit-based browser) uses ~64MB after loading GitHub before the patch, and ~80MB afterwards (enabling/disabling the adaptive thread cache appears to have no effect.) Other applications appear similarly affected, though not as badly (+10% instead of +25% seems more regular.)

    A 25% increase in memory usage is not insignificant; is there any chance this could be improved? Probably WebKit on Linux behaves much the same way, in case you'd like to take a look yourself.

    performance 
    opened by waddlesplash 61
  • Address of 0x30 ending up in `free_list`

    I don't know how to provide much information on this as I don't understand rpmalloc nor am I trying to.

    I'm using Tracy on Ubuntu 18.04, gcc7. I'm encountering a crash when asking Tracy to store a copy of a string. Specifically, through use of ZoneText or TracyMessage calls.

    The best information I'm aware of that I can provide is this valgrind output, which is hopefully enough to help. If not, please let me know how else I can be of assistance.

    valgrind output

    potential bug 
    opened by RobotCaleb 33
  • Reduce peak commit on memory-alloc intensive apps

    I've recently integrated rpmalloc and mimalloc into LLVM, please see thread: https://reviews.llvm.org/D71786 I discovered along the way that rpmalloc takes more memory than mimalloc when linking with LLD & ThinLTO. For example:

    	              | Working Set (B) | Private Working Set (B) | Commit (B) | Virtual Size (B)
    rpmalloc - 36-threads |     25.1 GB     |          16.5 GB        |   19.9 GB  |      37.4 GB
    mimalloc - 36-threads |     25.6 GB     |          16.3 GB        |   18.3 GB  |      33.3 GB
    rpmalloc - 72-threads |     33.6 GB     |          25.1 GB        |   28.5 GB  |      46 GB
    mimalloc - 72-threads |     30.5 GB     |          21.2 GB        |   23.4 GB  |      38.4 GB
    

    There's a difference in terms of execution time, in favor of mimalloc. It seems the difference is proportional to the difference of the commit size between the two.

    To repro (windows bash, but you could probably repro all this on Linux as well),

    $ git clone https://github.com/llvm/llvm-project.git
    # Download patch from https://reviews.llvm.org/D71786
    $ git apply D71786.txt
    
    # ROOT is where LLVM was checked out by git clone above, modify accordingly
    $ set ROOT=d:/llvm-project
    $ set LLVM=c:/Program Files/LLVM
    
    # Ensure cmake, python 3.7, gnuWin32, git, ninja build and a LLVM package (llvm.org) are installed first.
    $ cd %ROOT%
    $ mkdir stage1
    $ cd stage1
    
    # Feel free to fiddle the following flags according to your hardware config
    $ set OPT_AVX=/GS- /D_ITERATOR_DEBUG_LEVEL=0 /arch:AVX
    $ set OPT_SKYLAKE=/GS- /D_ITERATOR_DEBUG_LEVEL=0 -Xclang -O3 -Xclang -fwhole-program-vtables -fstrict-aliasing -march=skylake-avx512
    
    $ cmake -GNinja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_OPTIMIZED_TABLEGEN=ON -DLLVM_ENABLE_ASSERTIONS=ON -DLLVM_ENABLE_LIBXML2=OFF -DCMAKE_C_COMPILER="%LLVM%/bin/clang-cl.EXE" -DCMAKE_CXX_COMPILER="%LLVM%/bin/clang-cl.EXE" -DCMAKE_LINKER="%LLVM%/bin/lld-link.EXE" -DLLVM_ENABLE_PROJECTS="llvm;clang;lld" -DLLVM_ENABLE_PDB=ON -DLLVM_ENABLE_LLD=ON -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_CXX_FLAGS="%OPT_AVX%" -DCMAKE_C_FLAGS="%OPT_AVX%" 
    
    $ ninja check-all
    # This should yield no errors, or if it does, they were there before on trunk
    
    # Now build the stage2:
    $ cd %ROOT%
    $ mkdir stage2
    $ cd stage2
    
    $ set LLVM_LOCAL=%ROOT%/stage1
    $ cmake -G"Ninja" %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_OPTIMIZED_TABLEGEN=true -DLLVM_ENABLE_LIBXML2=OFF -DLLVM_USE_CRT_RELEASE=MT -DCMAKE_C_COMPILER="%LLVM_LOCAL%/bin/clang-cl.exe" -DCMAKE_CXX_COMPILER="%LLVM_LOCAL%/bin/clang-cl.exe" -DCMAKE_LINKER="%LLVM_LOCAL%/bin/lld-link.exe" -DLLVM_ENABLE_LLD=ON -DLLVM_ENABLE_PDB=ON -DLLVM_ENABLE_PROJECTS="llvm;clang;lld" -DCMAKE_CXX_FLAGS="%OPT_SKYLAKE%" -DCMAKE_C_FLAGS="%OPT_SKYLAKE%" -DLLVM_ENABLE_LTO=THIN -DCLANG_TABLEGEN="%LLVM_LOCAL%/bin/clang-tblgen.exe" -DLLVM_TABLEGEN="%LLVM_LOCAL%/bin/llvm-tblgen.exe"
    
    $ ninja check-all
    # This should take a lot longer, because we're now building the LLVM .exes with ThinLTO.
    # Ensure you've got at least 150 GB free on the SSD. The ThinLTO cache takes a lot of space.
    
    # Prepare for the test (pwd is still in the stage2 folder)
    $ rm bin\clang.exe
    $ ninja clang -v
    # This will print the cmd-line to use to link clang. Copy-paste it in a file stage2\link.rsp.
    # While the above ninja cmd-line links, duplicate stage2\CMakeFiles\clang.rsp to another file, say clang2.rsp. This is a temp file which is deleted once linking ends.
    # Reference clang2.rsp instead of clang.rsp from stage2\link.rsp
    # Ensure you remove the LTO cache flag from link.rsp
    
    $ bin\lld-link @link.rsp /time
    # This is your final test, which will use the stage2 lld-link.exe to link the stage2 clang.exe.
    # Try it once to see the time it takes. You would probably want to re-run it with rpmalloc's stats enabled.
    

    To compare with mimalloc, you'd need to compile first mimalloc as a static lib (disable /GL). You can reference it then in place of rpmalloc, by using the following patch (simply revert this file from the previous patch, before applying):

    diff --git a/llvm/lib/Support/CMakeLists.txt b/llvm/lib/Support/CMakeLists.txt
    index 26332d4f539..77c7645592c 100644
    --- a/llvm/lib/Support/CMakeLists.txt
    +++ b/llvm/lib/Support/CMakeLists.txt
    @@ -51,6 +51,31 @@ else()
       set(Z3_LINK_FILES "")
     endif()
     
    +# if(LLVM_ENABLE_RPMALLOC)
    +#   set(RPMALLOC_FILES rpmalloc/rpmalloc.c)
    +# else()
    +#   set(RPMALLOC_FILES "")
    +# endif()
    +set(ALLOC_BENCH_PATH "D:/git/rpmalloc-benchmark/benchmark/")
    +
    +# mimalloc
    +set(ALLOCATOR_FILES "${ALLOC_BENCH_PATH}mimalloc/benchmark.c")
    +set(ALLOCATOR_INCLUDES "${ALLOC_BENCH_PATH}mimalloc/include/" "${ALLOC_BENCH_PATH}")
    +set(system_libs ${system_libs} "D:/git/mimalloc/out/msvc-x64/Release/mimalloc-static.lib" "-INCLUDE:malloc")
    +
    +# rpmalloc
    +# set(ALLOCATOR_FILES "${ALLOC_BENCH_PATH}rpmalloc/benchmark.c" "${ALLOC_BENCH_PATH}rpmalloc/rpmalloc.c")
    +# set(ALLOCATOR_INCLUDES "${ALLOC_BENCH_PATH}rpmalloc/" "${ALLOC_BENCH_PATH}")
    +
    +# tcmalloc
    +# set(ALLOCATOR_FILES "${ALLOC_BENCH_PATH}gperftools/benchmark.c")
    +# set(ALLOCATOR_INCLUDES "${ALLOC_BENCH_PATH}gperftools/" "${ALLOC_BENCH_PATH}")
    +# set(system_libs ${system_libs} "D:/git/rpmalloc-benchmark/benchmark/gperftools/x64/Release-Override/libtcmalloc_minimal.lib" "-INCLUDE:malloc")
    +
    +# ptmalloc3
    +# set(ALLOCATOR_FILES "${ALLOC_BENCH_PATH}ptmalloc3/benchmark.c" "${ALLOC_BENCH_PATH}ptmalloc3/malloc.c" "${ALLOC_BENCH_PATH}ptmalloc3/ptmalloc3.c")
    +# set(ALLOCATOR_INCLUDES "${ALLOC_BENCH_PATH}ptmalloc3/" "${ALLOC_BENCH_PATH}" "${ALLOC_BENCH_PATH}ptmalloc3/sysdeps/windows")
    +
     add_llvm_component_library(LLVMSupport
       AArch64TargetParser.cpp
       ABIBreak.cpp
    @@ -163,6 +188,8 @@ add_llvm_component_library(LLVMSupport
       xxhash.cpp
       Z3Solver.cpp
     
    +  ${ALLOCATOR_FILES}
    +
     # System
       Atomic.cpp
       DynamicLibrary.cpp
    @@ -197,3 +224,8 @@ if(LLVM_WITH_Z3)
         ${Z3_INCLUDE_DIR}
         )
     endif()
    +
    +  target_include_directories(LLVMSupport SYSTEM
    +   PRIVATE
    +   ${ALLOCATOR_INCLUDES}
    +   )
    \ No newline at end of file
    

    You don't need to rebuild stage1, only stage2. You don't need to call cmake again, you can simply call ninja all -C stage2 after applying the mimalloc modification above. You can then switch between rpmalloc and mimalloc by commenting-out the relevant sections in this file, and re-running ninja.

    At this point, you should see a difference in terms of peak Committed memory. I'm using UIforETW (https://github.com/google/UIforETW) to take profiles on Windows.

    You can probably repro this on Linux as well, and maybe linking a smaller program instead of clang.exe if you want faster iteration. Please don't hesitate to poke me by email if any of these doesn't work or if you're stuck.

    performance 
    opened by aganea 29
  • Need better implementation of rpaligned_alloc

    The current implementation is rather naive (it wastes up to alignment bytes by simply adding the missing alignment to the pointer) and also prevents requesting alignments larger than the page size.

    Some applications, e.g. WebKit, explicitly request memory with rather large alignments (16KB on 32-bit, 64K on 64-bit), and on Haiku where the 32-bit page size is 4K (and where I'm experimenting which making rpmalloc the system allocator), this obviously fails.

    opened by waddlesplash 26
  • "Leak"? of 128kB heap areas

    Here's a file showing a lot of unused memory in heap areas on a running Haiku system: https://dev.haiku-os.org/attachment/ticket/15264/out.txt

    The fields by order are: area ID, area name, reserved size, RAM size (i.e. touched pages.) Units are kB. As you can see, there are quite literally thousands (about ~5000) heap areas (rpmalloc is the only thing which names its areas this way) of size 128KB, and none are using more than 36kb, and most at 20kb. So there is 500MB of reserved memory in these that are not being used.

    Are these the small size classes I guess? But then I'd expect more than 36kb to be used in at least some of them. There is also about 700MB wasted in 4MB heap areas towards the end of the file, but all those have varying usages so that is certainly a different issue.

    performance 
    opened by waddlesplash 23
  • Random access violations in custom test on windows

    Hello,

    I've been trying to do some testing with your library and have run into issues while running the following:

    std::size_t nAllocations = 1000000;
    TEST(rpmalloc_test_suite, cross_thread_bench)
    {
        rpmalloc_initialize();
        using namespace stk::thread;
        std::size_t nOSThreads = std::thread::hardware_concurrency();
        work_stealing_thread_pool<moodycamel_concurrent_queue_traits> pool(rpmalloc_thread_initialize, rpmalloc_thread_finalize, nOSThreads);
        using future_t = boost::future<void>;
        std::vector<future_t> futures;
        futures.reserve(nAllocations);
        {
            GEOMETRIX_MEASURE_SCOPE_TIME("rpmalloc_cross_thread_32_bytes");
            for (size_t i = 0; i < nAllocations; ++i)
            {
                auto pAlloc = rpmalloc(32);
                futures.emplace_back(pool.send(i++%pool.number_threads(),
                    [pAlloc]()
                    {
                        rpfree(pAlloc);
                    }));
            }
    
            boost::for_each(futures, [](const future_t& f) { f.wait(); });
        }
        rpmalloc_finalize();
    }
    

    Essentially, allocate the block in one thread and deallocate in another. I get random access violations though. Does my usage seem correct?

    bug 
    opened by brandon-kohn 17
  • segfault in python 2.7 shutdown

    I get a segfault after running this script: https://github.com/pixelb/ps_mem

    Program received signal SIGSEGV, Segmentation fault.
    rpmalloc_finalize () at rpmalloc/rpmalloc.c:1345
    1345                            _memory_deallocate_deferred(heap, 0);
    (gdb) bt
    #0  rpmalloc_finalize () at rpmalloc/rpmalloc.c:1345
    #1  0x00007ffff7de9a3a in ?? () from /lib64/ld-linux-x86-64.so.2
    #2  0x00007ffff7657940 in ?? () from /lib64/libc.so.6
    #3  0x00007ffff765799a in exit () from /lib64/libc.so.6
    #4  0x00007ffff76421e8 in __libc_start_main () from /lib64/libc.so.6
    #5  0x000000000040063a in _start ()
    (gdb) 
    
    bug 
    opened by justanerd 15
  • Segfault deallocating assign()ed std::string on macOS when built with -O2 or -O3

    On macOS (regardless of x86-64 or arm64), the below snippet will segfault when run with clang -L/path/to/rpmalloc/lib/macos/release -lrpmallocwrap -O2 -g -std=c++17 -lstdc++ test.cc && ./a.out (gcc will also repro).

    This does not segfault with -O0 or -O1.

    This was last tested using rpmalloc at ad42d51579e51aafcd8e7c0ae19d9a3e969350fd built with either python3 configure.py -c release -a x86-64 --lto && ninja or python3 configure.py -c release -a arm64 --lto && ninja, however this appears to reproduce even without --lto.

    Tested on:

    • arm64 (M1 Pro) / macOS 12.3.1 / Apple clang version 13.1.6 (clang-1316.0.21.2.3)
    • arm64 (M1 Pro) / macOS 12.3.1 / gcc-11 (Homebrew GCC 11.3.0) 11.2.0
    • x86-64 (M1 Pro) / macOS 12.3.1 / Apple clang version 13.1.6 (clang-1316.0.21.2.3) with arch -x86_64 and cross-compiled rpmalloc
    • x86-64 (Intel) / macOS 12.2.1 / Apple clang version 13.0.0 (clang-1300.0.29.3)
    #include <cstdio>
    #include <string>
    
    void test() {
      std::string foo;
      foo.assign("___________________________________________");
      printf("foo at %p\n", &foo);
    }
    
    int main(int argc, char** argv) {
      test();
      printf("safe!\n");
    }
    

    This segfault does not repro with any of these variations (tested only on arm64) below:

    void test() {
      std::string foo;
      foo.assign("hello");
      printf("foo at %p\n", &foo);
    }
    
    void test() {
      std::string foo = "___________________________________________";
      printf("foo at %p\n", &foo);
    }
    

    For sake of completion, this segfault does repro as follows, however after the printing of "safe!".

    int main(int argc, char** argv) {
      std::string foo;
      foo.assign("___________________________________________");
      printf("safe!\n");
    }
    

    The output from lldb on arm64 is as follows (note I have modified void rpfree(void *ptr) to add printf("rpfree called with %p\n", ptr);):

     % lldb ./a.out
    (lldb) target create "./a.out"
    Current executable set to '/Users/richard/Projects/gn/a.out' (arm64).
    (lldb) r
    Process 3344 launched: '/Users/richard/Projects/gn/a.out' (arm64)
    foo at 0x16fdff298
    rpfree called with 0x600000c00060
    a.out was compiled with optimization - stepping may behave oddly; variables may not be available.
    Process 3344 stopped
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffffffffffff)
        frame #0: 0x0000000100004e30 a.out`_rpmalloc_deallocate [inlined] _rpmalloc_deallocate_small_or_medium(span=0x0000600000c00000, p=0x0000600000c00060) at rpmalloc.c:2434:28 [opt]
       2431 #if RPMALLOC_FIRST_CLASS_HEAPS
       2432         int defer = (span->heap->owner_thread && (span->heap->owner_thread != get_thread_id()) && !span->heap->finalize);
       2433 #else
    -> 2434         int defer = ((span->heap->owner_thread != get_thread_id()) && !span->heap->finalize);
       2435 #endif
       2436         if (!defer)
       2437                 _rpmalloc_deallocate_direct_small_or_medium(span, p);
    Target 0: (a.out) stopped.
    (lldb) bt
    * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffffffffffff)
      * frame #0: 0x0000000100004e30 a.out`_rpmalloc_deallocate [inlined] _rpmalloc_deallocate_small_or_medium(span=0x0000600000c00000, p=0x0000600000c00060) at rpmalloc.c:2434:28 [opt]
        frame #1: 0x0000000100004e08 a.out`_rpmalloc_deallocate(p=<unavailable>) at rpmalloc.c:2527:3 [opt]
        frame #2: 0x0000000100007d30 a.out`main [inlined] void std::__1::__libcpp_operator_delete<void*>(__args=<unavailable>) at new:245:3 [opt]
        frame #3: 0x0000000100007d2c a.out`main [inlined] void std::__1::__do_deallocate_handle_size<>(__ptr=<unavailable>) at new:269:10 [opt]
        frame #4: 0x0000000100007d2c a.out`main [inlined] std::__1::__libcpp_deallocate(__ptr=<unavailable>, __align=1) at new:285:14 [opt]
        frame #5: 0x0000000100007d2c a.out`main [inlined] std::__1::allocator<char>::deallocate(__p=<unavailable>) at allocator.h:117:13 [opt]
        frame #6: 0x0000000100007d2c a.out`main [inlined] std::__1::allocator_traits<std::__1::allocator<char> >::deallocate(__p=<unavailable>) at allocator_traits.h:282:13 [opt]
        frame #7: 0x0000000100007d2c a.out`main [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string(this="___________________________________________") at string:2275:9 [opt]
        frame #8: 0x0000000100007d20 a.out`main [inlined] std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >::~basic_string(this="___________________________________________") at string:2270:1 [opt]
        frame #9: 0x0000000100007d20 a.out`main [inlined] test() at test.cc:7:1 [opt]
        frame #10: 0x0000000100007cf4 a.out`main(argc=<unavailable>, argv=<unavailable>) at test.cc:10:3 [opt]
        frame #11: 0x00000001001a1088 dyld`start + 516
    (lldb) p span
    (span_t *) $0 = 0x0000600000c00000
    (lldb) p span->heap
    (heap_t *) $1 = 0xffffffffffffffff
    (lldb) 
    
    compatibility 
    opened by rzhw 12
  • Improve cross thread use cases

    Improve cross thread use cases, either by redesigning the deferred deallocation scheme or introducing an optional opportunistic locking scheme for freeing blocks in other heaps.

    enhancement 
    opened by mjansson 12
  • GCD and rpmalloc_thread_initialize

    Hi,

    I'm trying to include rpmalloc into my ios project however, since we use GCD I'm not sure where to call rpmalloc_thread_initialize.

    Without this call the app crashes in all the callers that are not performed in the main thread where rpmalloc_initialize was called.

    Thanks!

    need test case 
    opened by vitorhugomagalhaes 11
  • Error while running configure.py.

    I am getting an error when I attempt to run configure.py:

    Traceback (most recent call last):
      File "configure.py", line 12, in <module>
        generator = generator.Generator(project = 'rpmalloc', variables = [('bundleidentifier', 'com.rampantpixels.rpmalloc.$(binname)')])
      File "build\ninja\generator.py", line 104, in __init__
        self.toolchain.initialize(project, archs, configs, includepaths, dependlibs, libpaths, variables, self.subninja)
      File "build\ninja\msvc.py", line 75, in initialize
        self.build_toolchain()
      File "build\ninja\msvc.py", line 131, in build_toolchain
        installed_versions = vslocate.get_vs_installations()
      File "build\ninja\vslocate.py", line 92, in get_vs_installations
        dll = ctypes.WinDLL(dll_path)
      File "C:\Users\ssaha\AppData\Local\Programs\Python\Python38-32\lib\ctypes\__init__.py", line 373, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: [WinError 193] %1 is not a valid Win32 application

    opened by Sidharth-Saha 9
  • Some usage annoyances while compiling & using rpmalloc on Sony's consoles

    Hi! While integrating rpmalloc in a project that's compiled for Sony's consoles, I found some issues that even though minor, could improve the user experience if fixed. So I thought I would drop you a line and see what you think about them:

    1. The first issue is that I wasted a good couple days of weird crashes due to not reading the comments / documentation about the requirement for all returned OS blocks to be aligned to the span size. Obviously this is on me for not RTFM, but since this is (as I later learned) such a fundamental detail to the inner workings of the allocator, I would have expected for an assert to trigger somewhere, specially since I run with both ENABLE_VALIDATE_ARGS & ENABLE_ASSERTS always on on debug builds.

    2. When I finally figured out what was going on, I resorted to doing the exact same computation you do in _memory_map_os to manually align the blocks returned from the platform. However, the computation there uses the configured allocation granularity to figure out whether it's even necessary to add any padding or not. But, alas, there's no access to configured granularity anywhere that I could find. This also strikes me as a bit weird since there may be platforms where this needs to be specified in the config anyway? (Windows is a good example where this is different from the page size, if a Windows platform wanted to rely on the config interface alone, it would have no way of specifying this to rpmalloc_initialize). In my case I resorted to adding this additional field to the config struct.

    One last thing, just to point out that the comment about the offset parameter returned from the memory_map function callback seems to be wrong / outdated. It's specified that this must be <= UINT16_MAX, but as I could see in the source code, this seems to be stored in either a u32 or a size_t field, so this requirement seems unnecessary. (I didn't check if this has been recently corrected, my version of the source code is easily a couple years old).

    Anyway, all pretty minor stuff like I said.. thanks for the great work & thought you've obviously put into this project!

    enhancement compatibility 
    opened by n00bmind 1
  • Deadlock related to span->free_list_deferred

    I have integrated rpmalloc into our game-engine and have it working on Windows, XBOX*, Switch, & Playstation*. The project that I am testing it with does a lot of multithreaded memory allocation. Thread contention around memory allocation is what prompted me to experiment with your allocator. It has solved our thread contention issues, but I am getting an intermittent deadlock in _rpmalloc_span_extract_free_list_deferred() & _rpmalloc_deallocate_defer_small_or_medium(). It deadlocks waiting for span->free_list_deferred to be something other than INVALID_POINTER. I have seen the deadlock on both x86_64 & aarch64. _rpmalloc_span_extract_free_list_deferred() seemed a bit suspect to me so I attempted to make it 'safer' by changing it to the following and removing the if (atomic_load_ptr(&span->free_list_deferred)) in _rpmalloc_allocate_from_heap_fallback()

    static void
    _rpmalloc_span_extract_free_list_deferred(span_t* span) {
    	// We need acquire semantics on the CAS operation since we are interested in the list size
    	// Refer to _rpmalloc_deallocate_defer_small_or_medium for further comments on this dependency
    	void* free_list;
    	do {
    		free_list = atomic_exchange_ptr_acquire(&span->free_list_deferred, INVALID_POINTER);
    	} while (free_list == INVALID_POINTER);
    	if (free_list != 0) {
    		span->free_list = free_list;
    		span->used_count -= span->list_size;
    		span->list_size = 0;
    	}
    	atomic_store_ptr_release(&span->free_list_deferred, 0);
    }
    

    That seemed to help make the deadlock less frequent but I am still getting it. I have not seen it on Windows/XBOX, but I have seen it on Switch and Playstation (PS4/PS5) (I haven't tested Windows/XBox nearly as much, so it may be happening there as well). Any help with this would be greatly appreciated, it would be great if we could make rpmalloc work for us.

    compatibility 
    opened by bbi-michaelanderson 7
  • Compatibility with Microsoft runtime overloading

    Hi, I try to use rpmalloc on a VS2019 C++ Win32 project (x64). I added the rpmalloc.vcxproj to my solution and set ENABLE_OVERRIDE=1 and ENABLE_PRELOAD=1 in both the rpmalloc lib project as well as my exe project. In the exe project I include rpmalloc.h in stdafx.h (no MFC).

    First gotcha is that I can't use rpmalloc with the debug build, because VS C++ uses _free_dbg in global delete. I therefore use rpmalloc only in the release build.

    After a very short time (less than a second) I get a crash in rpmalloc when delete/free is called. With rpmalloc 1.4.0 the span pointer isn't correct, which leads to an access violation. With the current development branch (commit 9adf4e0aed0a60b22c9a4f10d20d674ca55a9f8c) I get a crash because the heap member in the span struct is zero while being dereferenced.

    What can I do to find out what's the culprit (I already set ENABLE_ASSERTS=1, which didn't help)?

    The application is using boost::asio and a lot of threads/executors, and is working well with the VS2019 onboard heap library. Sometimes within long term tests (24h+) I get delays when the PC wasn't used (in the sense that somebody moved the mouse and clicked something) for hours and then the user e.g. opens a folder in Windows Explorer. I wanted to try out if these delays (we're talking roughly <= 50ms) could be due to a heap lock, because the process is excessively using dynamic heap data (new and delete).

    Thank you and best regards.

    compatibility 
    opened by aholzinger 18
  • Interposing standard entry points broken on macOS

    In certain situations, my system (OS X 10.14) tries to free a pointer with rpfree that was allocated with the system allocator. These can occur depending on how rpmalloc is integrated (currently, librpmallocwrap.dylib seems broken, so I'm using homebrewed integration). There are other situations where this is helpful too, e.g. https://github.com/jemalloc/jemalloc/blob/7014f81e172290466e1a28118b622519bbbed2b0/src/zone.c#L135 . Is there some way to determine this currently? If not, it'd be great if there was.

    compatibility 
    opened by michaeleisel 5
  • Different preallocation and caching strategy for systems that do not overcommit

    rpmalloc sort of assumes that memory blocks mapped in do not commit to a physical page until touched, and maps large chunks upfront to reduce system call overhead.

    For systems that do commit physical pages immediately this is wasteful and needs a different strategy.

    enhancement 
    opened by mjansson 0
  • Add malloc_trim

    http://man7.org/linux/man-pages/man3/malloc_trim.3.html

    This is a GNU extension, but it seems a lot of other allocators support it as well. It would be useful to have in e.g. system wide low-memory conditions.

    enhancement 
    opened by waddlesplash 3
Releases
  • 1.4.4(May 30, 2022)

    Fixed an issue where an external thread concurrently freeing a block to the deferred list of a heap at the same time as the owner thread freeing the last used block could cause a race condition ending in the span being freed multiple times.

    Added fallback path when huge page allocation fails to allocate and promote new pages as a transparent huge page

    Added option to name pages on Linux and Android.

    Compilation compatibility updates for MSYS2, FreeBSD, MacOS/clang and tinycc.

  • 1.4.3(Aug 6, 2021)

    • Fixed an issue where certain combinations of memory page size and span map counts could cause a deadlock in the mapping of new memory pages.

    • Tweaked cache levels and avoid setting spans as reserved in a heap when the heap already has spans in the thread cache to improve cache usage.

    • Prefer flags to more actively evict physical pages in madvise calls when partially unmapping span ranges on POSIX systems.

  • 1.4.2(Apr 25, 2021)

    • Fixed an issue where calling _exit might hang the main thread cleanup in rpmalloc if another worker thread was terminated while holding exclusive access to the global cache.

    • Improved caches to prioritize main spans in a chunk to avoid leaving main spans mapped due to remaining subspans in caches.

    • Improve cache reuse by allowing large blocks to use caches from slightly larger cache classes.

    • Fixed an issue where thread heap statistics would go out of sync when a free span was deferred to another thread heap

    • API breaking change - added flag to rpmalloc_thread_finalize to avoid releasing thread caches. Pass nonzero value to retain old behaviour of releasing thread caches to global cache.

    • Add option to config to set a custom error callback for assert failures (if ENABLE_ASSERT)

    • Minor code changes to improve codegen

  • 1.4.1(Aug 26, 2020)

    • Dual license as both released to public domain or under MIT license

    • Allow up to 4GiB memory page sizes

    • Fix an issue where large page sizes in conjunction with many threads waste a lot of memory (previously each heap occupied an entire memory page, now heaps can share a memory page)

    • Fixed compilation issue on macOS when ENABLE_PRELOAD is set but not ENABLE_OVERRIDE

    • New first class heap API allowing explicit heap control and release of entire heap in a single call

    • Added rpaligned_calloc function for aligned and zero initialized allocations

    • Fixed natural alignment check in rpaligned_realloc to 16 bytes (check was 32, which is wrong)

    • Minor performance improvements for all code paths by simplified span handling

    • Minor performance improvements for aligned allocations with alignment less than or equal to 128 bytes by utilizing natural block alignments

    • Refactor finalization to be compatible with global scope data causing dynamic allocations and frees, like C++ objects with custom ctors/dtors

    • Refactor thread and global cache to be array based instead of list based for improved performance and cache size control

    • Added missing C++ operator overloads with ENABLE_OVERRIDE when using Microsoft C++ runtimes

    • Fixed issue in pvalloc override that could return less than a memory page in usable size

    • Added a missing null check in the non-hot allocation code paths

  • 1.4.0(Aug 8, 2019)

    • Improved cross thread deallocations by using per-span atomic free list to minimize thread contention and localize free list processing to actual span

    • Change span free list to a linked list, conditionally initialized one memory page at a time

    • Reduce number of conditionals in the fast path allocation and avoid touching heap structure at all in best case

    • Avoid realigning block in deallocation unless span marked as used by alignment > 32 bytes

    • Revert block granularity and natural alignment to 16 bytes to reduce memory waste

    • Bugfix for preserving data when reallocating a previously aligned (>32 bytes) block

    • Use compile time span size by default for improved performance, added build time RPMALLOC_CONFIGURABLE preprocessor directive to reenable configurability of span and page size

    • More detailed statistics

    • Disabled adaptive thread cache by default

    • Fixed an issue where reallocations of large blocks could read outside of memory page boundaries

    • Tag mmap requests on macOS with tag 240 for identification with vmmap tool

  • 1.3.2(May 29, 2019)

    Support for alignment equal to or larger than the memory page size, up to the span size

    Added adaptive thread cache size based on thread allocation load

    Support preconfigured huge pages

    Fix 32-bit MSVC Windows builds using incorrect 64-bit pointer CAS

    Updated compatibility with clang toolchain and Python 3

    Moved active heap counter to statistics

    Moved repository to https://github.com/mjansson/rpmalloc

  • 1.3.1(Apr 28, 2018)

    Support for huge pages

    Bugfix to old size in aligned realloc and usable size for aligned allocs when alignment > 32

    Use C11 atomics for non-Microsoft compilers

    Remove remaining spin-lock like control for caches, all operations are now lock free

    Allow large deallocations to cross thread heaps

  • 1.3.0(Feb 14, 2018)

    Make span size configurable and all spans equal in size, removing span size classes and streamlining the thread cache.

    Allow super spans to be reserved in advance and split up into multiple used spans to reduce the number of system calls. This will not increase committed physical pages, only reserved virtual memory space.

    Allow super spans to be reused for allocations of lower size, breaking up the super span and storing remainder in thread cache in order to reduce load on global cache and reduce cache overhead.

    Fixed an issue where an allocation of zero bytes would cause a segmentation fault from indexing size class array with index -1.

    Fixed an issue where an allocation of maximum large block size (2097120 bytes) would index the heap cache array out of bounds and potentially cause a segmentation fault depending on earlier allocation patterns.

    Fixed an issue where memory pages at start of aligned span run was not completely unmapped on POSIX systems.

    Fixed an issue where spans were not correctly marked as owned by the heap after traversing the global span cache.

    Added function to access the allocator configuration after initialization to find default values.

    Removed allocated and reserved statistics to reduce code complexity.

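    A minimal sketch of reading back the effective configuration. This assumes the rpmalloc_config() accessor returning a const rpmalloc_config_t* with page_size and span_size fields; check rpmalloc.h for the exact struct layout in your version:

        #include <stdio.h>
        #include "rpmalloc.h"

        int main(void) {
            rpmalloc_initialize();
            /* Read back the effective configuration, including detected defaults */
            const rpmalloc_config_t* config = rpmalloc_config();
            printf("page size: %zu\nspan size: %zu\n",
                   config->page_size, config->span_size);
            rpmalloc_finalize();
            return 0;
        }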
  • 1.2.2 (Jan 24, 2018)

    • Add a configurable memory mapper providing map/unmap of memory pages, defaulting to VirtualAlloc/mmap if none is provided. This allows rpmalloc to be used in contexts where memory is provided by internal means (see the sketch after this list).

    • Avoid using explicit memory map addresses to mmap on POSIX systems. Instead, use overallocation of virtual memory space to gain 64KiB alignment of spans. Since the extra pages are never touched, this should have no impact on real memory usage and removes the possibility of contention in virtual address space with other uses of mmap.

    • Detect the system memory page size at initialization, and allow the page size to be set explicitly in initialization. This allows the allocator to be used as a sub-allocator where the page granularity should be lower to reduce the risk of wasting unused memory ranges, and adds support for modern iOS devices where the page size is 16KiB.

    • Add a build time option to use memory guards, surrounding each allocated block with a dead zone which is checked for consistency when the block is freed.

    • Always finalize the thread on allocator finalization, fixing an issue when re-initializing the allocator in the same thread.

    • Add basic allocator test cases.

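    A minimal sketch of plugging in a custom mapper via rpmalloc_initialize_config, backed here by plain mmap/munmap. The memory_map/memory_unmap hook signatures below are assumptions based on the rpmalloc_config_t fields (a size plus an offset value that the allocator passes back on unmap); verify against rpmalloc.h for your version:

        #include <string.h>
        #include <sys/mman.h>
        #include "rpmalloc.h"

        static void* my_memory_map(size_t size, size_t* offset) {
            void* ptr = mmap(0, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            *offset = 0; /* this simple mapper makes no alignment adjustment */
            return (ptr == MAP_FAILED) ? 0 : ptr;
        }

        static void my_memory_unmap(void* address, size_t size, size_t offset,
                                    size_t release) {
            (void)offset;
            /* A production mapper would decommit (e.g. madvise) when release is 0 */
            if (release)
                munmap(address, size);
        }

        int main(void) {
            rpmalloc_config_t config;
            memset(&config, 0, sizeof(config));
            config.memory_map = my_memory_map;
            config.memory_unmap = my_memory_unmap;
            config.page_size = 16384; /* e.g. explicit 16KiB pages */
            rpmalloc_initialize_config(&config);

            void* block = rpmalloc(1024);
            rpfree(block);
            rpmalloc_finalize();
            return 0;
        }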
  • 1.2.1 (Dec 20, 2017)

    • Split library into an rpmalloc-only base library and a preloadable malloc wrapper library.

    • Add argument validation to valloc and pvalloc.

    • Change ARM memory barrier instructions to dmb ish/ishst for compatibility.

    • Improve preload compatibility on Apple platforms by using a pthread key for TLS in the wrapper library.

    • Fix ABA issue in orphaned heap linked list.

  • 1.2 (Jun 20, 2017)

    • Dual license under MIT

    • Fix init/fini checks in malloc entry points for preloading into binaries that do malloc/free in init or fini sections

    • Fixed an issue where freeing a block which had been realigned during allocation (due to an alignment request greater than 16) caused the free block link to be written in the wrong place in the block, causing the next allocation from the size class to return a bad pointer

    • Improve the mmap 64KiB granularity enforcement loop to avoid excessive iterations

    • Fix undersized adaptive cache counter array for large blocks in the heap structure, causing a potential abort on exit

    • Avoid hysteresis in realloc by overallocating on small size increases

    • Add entry point for realloc with alignment and optional flags to avoid preserving content (see the sketch after this list)

    • Add valloc/pvalloc/cfree wrappers

    • Add C++ new/delete wrappers

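    A minimal sketch of the flagged realloc entry point. The rpaligned_realloc(ptr, alignment, size, oldsize, flags) signature and the RPMALLOC_NO_PRESERVE flag are taken from later releases of rpmalloc.h; verify against your version of the header:

        #include "rpmalloc.h"

        int main(void) {
            rpmalloc_initialize();
            void* buffer = rpaligned_alloc(64, 4096);
            /* Grow to 64KiB without copying the old contents, since the
               caller intends to overwrite the buffer anyway (oldsize may
               be passed as 0 if unknown) */
            buffer = rpaligned_realloc(buffer, 64, 65536, 4096,
                                       RPMALLOC_NO_PRESERVE);
            rpfree(buffer);
            rpmalloc_finalize();
            return 0;
        }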
  • 1.1 (Apr 5, 2017)

    • Add four cache presets (unlimited, performance priority, size priority and no cache)

    • Slight performance improvement by dependent class index lookup for merged size classes

    • Adaptive cache size per thread and per size class for improved memory efficiency, and release thread caches to the global cache in fixed size batches

    • Merged caches for small/medium classes using 64KiB spans with 64KiB large blocks

    • Require thread initialization with rpmalloc_thread_initialize, and add pthread hooks for automatic init/fini (see the sketch after this list)

    • Added rpmalloc_usable_size query entry point

    • Fix invalid old size in memory copy during realloc

    • Optional statistics and integer overflow guards

    • Optional asserts for easier debugging

    • Provide malloc entry point replacements and automatic init/fini hooks, and an LD_PRELOAD-able dynamic library build

    • Improve documentation and add additional code comments

    • Move benchmarks to a separate repo, https://github.com/rampantpixels/rpmalloc-benchmark

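    A minimal sketch of manual per-thread initialization (the pthread hooks can do this automatically when using the wrapper library). The no-argument rpmalloc_thread_finalize() signature is assumed from this era of the API; later releases may differ:

        #include <pthread.h>
        #include "rpmalloc.h"

        static void* worker(void* arg) {
            (void)arg;
            /* Every thread that allocates must set up its heap first */
            rpmalloc_thread_initialize();
            void* scratch = rpmalloc(256);
            rpfree(scratch);
            /* ...and tear it down before the thread exits */
            rpmalloc_thread_finalize();
            return 0;
        }

        int main(void) {
            rpmalloc_initialize(); /* also initializes the calling thread */
            pthread_t thread;
            pthread_create(&thread, 0, worker, 0);
            pthread_join(thread, 0);
            rpmalloc_finalize();
            return 0;
        }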
  • 1.0 (Mar 15, 2017)
