Cold-start page provisioning speed test for Windows

Overview

largepages

This is a rudimentary test of cold-start memory paging on Windows that I put together for Raymond Chen's Tie, who was kind enough to ask its owner to write a blog post about the architectural reasons Windows doesn't yet have generally available large page support.

The idea behind this utility is to see how long it takes Windows to actually provide memory for use under regular 4k-page demand-paged VirtualAlloc vs. "large" 2mb-page physically-locked VirtualAlloc. This is an important metric for command-line utilities that are trying to be very fast, such as sub-second compilers, but should be generally applicable to any scenario where the memory footprint of an application is growing.
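As a rough illustration of what is being measured, here is a minimal sketch (not the utility's actual source) that times the first touch of every 4k page in a freshly allocated block. The names `TouchPages` and `TimeFirstTouchMs` are made up for this example, and the `std::malloc` fallback exists only so the sketch compiles on non-Windows systems.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: write one byte into every stride-sized step so the OS
// has to back each page with physical memory. Returns the number of writes.
std::size_t TouchPages(unsigned char* base, std::size_t total_bytes, std::size_t stride) {
    volatile unsigned char* bytes = base;  // volatile keeps the writes from being optimized away
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < total_bytes; offset += stride) {
        bytes[offset] = 1;
        ++touched;
    }
    return touched;
}

// Hypothetical helper: allocate total_bytes of ordinary demand-paged memory,
// time the first touch of every 4k page, then release the memory.
// Returns elapsed milliseconds, or a negative value if allocation failed.
double TimeFirstTouchMs(std::size_t total_bytes) {
    const std::size_t kPageSize = 4096;
#ifdef _WIN32
    void* memory = VirtualAlloc(nullptr, total_bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#else
    void* memory = std::malloc(total_bytes);  // stand-in so the sketch builds elsewhere
#endif
    if (!memory) return -1.0;

    const auto start = std::chrono::steady_clock::now();
    TouchPages(static_cast<unsigned char*>(memory), total_bytes, kPageSize);
    const auto end = std::chrono::steady_clock::now();

#ifdef _WIN32
    VirtualFree(memory, 0, MEM_RELEASE);
#else
    std::free(memory);
#endif
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling `TimeFirstTouchMs(1024ull * 1024 * 1024)` would approximate the default 1gb 4k-page case, though the numbers will not match the real utility exactly.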

To build, use the accompanying build.bat. It builds with MSVC by default, but you can un-REM the clang++ line to build with clang if you have it installed.

Usage

Option      Description
--large     Use 2mb pages (default: 4k pages)
[number]    Set total allocation size in mb (default: 1024)

Usage examples:

Command line              Description
largepages                tests provisioning of 1gb of 4k pages
largepages --large        tests provisioning of 1gb of 2mb pages
largepages 512            tests provisioning of 512mb of 4k pages
largepages --large 256    tests provisioning of 256mb of 2mb pages
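For readers who want to see how these options fit together, here is a minimal sketch of a parser that accepts the same arguments. It is an assumption about the behavior, not the utility's actual parsing code; `Options` and `ParseOptions` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical settings struct mirroring the documented options.
struct Options {
    bool use_large_pages = false;  // set by --large
    std::size_t total_mb = 1024;   // set by [number], default 1024
};

// Hypothetical parser: "--large" switches to 2mb pages, and any argument that
// is a positive decimal number sets the total allocation size in mb.
// Anything else is ignored in this sketch.
Options ParseOptions(int argc, const char* const* argv) {
    Options options;
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--large") == 0) {
            options.use_large_pages = true;
        } else {
            char* end = nullptr;
            const unsigned long long value = std::strtoull(argv[i], &end, 10);
            if (end != argv[i] && *end == '\0' && value > 0) {
                options.total_mb = static_cast<std::size_t>(value);
            }
        }
    }
    return options;
}
```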

IMPORTANT

In order to test 2mb pages, you MUST have enabled the Windows group policy setting that allows locking memory. If you do not do this, all attempts to test 2mb pages will fail, as Windows does not allow executables to do this in general. If you don't know how to enable this setting, you can use this guide from Microsoft for SQL Server, but when it comes time to choose a user, choose the user who will run largepages, not the SQL service.
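The group policy setting in question is "Lock pages in memory" (SeLockMemoryPrivilege). Granting it to a user is not enough on its own: the process also has to enable the privilege in its token before asking for large pages. The following is a minimal sketch of that sequence using documented Win32 calls, not necessarily how largepages itself does it. `EnableLockMemoryPrivilege`, `AllocateLargePages`, and `RoundUpToMultiple` are hypothetical names, and the non-Windows branches are stubs so the sketch compiles anywhere.

```cpp
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#pragma comment(lib, "advapi32.lib")  // token APIs live here (MSVC-style linking)
#endif

// Round size up to the next multiple of granule (granule must be non-zero).
std::size_t RoundUpToMultiple(std::size_t size, std::size_t granule) {
    return ((size + granule - 1) / granule) * granule;
}

// Try to enable SeLockMemoryPrivilege in the current process token.
// Returns false if the user has not been granted the privilege.
bool EnableLockMemoryPrivilege() {
#ifdef _WIN32
    HANDLE token = nullptr;
    if (!OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, &token)) {
        return false;
    }
    TOKEN_PRIVILEGES privileges = {};
    privileges.PrivilegeCount = 1;
    privileges.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    bool enabled = false;
    if (LookupPrivilegeValue(nullptr, SE_LOCK_MEMORY_NAME, &privileges.Privileges[0].Luid)) {
        // AdjustTokenPrivileges can succeed without granting anything, so the
        // last error must also be checked (ERROR_NOT_ALL_ASSIGNED means no).
        const BOOL adjusted = AdjustTokenPrivileges(token, FALSE, &privileges, 0, nullptr, nullptr);
        enabled = adjusted && GetLastError() == ERROR_SUCCESS;
    }
    CloseHandle(token);
    return enabled;
#else
    return false;  // stub: this path is Windows-only
#endif
}

// Try to allocate at least `bytes` of large-page memory; nullptr on failure.
void* AllocateLargePages(std::size_t bytes) {
#ifdef _WIN32
    const SIZE_T granule = GetLargePageMinimum();  // 0 if large pages are unsupported
    if (granule == 0 || !EnableLockMemoryPrivilege()) return nullptr;
    // The size must be a multiple of the large-page minimum.
    return VirtualAlloc(nullptr, RoundUpToMultiple(bytes, granule),
                        MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES, PAGE_READWRITE);
#else
    (void)bytes;
    return nullptr;  // stub: this path is Windows-only
#endif
}
```

If `AllocateLargePages` returns nullptr on Windows, the most likely cause is that the policy has not been applied to the current user, or that the user has not logged out and back in since it was changed.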

Discussion

In general, this test is less about 4k vs. 2mb pages, and more about the cost of provisioning pages on demand rather than in bulk. In Windows, 4k pages are provisioned on demand as they are touched, whereas 2mb pages are provisioned immediately upon allocation.
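To put numbers on that difference, here is a small worked calculation (a sketch; `PageCount` is a name made up for this example). For the default 1gb test, 4k pages mean as many as 262,144 separate pages to provision on first touch, whereas 2mb pages mean 512 pages provisioned up front.

```cpp
#include <cstdint>

// Number of pages needed to cover total_bytes, rounding up.
constexpr std::uint64_t PageCount(std::uint64_t total_bytes, std::uint64_t page_bytes) {
    return (total_bytes + page_bytes - 1) / page_bytes;
}

constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kMiB = 1024 * kKiB;
constexpr std::uint64_t kGiB = 1024 * kMiB;

// 1gb of 4k pages: 2^30 / 2^12 = 2^18 pages.
static_assert(PageCount(kGiB, 4 * kKiB) == 262144, "1gb of 4k pages");
// 1gb of 2mb pages: 2^30 / 2^21 = 2^9 pages.
static_assert(PageCount(kGiB, 2 * kMiB) == 512, "1gb of 2mb pages");
```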

In this simple benchmark, 2mb pages outperform 4k pages dramatically - by factors of around 10x. Although part of this speedup may come from the larger size of the pages themselves, it is likely that most of it comes from the fact that 2mb pages appear to be provisioned directly in VirtualAlloc (as you would expect for physically-locked addresses). This means they do not need to be faulted and provisioned later. I would assume - but don't know - that a simple flag to VirtualAlloc like "MEM_USE_IMMEDIATELY" that told VirtualAlloc to make 4k pages resident right away, like their 2mb counterparts, would make 4k performance closely resemble 2mb performance.

Alas, no such flag exists, so this is purely hypothetical at the moment.

- Casey

Issues
  • Using threaded warmup

    I'm not sure if it is relevant to this repo, but I thought I'd share some results of my experimentation:

    This PR adds a --threads command-line argument that splits the walking into a thread per core. On my 6-core machine this gives a 40% boost, from ~110ms to ~65ms, roughly the same as what I get with --rio.

    I am not sure I can explain why large pages are so much faster (~2ms), given that the kernel still needs to zero the memory, and SecureZeroMemory on an already pre-warmed (fully committed) 1GB takes ~50ms for me regardless of whether regular or large pages are used. I don't know much about CPUs/MMUs/kernels, so there is probably a lower-level way to clear the memory.

    But a more important thing about the threaded setup I have here is that, since the WarmUpThread does not write anything to the memory, it can safely race with anything else that goes on in your program. A single warmup thread that runs in the background will still likely outpace anything else that happens in your program, making sure the memory is there when you need it.

    In programs where I don't know how much memory I will need (think a compiler), I am now using an even more elaborate version of this, where each arena has a background thread that runs some megabytes ahead of what is actually used and warms up the memory there. This thread only wakes up when the arena signals via SetEvent that a new chunk should be pre-warmed, which keeps the overhead very low. This setup basically eliminated arena page faults from my performance profiles and provided significant wall-clock time savings.
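The per-core split described in this issue can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `WarmUpThreaded` is a hypothetical name, and portable `std::thread` stands in for whatever threading the PR uses. Each worker only reads one byte per page, so it never changes memory that the rest of the program may already be using.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: read one byte from every page in
// [base, base + total_bytes), splitting the pages across thread_count
// threads. Returns the total number of pages touched.
std::size_t WarmUpThreaded(const unsigned char* base, std::size_t total_bytes,
                           std::size_t page_bytes, unsigned thread_count) {
    if (thread_count == 0) thread_count = 1;
    const std::size_t page_count = (total_bytes + page_bytes - 1) / page_bytes;
    const std::size_t pages_per_thread = (page_count + thread_count - 1) / thread_count;

    std::atomic<std::size_t> touched{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t first = t * pages_per_thread;
        const std::size_t last = std::min(page_count, first + pages_per_thread);
        if (first >= last) break;  // fewer pages than threads
        workers.emplace_back([=, &touched] {
            // volatile forces a real read of each page; nothing is written.
            const volatile unsigned char* bytes = base;
            for (std::size_t page = first; page < last; ++page) {
                const unsigned char sink = bytes[page * page_bytes];
                (void)sink;
            }
            touched += last - first;
        });
    }
    for (auto& worker : workers) worker.join();
    return touched.load();
}
```

Passing `std::thread::hardware_concurrency()` as `thread_count` would give the thread-per-core arrangement mentioned above.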

    opened by grassator
Owner
Casey Muratori
Programmer at Molly Rocket on 1935 and host of Handmade Hero