Bistro: A fast, flexible toolkit for scheduling and running distributed tasks

Overview

This README is a very abbreviated introduction to Bistro. Visit http://facebook.github.io/bistro for a more structured introduction, and for the docs.

Bistro is a toolkit for making distributed computation systems. It can schedule and run distributed tasks, including data-parallel jobs. It enforces resource constraints for worker hosts and data-access bottlenecks. It supports remote worker pools, low-latency batch scheduling, dynamic shards, and a variety of other possibilities. It has command-line and web UIs.

Some of the diverse problems that Bistro solved at Facebook:

  • Safely run map-only ETL tasks against live production databases (MySQL, HBase, Postgres).
  • Provide a resource-aware job queue for batch CPU/GPU compute jobs.
  • Replace Hadoop for a periodic online data compression task on HBase, improving time-to-completion and reliability by over 10x.

You can run Bistro "out of the box" to suit a variety of different applications, but even so, it is a tool for engineers. You should be able to get started just by reading the documentation, but when in doubt, look at the code --- it was written to be read.

Some applications of Bistro may involve writing small plugins to make it fit your needs. The code is built to be extensible. Ask for tips, and we'll do our best to help. In return, we hope that you will send a pull request to allow us to share your work with the community.

Early release

Although Bistro has been in production at Facebook for over 3 years, the present public release is partial, including just the server components.

Install the dependencies and build

Bistro needs a 64-bit Linux, Folly, FBThrift, Proxygen, boost, and libsqlite3. You need 2-3GB of RAM to build, as well as GCC 4.9 or above.

build/README.md documents the usage of Docker-based scripts that build Bistro on Ubuntu 14.04, 16.04, and Debian 8.6. You should be able to follow very similar steps on most modern Linux distributions.

If you run into dependency problems, look at bistro/cmake/setup.cmake for a full list of Bistro's external dependencies (direct and indirect). We gratefully accept patches that improve Bistro's builds, or add support for various flavors of Linux and Mac OS.

The binaries will be in bistro/cmake/{Debug,Release}. See http://cmake.org/Wiki/CMake_Useful_Variables#Compilers_and_Tools for an explanation of the available build targets. You can run Bistro's unit tests by running ctest in those directories.

Your first Bistro run

This is just one simple demo, but Bistro is a very flexible tool. Refer to http://facebook.github.io/bistro/ for more in-depth information.

We are going to start a single Bistro scheduler talking to one 'remote' worker.

Aside: The scheduler tracks jobs, and data shards on which to execute them. It also makes sure only to start new tasks when the required resources are available. The remote worker is a module for executing centrally scheduled work on many machines. The UI can aggregate many schedulers at once, so using remote workers is optional --- a share-nothing, many-scheduler system is sometimes preferable.

Let's make a task to execute:

cat <<EOF > ~/demo_bistro_task.sh
#!/bin/bash
echo "I got these arguments: \$@"
echo "stderr is also logged" 1>&2
echo "done" > "\$2"  # Report the task status to Bistro via a named pipe
EOF
chmod u+x ~/demo_bistro_task.sh
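
To see what such a script does before involving Bistro, it can be run by hand. The following is a minimal sketch, assuming only what the script above shows: the second argument is the path of a named pipe to which the task writes its status. The other arguments (`placeholder_node` and `'{}'`) and the temporary paths are made-up stand-ins, not the values a real worker passes.

```shell
# Local dry run of a Bistro-style task script, without a scheduler or worker.
# Assumption: only "$2" (the status pipe) comes from the README; every other
# argument below is a placeholder.
workdir=$(mktemp -d)

cat > "$workdir/task.sh" <<'EOF'
#!/bin/bash
echo "I got these arguments: $@"
echo "stderr is also logged" 1>&2
echo "done" > "$2"  # Report the task status via the named pipe
EOF
chmod u+x "$workdir/task.sh"

# The task reports its status through a named pipe (FIFO).
mkfifo "$workdir/status_pipe"

# Run the task in the background: its write to the pipe blocks until the
# pipe is opened for reading below.
"$workdir/task.sh" placeholder_node "$workdir/status_pipe" '{}' \
  > "$workdir/stdout.log" 2> "$workdir/stderr.log" &

status=$(cat "$workdir/status_pipe")
wait
echo "status: $status"
```

The task's stdout and stderr end up in `$workdir/stdout.log` and `$workdir/stderr.log`, which loosely mirrors how the worker captures both streams.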

   

Open two terminals, one for the scheduler, and one for the worker.

# In both terminals
cd bistro/bistro
# Start the scheduler in one terminal
./cmake/Debug/server/bistro_scheduler \
  --server_port=6789 --http_server_port=6790 \
  --config_file=scripts/test_configs/simple --clean_statuses \
  --CAUTION_startup_wait_for_workers=1 --instance_node_name=scheduler
# Start the worker in another
mkdir /tmp/bistro_worker
./cmake/Debug/worker/bistro_worker --server_port=27182 --scheduler_host=:: \
  --scheduler_port=6789 --worker_command="$HOME/demo_bistro_task.sh" \
  --data_dir=/tmp/bistro_worker

You should be seeing some lively log activity on both terminals. In several seconds, the worker-scheduler negotiation should complete, and you should see messages like "Task ... quit with status" and "Got status".

Since we passed --clean_statuses, the scheduler will not persist any task completions that happened during this run. The worker, on the other hand, will keep a record of the task logs in /tmp/bistro_worker/task_logs.sql3.
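
The worker's log database can be inspected with the sqlite3 command-line tool. The sketch below rests on assumptions: the table names (stdout, stderr, statuses) and the columns job_id, node_id, line, and time_and_count are taken from worker log output quoted in the issues further down, and it assumes that the stdout and stderr tables share the columns of statuses. `show_task_logs` is a hypothetical helper, not part of Bistro.

```shell
# show_task_logs DB TABLE [LIMIT]: print the most recent rows of a worker
# log table. Hypothetical helper; the schema is an assumption (see above).
show_task_logs() {
  db=$1; table=$2; limit=${3:-5}
  [ -f "$db" ] || { echo "no such database: $db" >&2; return 1; }
  command -v sqlite3 >/dev/null 2>&1 || { echo "sqlite3 CLI not found" >&2; return 1; }
  # Only allow the table names seen in the worker's log output.
  case $table in
    stdout|stderr|statuses) ;;
    *) echo "unknown table: $table" >&2; return 1 ;;
  esac
  sqlite3 "$db" \
    "SELECT job_id, node_id, line FROM $table ORDER BY time_and_count DESC LIMIT $limit;"
}

# Example (after the demo above has run at least one task):
#   show_task_logs /tmp/bistro_worker/task_logs.sql3 stdout
```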

If you want task completions to persist across runs, tell Bistro where to put the SQLite database, via --data_dir=/tmp/bistro_scheduler and --status_table=task_statuses:

mkdir /tmp/bistro_scheduler
./cmake/Debug/server/bistro_scheduler \
  --server_port=6789 --http_server_port=6790 \
  --config_file=scripts/test_configs/simple \
  --data_dir=/tmp/bistro_scheduler --status_table=task_statuses \
  --CAUTION_startup_wait_for_workers=1 --instance_node_name=scheduler

You can query the running scheduler via its REST API:

curl -d '{"a":{"handler":"jobs"},"b":{"handler":"running_tasks"}}' :::6790
curl -d '{"my subquery":{"handler":"task_logs","log_type":"stdout"}}' :::6790

Pro-tip: For ease of reading, pipe the output through either jq or json_pp (from a Perl package). For longer outputs, try | jq -C . | less -R.
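
The query and the pretty-printing step can be wrapped in two small shell functions. This is only a convenience sketch: `pretty_json` and `bistro_query` are hypothetical names, and the default address `:::6790` is simply the one used in the curl commands above.

```shell
# pretty_json: pretty-print JSON from stdin with whichever tool is installed,
# falling back to passing the text through unchanged.
pretty_json() {
  if command -v jq >/dev/null 2>&1; then
    jq .
  elif command -v json_pp >/dev/null 2>&1; then
    json_pp
  else
    cat
  fi
}

# bistro_query JSON [ADDRESS]: POST a query to the scheduler's HTTP port.
bistro_query() {
  curl -s -d "$1" "${2:-:::6790}" | pretty_json
}

# Example (requires a running scheduler):
#   bistro_query '{"a":{"handler":"jobs"},"b":{"handler":"running_tasks"}}'
```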

You should also take a look at the scheduler configuration to see how its jobs, nodes, and resources were specified.

less scripts/test_configs/simple

For debugging, we typically invoke the binaries like this:

gdb cmake/Debug/worker/bistro_worker -ex "r ..." 2>&1 | tee WORKER.txt

When configuring a real deployment, be sure to carefully review the --help of the scheduler & worker binaries, as well as the documentation on http://facebook.github.io/bistro. And don't hesitate to ask for help in the group: https://www.facebook.com/groups/bistro.scheduler

License

See LICENSE.

Issues
  • Trouble when using host physical resources

    Hi,

    I'm trying to use Bistro's ability to discover available physical resources in the toy example from the README, without success. I have run into two problems:

    First, Bistro does not seem to find any of my computer's resources.

    Running:

    ./cmake/Debug/worker/bistro_worker --scheduler_host=:: --scheduler_port=6789 --worker_command="$HOME/demo_bistro_task.sh" --data_dir=/tmp/bistro_worker --nvidia_smi=/usr/bin/nvidia-smi

    Among other things, I get this output (worker terminal):

    I0515 18:47:10.139637 107952 BistroWorkerHandler.cpp:100] Worker is ready: BistroWorker { [...] 7: usableResources (struct) = UsablePhysicalResources { 1: msSinceEpoch (i64) = 0, 2: cpuCores (double) = 0.0, 3: memoryMB (double) = 0.0, 4: gpus (list) = list[0] { }, }, }

    It may only be a synchronization error, since I can get my real computer configuration by running the following steps (the order is important): run bistro_scheduler, then run bistro_worker, then kill bistro_scheduler, then run bistro_scheduler again. Output (scheduler terminal):

    I0515 18:12:31.293471 107047 AutoTimer.h:143] Got 0 running tasks from worker BistroWorker { [...] 7: usableResources (struct) = UsablePhysicalResources { 1: msSinceEpoch (i64) = 1589559121458, 2: cpuCores (double) = 128.0, 3: memoryMB (double) = 128737.11328125, 4: gpus (list) = list[3] { [0] = GPUInfo { 1: name (string) = "GeForce RTX 2080 SUPER", 2: pciBusID (string) = "00000000:01:00.0", 3: memoryMB (double) = 7979.0, 4: compute (double) = 1.0, }, [1] = GPUInfo { 1: name (string) = "GeForce RTX 2080 SUPER", 2: pciBusID (string) = "00000000:21:00.0", 3: memoryMB (double) = 7982.0, 4: compute (double) = 1.0, }, [2] = GPUInfo { 1: name (string) = "GeForce RTX 2080 SUPER", 2: pciBusID (string) = "00000000:4B:00.0", 3: memoryMB (double) = 7982.0, 4: compute (double) = 1.0, }, },

    Second, I'm not able to properly configure bistro_settings with physical resources.

    My setting file is the following:

    {
    
    "bistro_settings" : {
      "resources" : {
        "worker" : {
          "ram" : {
            "limit" : 0,
            "default" : 0
          },
          "cpu" : {
            "limit" : 0,
            "default" : 1
          },
          "gpu" : {
            "limit" : 0,
            "default" : 0
          }
        }
      },
      "nodes" : {
        "levels": ["worker", "level1", "level2"],
        "node_sources": [{
          "source": "manual",
          "prefs": {
            "node1": ["node11", "node12"],
            "node2": ["node21", "node22"]
          }
        }]
      },
      "enabled" : true,
      "physical_resources": {
        "ram_mb": {
            "logical_resource": "ram",
            "multiply_logical_by": 1024,
            "physical_reserve_amount": 4096
        },
        "cpu_core": {
            "logical_resource": "cpu",
            "enforcement": "none"
        },
        "gpu_card": {
            "logical_resource": "gpu"
        }
      },
    },
    
    
    "bistro_job->simple_job" : {
      "owner" : "test",
      "enabled" : true
    }
    }
    

    So, if I have understood the documentation correctly, each of my nodes will require 1 cpu to run simple_job, and I expect the ram, cpu, and gpu limits to be updated according to the worker's capabilities. But when I run the simple example with this configuration file, no jobs are launched, and the scheduler keeps waiting for a worker with enough capacity.

    I would be very grateful for any insight that helps me solve this problem.

    Nathan

    opened by npiasco 7
  • How can achieve HA/Failover of scheduler

    Hi,

    My company is interested in using Bistro for our distributed task system. We are reading Bistro's design and code, and one important factor for us is how to achieve high availability of the scheduler. Can you let me know whether this is implemented? If not, how can I achieve it, and where is the best place to add HA logic?

    Best regards Nathan

    opened by nwong4932 5
  • Docker build not working (FunctionInfo.h missing)

    Hey,

    I just tried compiling bistro through Docker and was not able to complete the process. Somewhere during the build the compiler throws an error that FunctionInfo.h is missing from the fbthrift dependency.

    After digging through the container, I was not able to recover a log file with the error message. It would be great to get some info as to where these messages get stored. Running find /home -type f -name *.log -exec /bin/bash -c "cat {} | grep FunctionInfo" did not return anything useful.

    I did manage to find a probable cause for this issue by searching for references to FunctionInfo.h in the source files (/home/*) and working backwards from there. The Makefile.am in the fbthrift repository includes only the transport/core/TransportRoutingHandler.h and transport/core/ThriftProcessor.h headers. My guess is that FunctionInfo.h should also be included here.

    https://github.com/facebook/fbthrift/blob/76d376e1b7d0f189708ef6438abb40862be360d5/thrift/lib/cpp2/Makefile.am#L182-L184

    opened by menzow 5
  • Building Docker image fails: liblib_bistro_if.a(common_types.cpp.o): undefined reference to symbol

    Hi, I am struggling to build Bistro in Docker, and I am not sure what is wrong. I am quite new to C++, so I really do not know what the issue is or whether it is on my side.

    Here are steps to reproduce:

    git clone https://github.com/facebook/bistro.git
    cd bistro/build && ./fbcode_builder/make_docker_context.py \
        --make-parallelism=$(nproc) \
        --docker-context-dir=../../ \
        --local-repo-dir="../" \
        --os-image="ubuntu:18.04" \
        --gcc-version=7
    cd ../.. && docker build .
    

    You can check my attempt to automate the build: https://github.com/rectorphp/docker-base-bistro-image-builder/blob/master/.github/workflows/public-docker-image.yml

    Here are the most recent (and, I think, the most relevant) logs:

    make[1]: *** Waiting for unfinished jobs....
    [ 46%] Linking CXX executable test_cgroup_resources
    cd /home/bistro/bistro/cmake/Debug/physical/test && /usr/bin/cmake -E cmake_link_script CMakeFiles/test_cgroup_resources.dir/link.txt --verbose=1
    /usr/bin/c++  -g  -rdynamic CMakeFiles/test_cgroup_resources.dir/test_cgroup_resources.cpp.o  -o test_cgroup_resources  -L/home/install/lib -Wl,-rpath,/home/install/lib ../../cmake/deps/gtest-1.8.1/googlemock/gtest/libgtestd.a ../../libfolly_gtest_main.a ../libphysical_lib.a -lfolly -lfmt -lglog -lgflags -lboost_context -lboost_date_time -lboost_regex -lboost_system -lboost_thread -lboost_filesystem -ldouble-conversion -lproxygenhttpserver -lproxygen -lcrypto -lfizz -lpthread -lsqlite3 -lwangle -lssl -lsodium -lz -lzstd -lasync -lconcurrency -lprotocol -lthrift-core -lthriftcpp2 -lthriftmetadata -lthriftprotocol -ltransport -pthread ../../processes/libsubprocess_with_timeout.a ../../utils/libexception_lib.a ../../utils/libutils_lib.a ../../if/liblib_bistro_if.a ../../sqlite/libsqlite_lib.a -lfolly -lfmt -lglog -lgflags -lboost_context -lboost_date_time -lboost_regex -lboost_system -lboost_thread -lboost_filesystem -ldouble-conversion -lproxygenhttpserver -lproxygen -lcrypto -lfizz -lpthread -lsqlite3 -lwangle -lssl -lsodium -lz -lzstd -lasync -lconcurrency -lprotocol -lthrift-core -lthriftcpp2 -lthriftmetadata -lthriftprotocol -ltransport 
    /usr/bin/ld: ../../if/liblib_bistro_if.a(common_types.cpp.o): undefined reference to symbol '_ZN6apache6thrift6detail2st20translate_field_nameEN5folly5RangeIPKcEERsRNS0_8protocol5TTypeERKNS2_26translate_field_name_tableE'
    //home/install/lib/librpcmetadata.so: error adding symbols: DSO missing from command line
    collect2: error: ld returned 1 exit status
    make[2]: *** [physical/test/test_cgroup_resources] Error 1
    physical/test/CMakeFiles/test_cgroup_resources.dir/build.make:102: recipe for target 'physical/test/test_cgroup_resources' failed
    make[2]: Leaving directory '/home/bistro/bistro/cmake/Debug'
    make[1]: *** [physical/test/CMakeFiles/test_cgroup_resources.dir/all] Error 2
    CMakeFiles/Makefile2:4808: recipe for target 'physical/test/CMakeFiles/test_cgroup_resources.dir/all' failed
    make[1]: Leaving directory '/home/bistro/bistro/cmake/Debug'
    Makefile:140: recipe for target 'all' failed
    

    Since I tried to automate the build using GitHub Actions, the full logs are here: https://github.com/rectorphp/docker-base-bistro-image-builder/runs/1149345804?check_suite_focus=true

    opened by JanMikes 4
  • Cannot build Bistro on Debian 9

    When I run run-cmake.sh, I get the following error:

    /bistro/bistro/bistro/cmake$ ./run-cmake.sh Debug
    /bistro/bistro/bistro/cmake/deps /bistro/bistro/bistro/cmake
    /bistro/bistro/bistro/cmake
    Generating Thrift Files
    /bistro /bistro/bistro/bistro
    !!! Unrecognized option: /bistro/bistro/bistro/cmake/../../..
    Usage: thrift [options] file

    opened by arghasen 4
  • Docker build is failing

    Hi, I am trying to install Bistro in the environment below:

    os: ubuntu:16.04
    gcc: 5

    This is blocking me; please suggest a quick workaround for installing any stable version of Bistro.

    Below is the error I am getting in the log file:

    Step 124/126 : RUN make -j '1'
     ---> Running in bfef425bf6ae
    Scanning dependencies of target bistro_lib
    [ 0%] Building CXX object CMakeFiles/bistro_lib.dir/Bistro.cpp.o
    [ 0%] Linking CXX static library libbistro_lib.a
    [ 0%] Built target bistro_lib
    Scanning dependencies of target gtest
    [ 1%] Building CXX object cmake/deps/gtest-1.7.0/CMakeFiles/gtest.dir/src/gtest-all.cc.o
    [ 1%] Linking CXX static library libgtest.a
    [ 1%] Built target gtest
    Scanning dependencies of target gtest_main
    [ 1%] Building CXX object cmake/deps/gtest-1.7.0/CMakeFiles/gtest_main.dir/src/gtest_main.cc.o
    [ 2%] Linking CXX static library libgtest_main.a
    [ 2%] Built target gtest_main
    Scanning dependencies of target subprocess_with_timeout
    [ 2%] Building CXX object processes/CMakeFiles/subprocess_with_timeout.dir/SubprocessOutputWithTimeout.cpp.o
    [ 3%] Linking CXX static library libsubprocess_with_timeout.a
    [ 3%] Built target subprocess_with_timeout
    Scanning dependencies of target processes
    [ 4%] Building CXX object processes/CMakeFiles/processes.dir/AsyncCGroupReaper.cpp.o
    [ 4%] Building CXX object processes/CMakeFiles/processes.dir/AsyncReadPipeRateLimiter.cpp.o
    [ 4%] Building CXX object processes/CMakeFiles/processes.dir/CGroupSetup.cpp.o
    [ 5%] Building CXX object processes/CMakeFiles/processes.dir/TaskSubprocessQueue.cpp.o
    [ 5%] Linking CXX static library libprocesses.a
    [ 5%] Built target processes
    Scanning dependencies of target test_cgroup_setup
    [ 5%] Building CXX object processes/tests/CMakeFiles/test_cgroup_setup.dir/test_cgroup_setup.cpp.o
    [ 6%] Linking CXX executable test_cgroup_setup
    CMakeFiles/test_cgroup_setup.dir/test_cgroup_setup.cpp.o: In function `TestCGroupSetup_TestSetup_Test::TestBody()::{lambda()#1}::operator()() const':
    /home/bistro/bistro/processes/tests/test_cgroup_setup.cpp:72: undefined reference to `facebook::bistro::cgroupSetup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::bistro::cpp2::CGroupOptions const&)'
    CMakeFiles/test_cgroup_setup.dir/test_cgroup_setup.cpp.o: In function `TestCGroupSetup_TestSetup_Test::TestBody()::{lambda()#4}::operator()() const':
    /home/bistro/bistro/processes/tests/test_cgroup_setup.cpp:93: undefined reference to `facebook::bistro::cgroupSetup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::bistro::cpp2::CGroupOptions const&)'
    CMakeFiles/test_cgroup_setup.dir/test_cgroup_setup.cpp.o: In function `TestCGroupSetup_TestSetup_Test::TestBody()::{lambda()#6}::operator()() const':
    /home/bistro/bistro/processes/tests/test_cgroup_setup.cpp:118: undefined reference to `facebook::bistro::cgroupSetup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::bistro::cpp2::CGroupOptions const&)'
    CMakeFiles/test_cgroup_setup.dir/test_cgroup_setup.cpp.o: In function `TestCGroupSetup_TestSetup_Test::TestBody()::{lambda()#7}::operator()() const':
    /home/bistro/bistro/processes/tests/test_cgroup_setup.cpp:139: undefined reference to `facebook::bistro::cgroupSetup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::bistro::cpp2::CGroupOptions const&)'
    CMakeFiles/test_cgroup_setup.dir/test_cgroup_setup.cpp.o: In function `TestCGroupSetup_TestSetup_Test::TestBody()':
    /home/bistro/bistro/processes/tests/test_cgroup_setup.cpp:63: undefined reference to `facebook::bistro::cpp2::CGroupOptions::CGroupOptions()'
    /home/bistro/bistro/processes/tests/test_cgroup_setup.cpp:67: undefined reference to `facebook::bistro::cgroupSetup(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, facebook::bistro::cpp2::CGroupOptions const&)'

    opened by sumitjhaMindtickle 4
  • Sometimes scheduler gets error No initial worker set ID consensus

    Sometimes the scheduler reports this error and does not run tasks:

    W0517 08:44:53.419775 11057 RemoteWorkerRunner.cpp:89] RemoteWorkerRunner initial wait (/home/user/src/bistro/bistro/runners/RemoteWorkerRunner.cpp:75): No initial worker set ID consensus. Waiting for all workers to connect before running tasks.

    It seems to work more consistently if the worker is (fully) started before the scheduler(?)

    Scheduler startup:

    $HOME/src/bistro/bistro/cmake/Debug/server/bistro_scheduler \
      --server_port=6789 \
      --http_server_port=6790 \
      --config_file=/etc/bs/config.json \
      --clean_statuses \
      --CAUTION_startup_wait_for_workers=700 \
      --instance_node_name=scheduler
    

    Worker startup:

    $HOME/src/bistro/bistro/cmake/Debug/worker/bistro_worker \
      --server_port=27182 \
      --scheduler_host=:: \
      --scheduler_port=6789 \
      --worker_command="/etc/bs/default_task.sh" \
      --data_dir=/tmp/bistro_worker
    
    opened by ghost 4
  • Support Encrypted Traffic

    It might just be the docs being in a fresh state but I can't see anything about encrypting inter-node traffic. It would be nice to be able to run Bistro without trusting the network.

    opened by kevincox 4
  • Example doesn't work with Docker-based build

    Hi, thank you for contributing this great tool to Github, I couldn't find any similar tool, not as simple at least.

    However, I have a problem with running example program on Docker build. TL;DR: when trying to connect worker with server, I have an error: 111 (Connection Refused). I've checked the port (with lsof -i) from worker's terminal and server indeed listens on 6789 on both ipv4 and ipv6.

    First, I've had some issues with build of "master" branch. IIRC some build script was using thrift1 command, instead of /home/install/bin/thrift1. I've looked at issue tracker and found this: https://github.com/facebook/bistro/issues/18 , and I used the commit pointed here (044cd9f...). It worked: build finished, even though some tests fail, but binaries were built and they work. For note, my command for making the Docker image: os_image=ubuntu:16.04 gcc_version=5 make_parallelism=2 travis_cache_dir=~/travis_ccache ./fbcode_builder/travis_docker_build.sh &> build_at_$(date +'%Y%m%d_%H%M%S').log

    Then I connected to my image (using instructions from https://github.com/facebook/bistro/blob/master/build/fbcode_builder/README.docker) and tried to run the example from here: https://github.com/facebook/bistro/blob/master/README.md#your-first-bistro-run. I'm running both exactly the same commands as in README, in directory /home/bistro/bistro, on the same docker session, using screen, and worker returns this error:

    W0928 12:12:55.081917   157 BistroWorkerHandler.cpp:666] Waiting for this worker to start listening on ServiceAddress {
      1: ip_or_host (string) = "172.17.0.2",
      2: port (i32) = 27182,
    }: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
    

    I was wondering whether there might be something wrong with my Docker configuration. I installed it using this guide: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-16-04

    Worker log:

    /home/bistro/bistro# ./cmake/Debug/worker/bistro_worker --server_port=27182 --scheduler_host=:: \
    >   --scheduler_port=6789 --worker_command="$HOME/demo_bistro_task.sh" \
    >   --data_dir=/tmp/bistro_worker
    W0928 12:25:39.609571   215 server_socket.cpp:90] Found no 10 interfaces that are not link-local or loopback
    I0928 12:25:39.612613   215 LogWriter.cpp:79] Created table stderr
    I0928 12:25:39.612731   215 LogWriter.cpp:79] Created table stdout
    I0928 12:25:39.612826   215 LogWriter.cpp:79] Created table statuses
    I0928 12:25:39.613024   217 AutoTimer.h:142] Pruned logs with cutoff 1504009539 in 57.89 us
    I0928 12:25:40.873081   215 BistroWorkerHandler.cpp:102] Worker is ready: BistroWorker {
      1: shard (string) = "cc646d054226",
      2: machineLock (struct) = MachinePortLock {
        1: hostname (string) = "cc646d054226",
        2: port (i32) = 27182,
      },
      3: addr (struct) = ServiceAddress {
        1: ip_or_host (string) = "172.17.0.2",
        2: port (i32) = 27182,
      },
      4: id (struct) = BistroInstanceID {
        1: startTime (i64) = 1506601540,
        2: rand (i64) = -6770707008561318671,
      },
      5: heartbeatPeriodSec (i32) = 15,
      6: protocolVersion (i16) = 2,
      7: usableResources (struct) = UsablePhysicalResources {
        1: msSinceEpoch (i64) = 0,
        2: cpuCores (double) = 0,
        3: memoryMB (double) = 0,
        4: gpus (list) = list<struct>[0] {
        },
      },
    }
    W0928 12:25:40.892567   230 BistroWorkerHandler.cpp:666] Waiting for this worker to start listening on ServiceAddress {
      1: ip_or_host (string) = "172.17.0.2",
      2: port (i32) = 27182,
    }: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
    I0928 12:25:41.894337   246 AutoTimer.h:142] Query: 'SELECT job_id, node_id, time_and_count, line FROM statuses WHERE (time_and_count <= 0) ORDER BY time_and_count DESC LIMIT 2'; args: ' in 182 ns
    I0928 12:25:41.894436   246 LogWriter.cpp:220] Got 0 statuses lines
    E0928 12:25:41.895129   230 BistroWorkerHandler.cpp:754] Unable to send heartbeat to scheduler: Channel is !good()
    

    Scheduler log:

    # ./cmake/Debug/server/bistro_scheduler \
      --server_port=6789 --http_server_port=6790 \
      --config_file=scripts/test_configs/simple --clean_statuses \
      --CAUTION_startup_wait_for_workers=1 --instance_node_name=scheduler
    I0928 12:26:42.317178   255 AutoTimer.h:142] Read config from /home/bistro/bistro/scripts/test_configs/simple in 106.4 us
    I0928 12:26:42.317651   255 AutoTimer.h:142] Parsed config with 1 jobs in 352.2 us
    I0928 12:26:42.317860   255 AutoTimer.h:142] Have 7 nodes after manual in 62.42 us
    I0928 12:26:42.318045   258 Monitor.cpp:79] Updating monitor histogram (/home/bistro/bistro/monitor/Monitor.cpp:65): Monitor transiently not making a histogram for simple_job since it is not loaded
    W0928 12:26:42.318713   260 RemoteWorkerRunner.cpp:93] RemoteWorkerRunner initial wait (/home/bistro/bistro/runners/RemoteWorkerRunner.cpp:79): DANGER! DANGER! Your --CAUTION_startup_wait_for_workers of 1 is lower than the max healthcheck gap of 125, which makes it very likely that you will start second copies of tasks that are already running (unless your heartbeat interval is much smaller). No initial worker set ID consensus. Waiting for all workers to connect before running tasks.
    I0928 12:26:42.319443   261 Bistro.cpp:184] Idle wait...
    
    opened by Pand9 3
  • bistro_scheduler startup error Singleton N6wangle12_GLOBAL__N_113PollerContextE requested before registrationComplete() call

    I got the docker build to run, but ctest had 4 tests failed, and bistro_scheduler got error: "wangle...PollerContextE requested before registrationComplete() call". What am I missing?

    Test failure:

    export os_image=ubuntu:16.04
    export gcc_version=5
    make_parallelism=2 ./build/fbcode_builder/travis_docker_build.sh

    $ docker run -it 1e47cff229f0 bash
    /home/bistro/bistro/cmake/Debug$ ctest
    Test project /home/bistro/bistro/cmake/Debug
    Start 1: test_async_read_pipe
    1/56 Test #1: test_async_read_pipe .................. Passed 0.02 sec
    Start 2: test_async_read_pipe_rate_limiter
    ...
    93% tests passed, 4 tests failed out of 56

    Total Test time (real) = 38.80 sec

    The following tests FAILED:
    11 - test_worker (OTHER_FAULT)
    19 - test_thrift_monitor (OTHER_FAULT)
    28 - test_scheduler (OTHER_FAULT)
    51 - test_remote_runner (OTHER_FAULT)
    Errors while running CTest


    bistro_scheduler startup error:

    /home/bistro/bistro# ./cmake/Debug/server/bistro_scheduler \
      --server_port=6789 --http_server_port=6790 \
      --config_file=scripts/test_configs/simple --clean_statuses \
      --CAUTION_startup_wait_for_workers=1 --instance_node_name=scheduler
    I0406 14:21:51.122525 37 AutoTimer.h:142] Read config from /home/bistro/bistro/scripts/test_configs/simple in 89.35 us
    I0406 14:21:51.122921 37 AutoTimer.h:142] Parsed config with 1 jobs in 275.8 us
    I0406 14:21:51.123087 37 AutoTimer.h:142] Have 7 nodes after manual in 48.02 us
    I0406 14:21:51.123237 40 Monitor.cpp:74] Updating monitor histogram (/home/bistro/bistro/monitor/Monitor.cpp:60): Monitor transiently not making a histogram for simple_job since it is not loaded
    W0406 14:21:51.124105 42 RemoteWorkerRunner.cpp:89] RemoteWorkerRunner initial wait (/home/bistro/bistro/runners/RemoteWorkerRunner.cpp:75): DANGER! DANGER! Your --CAUTION_startup_wait_for_workers of 1 is lower than the max healthcheck gap of 125, which makes it very likely that you will start second copies of tasks that are already running (unless your heartbeat interval is much smaller). No initial worker set ID consensus. Waiting for all workers to connect before running tasks.
    I0406 14:21:51.124487 43 Bistro.cpp:184] Idle wait...
    I0406 14:21:51.126633 37 HTTPMonitorServer.cpp:130] Launched HTTP Monitor Server on port 6790, result 0
    F0406 14:21:51.127137 37 Singleton-inl.h:241] Singleton N6wangle12_GLOBAL__N_113PollerContextE requested before registrationComplete() call.
    *** Check failure stack trace: ***
    @ 0x7f3873f8e5cd google::LogMessage::Fail()
    @ 0x7f3873f90433 google::LogMessage::SendToLog()
    @ 0x7f3873f8e15b google::LogMessage::Flush()
    @ 0x7f3873f8e379 google::LogMessage::~LogMessage()
    @ 0x7f38713e7ba2 folly::detail::SingletonHolder<>::createInstance()
    @ 0x7f38713e6b9c folly::detail::SingletonHolder<>::try_get()
    @ 0x7f38713e64d7 folly::Singleton<>::try_get()
    @ 0x7f38713e5790 wangle::FilePoller::init()
    @ 0x7f38713e5654 wangle::FilePoller::FilePoller()
    @ 0x7f3871fc3ccd ZSt11make_uniqueIN6wangle10FilePollerEJRKNSt6chrono8durationIlSt5ratioILl1ELl1EEEEEENSt9_MakeUniqIT_E15__single_objectEDpOT0
    @ 0x7f3871fc2b4d apache::thrift::SecurityKillSwitchPoller::SecurityKillSwitchPoller()
    @ 0x7f3871fc2a89 apache::thrift::SecurityKillSwitchPoller::SecurityKillSwitchPoller()
    @ 0x7f3871fe4dcd apache::thrift::ThriftServer::ThriftServer()
    @ 0x7f3871fe49ef apache::thrift::ThriftServer::ThriftServer()
    @ 0xc796e0 ZN9__gnu_cxx13new_allocatorIN6apache6thrift12ThriftServerEE9constructIS3_JEEEvPT_DpOT0
    @ 0xc78c37 ZNSt16allocator_traitsISaIN6apache6thrift12ThriftServerEEE9constructIS2_JEEEvRS3_PT_DpOT0
    @ 0xc781d4 std::_Sp_counted_ptr_inplace<>::_Sp_counted_ptr_inplace<>()
    @ 0xc7719d ZNSt14__shared_countILN9__gnu_cxx12_Lock_policyE2EEC2IN6apache6thrift12ThriftServerESaIS6_EJEEESt19_Sp_make_shared_tagPT_RKT0_DpOT1
    @ 0xc7629e std::__shared_ptr<>::__shared_ptr<>()
    @ 0xc75676 ZNSt10shared_ptrIN6apache6thrift12ThriftServerEEC2ISaIS2_EJEEESt19_Sp_make_shared_tagRKT_DpOT0
    @ 0xc74462 std::allocate_shared<>()
    @ 0xc73004 ZSt11make_sharedIN6apache6thrift12ThriftServerEJEESt10shared_ptrIT_EDpOT0
    @ 0xc7004e main
    @ 0x7f387019f830 __libc_start_main
    @ 0xc6f6c9 _start
    @ (nil) (unknown)
    Aborted (core dumped)

    opened by ghost 3
  • Can't build on fresh Ubuntu 12.04.5

    Building from the docker image for Ubuntu 12.04.5 fails when attempting to use cmake as the build script installs a version that's too low:

    + sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 50
    update-alternatives: using /usr/bin/gcc-4.8 to provide /usr/bin/gcc (gcc) in auto mode.
    + sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-4.8 50
    update-alternatives: using /usr/bin/g++-4.8 to provide /usr/bin/g++ (g++) in auto mode.
    + CMAKE_NAME=cmake-2.8.12.1
    + GFLAGS_VER=2.1.1
    + GLOG_NAME=glog-0.3.3
    + pushd .
    + git clone https://github.com/google/double-conversion
    /bistro/bistro/build/deps/fbthrift/thrift/build/deps/folly/folly/build/deps /bistro/bistro/build/deps/fbthrift/thrift/build/deps/folly/folly/build/deps
    Cloning into 'double-conversion'...
    + cd double-conversion
    + cmake -DBUILD_SHARED_LIBS=ON .
    CMake Error at CMakeLists.txt:1 (cmake_minimum_required):
      CMake 2.8.12 or higher is required.  You are running version 2.8.7
    

    I then attempted to build on 14.04.4; the build script gets further but fails when trying to fetch from a PPA:

    W: Failed to fetch http://ppa.launchpad.net/boost-latest/ppa/ubuntu/dists/trusty/main/binary-amd64/Packages 404 Not Found

    What am I missing here?

    opened by bloo 3
  • How to optimize Docker image size?

    Hi, I was finally able to build a Docker image with Bistro, but I am a bit worried about its enormous size: roughly 5.2 GB.

    Do you have any tips on how to reduce its size?

    The Dockerfile is generated automatically by fbcode_builder.

    Basically, it consists of repeated download+build+install blocks like this one:

    ### Check out fmtlib/fmt, workdir build ###
    
    USER root
    RUN mkdir -p '/home' && chown 'nobody' '/home'
    USER 'nobody'
    WORKDIR '/home'
    RUN git clone  https://github.com/'fmtlib/fmt'
    USER root
    RUN mkdir -p '/home'/'fmt'/'build' && chown 'nobody' '/home'/'fmt'/'build'
    USER 'nobody'
    WORKDIR '/home'/'fmt'/'build'
    RUN git checkout '6.2.1'
    
    ### Build and install fmtlib/fmt ###
    
    RUN CXXFLAGS="$CXXFLAGS -fPIC -isystem "'/home/install'"/include" CFLAGS="$CFLAGS -fPIC -isystem "'/home/install'"/include" cmake -D'CMAKE_INSTALL_PREFIX'='/home/install' -D'BUILD_SHARED_LIBS'='ON' '..'
    RUN make -j '4' VERBOSE=1 
    RUN make install VERBOSE=1 
    

    I was wondering whether I can somehow remove the build cache. Maybe just rm -rf /fmt (same for every other cloned repository) after the package is installed could help reduce the size.

    Also, I do not usually use C++, so I do not know how it really works internally; if I am mistaken and my idea is stupid, please just correct me 😄 Could we take only the final binaries and copy them into a different, clean Docker image?

    Another idea was to use an Alpine-based image, or some other base image than Ubuntu (quick googling brought me to https://github.com/madduci/docker-cpp-env).

    Could any of this work, or would you suggest something completely different?

    I was thinking about an autoscaling mechanism for Bistro workers on AWS spot instances (maybe even as Lambdas), and for that purpose I want the image to be as thin as possible.
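
    The "final binaries in a clean image" idea above is roughly what Docker's multi-stage builds provide. Note that a separate `RUN rm -rf ...` step would not shrink the image by itself, because earlier layers still contain the deleted files. The following is only a sketch under assumptions: the `ubuntu:18.04` base tag is a guess, and it presumes everything Bistro needs at run time ends up under the `/home/install` prefix seen in the generated blocks, which the generated Dockerfile would have to confirm.

    ```dockerfile
    # Stage 1: build everything exactly as the generated Dockerfile does.
    # The base image tag is an assumption; keep whatever fbcode_builder emits.
    FROM ubuntu:18.04 AS build
    # ... the generated download+build+install blocks go here, unchanged ...

    # Stage 2: start from a clean image and copy only the install prefix.
    # Source checkouts and build directories stay behind in the build stage.
    FROM ubuntu:18.04
    COPY --from=build /home/install /home/install

    # The dependencies are built with BUILD_SHARED_LIBS=ON, so the dynamic
    # loader has to be told where the copied libraries live.
    ENV LD_LIBRARY_PATH=/home/install/lib

    # Packages installed through apt in the build stage are NOT copied;
    # any that are needed at run time would have to be installed here again.
    ```

    Keeping the same distribution in both stages avoids libc mismatches: binaries linked against glibc on Ubuntu generally will not run on a musl-based image such as Alpine unless they are rebuilt there.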

    opened by JanMikes 5
  • Adding Code of Conduct file

    This pull request was created automatically because we noticed your project was missing a Code of Conduct file.

    Code of Conduct files facilitate respectful and constructive communities by establishing expected behaviors for project contributors.

    This PR was crafted with love by Facebook's Open Source Team.

    CLA Signed 
    opened by facebook-github-bot 0
  • Add Code of Conduct

    In the past Facebook didn't promote including a Code of Conduct when creating new projects, and many projects skipped this important document. Let's fix it. :)

    why make this change?: Facebook Open Source provides a Code of Conduct statement for all projects to follow, to promote a welcoming and safe open source community.

    Exposing the COC via a separate markdown file is a standard being promoted by GitHub via the Community Profile in order to meet their Open Source Guide's recommended community standards.

    As you can see, adding this file will improve the bistro community profile checklist and increase the visibility of our COC.

    test plan: Viewing it on my branch (two screenshots, dated 2017-12-03).

    issue: internal task t23481323

    CLA Signed 
    opened by flarnie 0
  • CLI tools are still missing

    "Although Bistro has been in production at Facebook for over 3 years, the present public release is partial, including just the server components. The CLI tools and web UI will be shipping shortly."

    It would be great to see these.

    opened by brett-miller 11
Owner
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
OOX: Out-of-Order Executor library. Yet another approach to efficient and scalable tasking API and task scheduling.

OOX Out-of-Order Executor library. Yet another approach to efficient and scalable tasking API and task scheduling. Try it Requirements: Install cmake,

Intel Corporation 17 Mar 10, 2022
Operating system project - implementing scheduling algorithms and some system calls for XV6 OS

About XV6 xv6 is a modern reimplementation of Sixth Edition Unix in ANSI C for multiprocessor x86 and RISC-V systems. It was created for pedagogical p

Amirhossein Rajabpour 20 May 19, 2022
Px - Single header C++ Libraries for Thread Scheduling, Rendering, and so on...

px 'PpluX' Single header C++(11/14) Libraries Name Code Description px_sched px_sched.h Task oriented scheduler. See more px_render px_render.h Multit

PpluX 440 Jun 30, 2022
A task scheduling framework designed for the needs of game developers.

Intel Games Task Scheduler (GTS) To the documentation. Introduction GTS is a C++ task scheduling framework for multi-processor platforms. It is design

null 418 Jul 25, 2022
Scheduler - Modern C++ Scheduling Library

Scheduler Modern C++ Header-Only Scheduling Library. Tasks run in thread pool. Requires C++11 and ctpl_stl.h in the path. Inspired by the Rufus-Schedu

Spencer Bosma 213 Jul 12, 2022
Modern concurrency for C++. Tasks, executors, timers and C++20 coroutines to rule them all

concurrencpp, the C++ concurrency library concurrencpp is a tasking library for C++ allowing developers to write highly concurrent applications easily

David Haim 1k Aug 10, 2022
Smart queue that executes tasks in threadpool-like manner

execq execq is kind of task-based approach of processing data using threadpool idea with extended features. It supports different task sources and mai

Vladimir (Alkenso) 32 May 24, 2022
Partr - Parallel Tasks Runtime

Parallel Tasks Runtime A parallel task execution runtime that uses parallel depth-first (PDF) scheduling [1]. [1] Shimin Chen, Phillip B. Gibbons, Mic

null 32 Jul 17, 2022
Simple example for running code on VPU from Linux

VPU-example Simple example for running code on VPU from Linux Toggling GPIO2 on a Raspberry Pi, see code.asm Based on https://github.com/ali1234/vcpok

null 18 Aug 2, 2022
An ultra-simple thread pool implementation for running void() functions in multiple worker threads

void_thread_pool.cpp © 2021 Dr Sebastien Sikora. Updated 06/11/2021. What is it? void_thread_pool.cpp is an ultra-simple

Seb Sikora 1 Nov 19, 2021
Simple and fast C library implementing a thread-safe API to manage hash-tables, linked lists, lock-free ring buffers and queues

libhl C library implementing a set of APIs to efficiently manage some basic data structures such as : hashtables, linked lists, queues, trees, ringbuf

Andrea Guzzo 387 Jul 30, 2022
Parallel-hashmap - A family of header-only, very fast and memory-friendly hashmap and btree containers.

The Parallel Hashmap Overview This repository aims to provide a set of excellent hash map implementations, as well as a btree alternative to std::map

Gregory Popovitch 1.5k Aug 9, 2022
🧵 Fast and easy multithreading for React Native using JSI

react-native-multithreading 🧵 Fast and easy multithreading for React Native using JSI. Installation npm install react-native-multithreading npx pod-i

Marc Rousavy 930 Jul 30, 2022
A fast multi-producer, multi-consumer lock-free concurrent queue for C++11

moodycamel::ConcurrentQueue An industrial-strength lock-free queue for C++. Note: If all you need is a single-producer, single-consumer queue, I have

Cameron 7k Aug 6, 2022
A fast single-producer, single-consumer lock-free queue for C++

A single-producer, single-consumer lock-free queue for C++ This mini-repository has my very own implementation of a lock-free queue (that I designed f

Cameron 2.7k Aug 5, 2022
Fast, generalized, implementation of the Chase-Lev lock-free work-stealing deque for C++17

riften::Deque A bleeding-edge lock-free, single-producer multi-consumer, Chase-Lev work stealing deque as presented in the paper "Dynamic Circular Wor

Conor Williams 115 Jul 14, 2022
Light, fast, threadpool for C++20

riften::Thiefpool A blazing-fast, lightweight, work-stealing thread-pool for C++20. Built on the lock-free concurrent riften::Deque. Usage #include "r

Conor Williams 61 Jul 6, 2022
lc is a fast multi-threaded line counter.

Fast multi-threaded line counter in Modern C++ (2-10x faster than `wc -l` for large files)

Pranav 13 Jul 28, 2022
Termite-jobs - Fast, multiplatform fiber based job dispatcher based on Naughty Dogs' GDC2015 talk.

NOTE This library is obsolete and may contain bugs. For maintained version checkout sx library. until I rip it from there and make a proper single-hea

Sepehr Taghdisian 35 Jan 9, 2022