Phoebe

Overview

Idea

Phoeβe (/ˈfiːbi/) wants to add basic artificial intelligence capabilities to the Linux OS.

What problem Phoeβe wants to solve

System-level tuning is a very complex activity: it requires knowledge and expertise in several (all?) of the layers which compose the system, in how they interact with each other, and, quite often, an intimate knowledge of how the various layers are implemented.

Another big aspect of running systems is dealing with failure. Do not think of failure as a machine catching fire; think of it rather as an overloaded system, caused by misconfiguration, which could lead to starvation of the available resources.

In many circumstances, operators are used to dealing with telemetry, live charts, alerts, etc., which help them identify the offending machine(s) and (re)act to fix any potential issues.

However, one question comes to mind: wouldn't it be awesome if the machine could auto-tune itself and provide a self-healing capability to the user? Well, if that is enough to trigger your interest, then this is what Phoeβe aims to provide.

Phoeβe uses system telemetry as the input to its brain and produces a set of settings which get applied to the running system. The decision made by the brain is continuously re-evaluated (subject to the grace_period setting) so as to eventually offer the best possible setup.

Architecture

Phoeβe is designed with a plugin architecture in mind, providing an interface for new functionality to be added with ease.

Plugins are loaded at runtime and registered with the main body of execution. The only requirement is to implement the interface dictated by the plugin_t structure; network_plugin.c is a very good example of how to implement a new plugin for Phoeβe.
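
To make the plugin interface more concrete, here is a minimal sketch of what a new plugin could look like. The callback names and the layout of the structure are assumptions made for illustration only; the authoritative definition is the plugin_t structure in the sources, and network_plugin.c remains the reference implementation.

/* hypothetical_plugin.c - illustrative only: field and function names are
 * assumptions, not the actual plugin_t definition from the phoebe sources.
 * A real plugin would be built as a shared object and loaded at runtime. */
#include <stdio.h>

/* Assumed shape of the interface: an init hook, a hook feeding telemetry
 * into the brain, and a hook applying the chosen settings. */
typedef struct plugin {
    const char *name;
    int (*init)(void);
    int (*collect_stats)(void *stats);
    int (*apply_settings)(const void *settings);
} plugin_t;

static int my_init(void)                  { puts("hypothetical plugin loaded"); return 0; }
static int my_collect(void *stats)        { (void)stats; return 0; }
static int my_apply(const void *settings) { (void)settings; return 0; }

/* Exported symbol looked up by the plugin loader at registration time (name assumed). */
plugin_t plugin = {
    .name           = "hypothetical",
    .init           = my_init,
    .collect_stats  = my_collect,
    .apply_settings = my_apply,
};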

Disclaimer

The mathematical model implemented is a super-basic one, following a machine-learning-101 approach: input * weight + bias. It does not use any fancy techniques, and its complexity is close to zero.
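
Concretely, with the weights and bias documented in settings.json (see the Settings section below), the score for one telemetry sample reduces to a single weighted sum. The sketch below shows that computation; the struct and function names are illustrative and not taken from the phoebe sources.

/* Illustrative sketch of the "input * weight + bias" model, using the weights
 * and bias documented in settings.json. Names are assumptions, not phoebe's API. */
#include <stdio.h>

typedef struct {
    double transfer_rate;
    double drop_rate;
    double errors_rate;
    double fifo_errors_rate;
} sample_t;

typedef struct {
    double transfer_rate_weight;    /* 0.8  in the example settings */
    double drop_rate_weight;        /* 0.1  */
    double errors_rate_weight;      /* 0.05 */
    double fifo_errors_rate_weight; /* 0.05 */
    double bias;                    /* 10   */
} model_t;

static double score(const sample_t *s, const model_t *m)
{
    return s->transfer_rate    * m->transfer_rate_weight
         + s->drop_rate        * m->drop_rate_weight
         + s->errors_rate      * m->errors_rate_weight
         + s->fifo_errors_rate * m->fifo_errors_rate_weight
         + m->bias;
}

int main(void)
{
    model_t m  = { 0.8, 0.1, 0.05, 0.05, 10.0 };
    sample_t s = { 1000.0, 2.0, 0.0, 0.0 };   /* made-up telemetry values */
    printf("score = %f\n", score(&s, &m));
    return 0;
}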

The plan is to eventually migrate towards a model created in TensorFlow and exported for use by Phoeβe, but we are not there yet.

10,000 feet view

The code allows for both training and inference. All the knobs which modify the run-time behavior of the implementation are configurable via the settings.json file, where each parameter is explained in detail.

In the inference case, when a match is found, the identified kernel parameters are configured accordingly.

The inference loop runs every N seconds; the interval is configurable via inference_loop_period. The quicker the system should react to a change in conditions, the smaller the value of inference_loop_period should be.

The code has a dedicated stats-collection thread which periodically collects system statistics and populates the structures used by the inference loop. The statistics are collected every N seconds; the interval is configurable via stats_collection_period. Depending on the overall network demands, stats_collection_period can be made smaller or larger so the system reacts more quickly or more slowly to network events.
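
To make the two cadences concrete, the following sketch shows a stats-collection thread and an inference loop ticking at stats_collection_period and inference_loop_period respectively. It is an assumption about the overall structure, not the actual phoebe code.

/* Illustrative two-loop structure; compile with -pthread. Not phoebe's code. */
#include <pthread.h>
#include <unistd.h>

static double stats_collection_period = 0.5; /* seconds, from settings.json */
static double inference_loop_period   = 1.0; /* seconds, from settings.json */

static void *stats_thread(void *arg)
{
    (void)arg;
    for (;;) {
        /* collect counters (transfer rate, drops, errors, ...) into the
         * shared structures read by the inference loop */
        usleep((useconds_t)(stats_collection_period * 1e6));
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, stats_thread, NULL);

    for (;;) {
        /* run one inference pass over the latest statistics and, if a match
         * is found, apply the corresponding settings */
        usleep((useconds_t)(inference_loop_period * 1e6));
    }
}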

In case a high traffic rate is seen on the network and a matching entry is found, the code will not consider any lower values for a certain period of time; this period is configurable via grace_period in the settings.json file.

That behavior has been implemented to avoid causing too much reconfiguration on the system and to prevent sudden system reconfiguration due to network spikes.
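
A sketch of that grace-period rule: a candidate entry that would lower the settings is ignored until grace_period seconds have elapsed since the last (higher) match was applied. Variable and function names are illustrative, not phoebe's.

/* Illustrative grace-period check; not the actual phoebe implementation. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static double grace_period = 10.0; /* seconds, from settings.json */
static time_t last_match_time;     /* when the last entry was applied */
static double last_matched_rate;   /* transfer rate of that entry */

static bool may_apply(double candidate_rate)
{
    if (candidate_rate >= last_matched_rate)
        return true; /* equal or higher rates are applied immediately */
    /* lower rates are only considered once the grace period has elapsed */
    return difftime(time(NULL), last_match_time) >= grace_period;
}

int main(void)
{
    last_match_time   = time(NULL);
    last_matched_rate = 1000.0;
    printf("apply a lower rate right away? %s\n", may_apply(500.0) ? "yes" : "no");
    return 0;
}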

The code also supports a few approximation functions, selectable via the settings.json file.

The approximation functions adjust the tolerance value (calculated at runtime) to allow the user to further fine-tune the matching criteria. Depending on the chosen approximation function, the matching criteria become narrower or broader.
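
The settings documented below map approx_function values 0 to 4 onto no approximation, square root, power of two, log10 and natural log. A sketch of how such a selector could adjust the runtime-calculated tolerance (function name illustrative):

/* Illustrative mapping of the approx_function setting onto the tolerance;
 * link with -lm. Not the actual phoebe implementation. */
#include <math.h>
#include <stdio.h>

static double approximate_tolerance(double tolerance, int approx_function)
{
    switch (approx_function) {
    case 1:  return sqrt(tolerance);   /* square-root */
    case 2:  return pow(tolerance, 2); /* power-of-two */
    case 3:  return log10(tolerance);  /* log10 */
    case 4:  return log(tolerance);    /* natural log */
    default: return tolerance;         /* 0: no approximation function */
    }
}

int main(void)
{
    /* the same tolerance under each approximation function */
    for (int f = 0; f <= 4; f++)
        printf("approx_function=%d -> %f\n", f, approximate_tolerance(4.0, f));
    return 0;
}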

Settings

Below is a detailed explanation of the configurations available in settings.json, their possible values, and the effect they have. (Note that the snippet is not valid JSON because of the comments; remove the lines starting with double forward slashes before using it.)

{
    "app_settings": {

        // path where application is expecting to find plugins to load
        "plugins_path": "/home/mvarlese/REPOS/ai-poc/bin",

        // max_learning_values: number of values learnt per iteration
        "max_learning_values": 1000,

        // save trained data to file every saving_loop value
        "saving_loop": 10,

        // accuracy: the level of accuracy to find a potential entry
        // given the transfer rate considered.
        //
        // MaxValue: Undefined, MinValue: 0.00..1
        // Probably not very intuitive: a higher number corresponds to
        // a higher accuracy level.
        "accuracy": 0.5,

        // approx_function: the approximation function applied
        // to the calculated tolerance value used to find a
        // matching entry in values.
        //
        // Possible values:
        // 0 = no approximation function
        // 1 = square-root
        // 2 = power-of-two
        // 3 = log10
        // 4 = log
        "approx_function": 0,

        // grace_period: the time which must elapse
        // before applying new settings for a lower
        // transfer rate than the one previously measured.
        "grace_period": 10,

        // stats_collection_period: the time which
        // has to elapse between stats collections.
        // It is expressed in seconds but accepts non-integer
        // values; i.e. 0.5 represents half a second
        "stats_collection_period": 0.5,

        // inference_loop_period: the time which must
        // elapse before running a new inference evaluation
        "inference_loop_period": 1

    },

    "weights":{
        "transfer_rate_weight": 0.8,
        "drop_rate_weight" : 0.1,
        "errors_rate_weight" : 0.05,
        "fifo_errors_rate_weight" : 0.05
    },

    "bias": 10
}
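
Since json-c is one of the build dependencies (see the installation issue further down), a comment-free version of the file above could be read roughly as follows. Only the json-c calls are real API; the surrounding structure is a sketch.

/* Minimal sketch of reading a couple of values from settings.json with json-c.
 * Assumes the comment lines have been removed so the file is valid JSON. */
#include <stdio.h>
#include <json-c/json.h>

int main(void)
{
    struct json_object *root = json_object_from_file("settings.json");
    if (!root) {
        fprintf(stderr, "failed to parse settings.json\n");
        return 1;
    }

    struct json_object *app, *val;
    double accuracy = 0.0, bias = 0.0;

    if (json_object_object_get_ex(root, "app_settings", &app) &&
        json_object_object_get_ex(app, "accuracy", &val))
        accuracy = json_object_get_double(val);

    if (json_object_object_get_ex(root, "bias", &val))
        bias = json_object_get_double(val);

    printf("accuracy=%.2f bias=%.2f\n", accuracy, bias);
    json_object_put(root); /* release the parsed tree */
    return 0;
}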

Building

The PoC code is built using Meson:

$ meson build
$ cd build/
$ meson compile

You can also run debug builds using address or undefined behavior sanitizer:

$ meson build -Db_sanitize=undefined # or -Db_sanitize=address for ASAN
$ cd build/
$ meson compile

There are a few compile-time flags which can be passed to Meson to enable certain code behaviors:

  • print_messages: used to print to stdout only the most important messages (this is the only parameter enabled by default)
  • print_advanced_messages: used for very verbose printing to stdout (useful for debugging purposes)
  • print_table: used to print to stdout all data stored in the different tables maintained by the application
  • apply_changes: this enables the application to actually apply the settings via the sysctl/ethtool commands (a sketch of the equivalent effect appears at the end of this section)
  • check_initial_settings: when enabled, this will prevent the application from applying lower settings than the ones already applied to the system at bootstrap
  • m_threads: when enabled, this will run training using as many threads as available cores on the machine

These flags can be enabled by passing them to the Meson configure step:

$ meson -Dprint_advanced_messages=true -Dprint_table=true build
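
For reference, what apply_changes ultimately amounts to is setting kernel knobs, for example by writing the desired value into the corresponding /proc/sys entry (the programmatic equivalent of sysctl -w). The sketch below is illustrative only; the path and value are examples, and it does not claim to show how phoebe itself invokes sysctl/ethtool.

/* Illustrative only: set a kernel knob by writing to /proc/sys,
 * the programmatic equivalent of `sysctl -w net.core.rmem_max=4194304`.
 * Requires root privileges. */
#include <stdio.h>

static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fputs(value, f) >= 0) ? 0 : -1;
    fclose(f);
    return rc;
}

int main(void)
{
    /* example knob and value only */
    return write_sysctl("/proc/sys/net/core/rmem_max", "4194304") ? 1 : 0;
}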

Running

The code supports multiple modes of operation:

  • Training mode:
./build/src/phoebe -f ./csv_files/rates_trained_data.csv -m training -s settings.json
  • Inference mode:
./build/src/phoebe -f ./csv_files/rates_trained_data.csv -i wlan0 -m inference -s settings.json

Feedback / Input / Collaboration

If you are curious about the project and want more information, please do reach out to [email protected].
I will be more than happy to talk to you about this project and about other initiatives in this area.

Comments
  • meson - room for improvements

    1. scripts - folder is copied empty. Expected: it needs to contain the Python scripts.
    2. Attempting to run scripts/collect_stats.py ends up with Module not found "_phoebe". Expected: the Meson script is supposed to install the needed Python module.
    opened by asmorodskyi 12
  • Refactors and workarounds so testing of collect_stats.py can be enabled in CI

    This pull request contains a series of commits to refactor or work around several blockers that prevented us from testing collect_stats.py in CI.

    • Remove access to sysfs CPU device in collect_stats.py
    • Disable ASAN leak detection when testing collect_stats.py
    • Put libasan in LD_PRELOAD when testing collect_stats.py
    opened by shunghsiyu 8
  • Disable building of shared library needed by collect_stats.py by default

    The _phoebe.abi3.so shared library is used by collect_stats.py to retrieve various system metrics, and the shared library itself requires python3-cffi to generate the binding (_phoebe.c).

    On Linux distributions that don't ship a python3-cffi package this becomes a problem; while there are other ways to install python3-cffi (e.g. pip3 install cffi), doing so is not ideal when building the phoebe package itself.

    Thus the easiest way to deal with the problem is to disable building of _phoebe.abi3.so by default, so that packaging doesn't need python3-cffi, and to explicitly enable it in our CI.

    opened by shunghsiyu 7
  • Add detailed installation instructions

    I tried setting up phoebe on a new Tumbleweed VM today and I had a couple of issues with missing dependencies:

    1. libnl-3.0 was missing
    2. json-c was missing
    3. I had a couple of issues with cmocka. After building it from source, I tried building phoebe and ran into linker issues. Installing RPMs from rpmfind fixed the issue. Could somebody confirm whether this is the expected way to fulfil the cmocka dependency? I haven't used it before and I was wondering if I was missing something in the process.

    It'd be great to have a bit more detailed instructions for building the repository. I could create an INSTALL.md file with more instructions for building on Tumbleweed to begin with.

    opened by mukul-mehta 6
  • Add unit tests

    This is a draft pull request to add unit tests leveraging the cmocka library for the allocateMemoryBasedOnInputAndMaxLearningValues function.

    Currently, the CI fails because allocateMemoryBasedOnInputAndMaxLearningValues has a bug: if the file from which it tries to read the number of lines contains a really long line, it reports a wrong number of lines. Also, it counts one line fewer than there actually are in the file. Is that intentional?

    opened by dcermak 5
  • collect_stats tries to access sysfs

    The collect_stats.py script tries to query the current cpu frequency and governor from sysfs: https://github.com/SUSE/phoebe/blob/7e269fc1e27430d75c0a13b2739f1b49acfcd07f/scripts/collect_stats.py#L187

    Unfortunately, this fails in the github actions with:

    Traceback (most recent call last):
      File "/__w/phoebe/phoebe/scripts/collect_stats.py", line 306, in <module>
        main(sys.argv[1], settings, count)
      File "/__w/phoebe/phoebe/scripts/collect_stats.py", line 271, in main
        collect_stats(
      File "/__w/phoebe/phoebe/scripts/collect_stats.py", line 187, in collect_stats
        with open(SYSFS_CPU_PATH + 'cpu0/cpufreq/scaling_governor') as f:
    FileNotFoundError: [Errno 2] No such file or directory: '/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'
    

    I suspect that this is caused by the GitHub Actions runner not allowing the CI action to query this information, in order to prevent it from modifying the CPU behavior.

    @shunghsiyu I was told you introduced this, can we remove it for the meantime?

    opened by dcermak 5
  • Remove SETUP.md

    SETUP.md contains information about our Lab setup. Unfortunately, the setup is only accessible to SUSE employees.

    While there is nothing secretive about our lab setup, it does not make much sense for this to live directly inside the repository now that the repository itself is public.

    The original content has been moved to https://github.com/SUSE/phoebe/wiki/Lab-Setup instead.

    opened by shunghsiyu 4
  • Disable collector in phoebe.spec

    This is based on top of @asmorodskyi's work in #45.

    I removed Python-related dependencies and disabled the collector in phoebe.spec. Hopefully this is all that's left to do to get the CI happy.

    opened by shunghsiyu 3
  • Improve CI

    This PR includes the following changes:

    • output the testlog directly, so that one sees the failure reason in GitHub Actions and doesn't have to download the artifacts
    • increase the verbosity of mock when building RPMs
    • add caching of the mock root; this should speed up RPM builds considerably
    opened by dcermak 2
  • Improve spec file & docs

    • don't use suse constructs in non-suse distros
    • remove hard dependency on systemd
    • document build options and enable the collector by default
    • initiated by #38
    opened by dcermak 1
  • Add quiet and verbose option

    Added a global verbosity level which refers to a static unsigned int verbosity_level in utility.c. The variable has to be placed there, so that it only exists once in the binaries. For the plugin it has to be set after initialization.

    opened by mslacken 1
  • Towards a proof-of-concept for a wider audience

    I believe this project can gain more traction if we can show a wider audience that adding artificial intelligence capabilities to the Linux OS works, i.e. that auto-tuning does indeed yield better performance. (I'm leaving out the self-healing part for now, as it seems harder than auto-tuning.)

    What I mean by a wider audience is someone with little knowledge of systems and artificial intelligence being able to set up a scenario (or a benchmark), run the project and easily observe that performance improves when auto-tuning is in action (ideally with all of that done by a single script). To achieve that, there are still quite a few challenges ahead.

    First off, the scenario should be easy to set up. Right now we use TREX in our target scenario, which is not the easiest to set up; while its performance is superior, it supports far fewer network interface cards than the Linux kernel. This is easily solvable by switching to other packet generators (e.g. iperf3, sockperf, ab, etc.), and is a minor issue compared to the next one.

    Now, addressing the elephant in the room.

    The core of Phoeβe lies in its brain, the decision-making engine that takes system telemetry as input and outputs a set of system-level parameters that improve the system's performance when applied.

    But so far we have not been able to prove that this most crucial piece of the project works, that is, to show that it can output system settings that do improve the system's performance. This is the second (albeit the major) issue that has prevented us from showcasing the project to a wider audience.


    I hope this proposal makes sense, and if so, perhaps we can proceed to a discussion on how we can improve the decision engine (more data points for csv_files/rates.csv? collecting more metrics for the decision engine?) and have a simpler setup.

    opened by shunghsiyu 1
Releases
  • v0.1.2

Owner
  • SUSE