Overview

Netdata

Netdata is high-fidelity infrastructure monitoring and troubleshooting.
Open-source, free, preconfigured, opinionated, and always real-time.



---

Netdata's distributed, real-time monitoring Agent collects thousands of metrics from systems, hardware, containers, and applications with zero configuration. It runs permanently on all your physical/virtual servers, containers, cloud deployments, and edge/IoT devices, and is perfectly safe to install on your systems mid-incident without any preparation.

You can install Netdata on most Linux distributions (Ubuntu, Debian, CentOS, and more), container platforms (Kubernetes clusters, Docker), and many other operating systems (FreeBSD, macOS). No sudo required.

Netdata is designed by system administrators, DevOps engineers, and developers to collect everything, help you visualize metrics, troubleshoot complex performance problems, and make data interoperable with the rest of your monitoring stack.

People get addicted to Netdata. Once you use it on your systems, there's no going back! You've been warned...

Users who are addicted to Netdata

Latest release: v1.30.0, March 31, 2021

The v1.30.0 release of Netdata brings major improvements to our packaging and completely replaces Google Analytics/GTM for product telemetry. We're also releasing the first changes in an upcoming overhaul to both our dashboard UI/UX and the suite of preconfigured alarms that comes with every installation.


Features

Netdata in action

Here's what you can expect from Netdata:

  • 1s granularity: The highest possible resolution for all metrics.
  • Unlimited metrics: Netdata collects all the available metrics—the more, the better.
  • 1% CPU utilization of a single core: It's unbelievably optimized.
  • A few MB of RAM: The highly-efficient database engine stores per-second metrics in RAM and then "spills" historical metrics to disk for long-term storage.
  • Minimal disk I/O: While running, Netdata only writes historical metrics and reads error and access logs.
  • Zero configuration: Netdata auto-detects everything, and can collect up to 10,000 metrics per server out of the box.
  • Zero maintenance: You just run it. Netdata does the rest.
  • Stunningly fast, interactive visualizations: The dashboard responds to queries in less than 1ms per metric to synchronize charts as you pan through time, zoom in on anomalies, and more.
  • Visual anomaly detection: Our UI/UX emphasizes the relationships between charts to help you detect the root cause of anomalies.
  • Scales to infinity: You can install it on all your servers, containers, VMs, and IoT devices. Metrics are not centralized by default, so there is no limit.
  • Several operating modes: Autonomous host monitoring (the default), headless data collector, forwarding proxy, store and forward proxy, central multi-host monitoring, in all possible configurations. Use different metrics retention policies per node and run with or without health monitoring.

Netdata works with tons of applications, notifications platforms, and other time-series databases:

  • 300+ system, container, and application endpoints: Collectors autodetect metrics from default endpoints and immediately visualize them into meaningful charts designed for troubleshooting. See everything we support.
  • 20+ notification platforms: Netdata's health watchdog sends warning and critical alarms to your favorite platform to inform you of anomalies just seconds after they affect your node.
  • 30+ external time-series databases: Export resampled metrics as they're collected to other local- and Cloud-based databases for best-in-class interoperability.

💡 Want to leverage the monitoring power of Netdata across your entire infrastructure? View metrics from any number of distributed nodes in a single interface and unlock even more features with Netdata Cloud.

Get Netdata


To install Netdata from source on most Linux systems (physical, virtual, container, IoT, edge), run our one-line installation script. This script downloads and builds all dependencies, including those required to connect to Netdata Cloud if you choose, and enables automatic nightly updates and anonymous statistics.

bash <(curl -Ss https://my-netdata.io/kickstart.sh)

To view the Netdata dashboard, navigate to http://localhost:19999, or http://NODE:19999 after replacing NODE with the hostname or IP address of your node.

Docker

You can also try out Netdata's capabilities in a Docker container:

docker run -d --name=netdata \
  -p 19999:19999 \
  -v netdataconfig:/etc/netdata \
  -v netdatalib:/var/lib/netdata \
  -v netdatacache:/var/cache/netdata \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /etc/os-release:/host/etc/os-release:ro \
  --restart unless-stopped \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  netdata/netdata

To view the Netdata dashboard, navigate to http://localhost:19999, or http://NODE:19999 after replacing NODE with the hostname or IP address of your node.

Other operating systems

See our documentation for additional operating systems, including Kubernetes, .deb/.rpm packages, and more.

Post-installation

When you're finished with installation, check out our single-node or infrastructure monitoring quickstart guides based on your use case.

Or, skip straight to configuring the Netdata Agent.

Read through Netdata's documentation, which is structured based on actions and solutions, to enable features like health monitoring, alarm notifications, long-term metrics storage, exporting to external databases, and more.

How it works

Netdata is a highly efficient, highly modular, metrics management engine. Its lockless design makes it ideal for concurrent operations on the metrics.

Diagram of Netdata's core functionality

The result is a highly efficient, low-latency system, supporting multiple readers and one writer on each metric.

Infographic

This is a high-level overview of Netdata features and architecture. Click on it to view an interactive version, which has links to our documentation.

An infographic of how Netdata works

Documentation

Netdata's documentation is available at Netdata Learn.

This site also hosts a number of guides to help newer users better understand how to collect metrics, troubleshoot via charts, export to external databases, and more.

Community

Netdata is an inclusive open-source project and community. Please read our Code of Conduct.

Find most of the Netdata team in our community forums. It's the best place to ask questions, find resources, and engage with passionate professionals.


Contribute

Contributions are the lifeblood of open-source projects. While we continue to invest in and improve Netdata, we need help to democratize monitoring!

  • Read our Contributing Guide, which contains all the information you need to contribute to Netdata, such as improving our documentation, engaging in the community, and developing new features. We've made it as frictionless as possible, but if you need help, just ping us on our community forums!
  • We have a whole category dedicated to contributing and extending Netdata on our community forums
  • Found a bug? Open a GitHub issue.
  • View our Security Policy.

Package maintainers should read the guide on building Netdata from source for instructions on building each Netdata component from source and preparing a package.

License

The Netdata Agent is GPLv3+. Netdata re-distributes other open-source tools and libraries. Please check the third party licenses.

Is it any good?

Yes.

When people first hear about a new product, they frequently ask if it is any good. A Hacker News user remarked:

Note to self: Starting immediately, all raganwald projects will have a “Is it any good?” section in the readme, and the answer shall be “yes.”

Comments
  • what our users say about netdata?

    In this thread we collect interesting (or funny, or just plain) posts, blogs, reviews, articles, etc - about netdata.

    1. don't start discussions on this post
    2. if you want to post, post the link to the original post and a screenshot!
    help wanted 
    opened by ktsaou 116
  • Prometheus Support

    Hey guys,

    I recently started using prometheus and I enjoy the simplicity. I want to begin to understand what it would take to implement prometheus support within Netdata. I think this is a great idea because it allows the distributed fashion of netdata to exist along with having persistence at prometheus. Centralized graphing (not monitoring) can now happen with grafana. Netdata is a treasure trove of metrics already - making this a worthwhile project.

    Prometheus expects a REST endpoint to exist which publishes metrics, labels, and values. It will poll this endpoint at a desired interval and ingest the metrics during that poll.

    To get the ball rolling, how are you currently serving HTTP in Netdata? Is this an embedded sockets server in C?

    opened by ldelossa 108
  • python.d enhancements

    @paulfantom I am writing here a TODO list for python.d based on my findings.

    • [x] DOCUMENTATION in wiki.

    • [x] log flood protection - it will require 2 parameters: logs_per_interval = 200 and log_interval = 3600. So, every hour (this_hour = int(now / log_interval)) it should reset the counter and allow up to logs_per_interval log entries until the next hour (see the sketch after this list).

      This is how netdata does it: https://github.com/firehol/netdata/blob/d7b083430de1d39d0196b82035162b4483c08a3c/src/log.c#L33-L107

    • [x] support ipv6 for SocketService (currently redis and squid)

    • [x] netdata passes the environment variable NETDATA_HOST_PREFIX. cpufreq should use this to prefix sys_dir automatically. This variable is used when netdata runs in a container. The system directories /proc, /sys of the host should be exposed with this prefix.

    • [ ] the URLService should somehow support proxy configuration.

    • [ ] the URLService should support Connection: keep-alive.

    • [x] The service that runs external commands should be more descriptive. Example: running the exim plugin when exim is not installed:

      python.d ERROR: exim_local exim [Errno 2] No such file or directory
      python.d ERROR: exim_local exim [Errno 2] No such file or directory
      python.d ERROR: exim: is misbehaving. Reason:'NoneType' object has no attribute '__getitem__'
      
    • [x] This message should be a debug log No unix socket specified. Trying TCP/IP socket.

    • [x] This message could state where it tried to connect: [Errno 111] Connection refused

    • [x] This message could state the hostname it tried to resolve: [Errno -9] Address family for hostname not supported

    • [x] This should state the job name, not the name:

      python.d ERROR: redis/local: check() function reports failure.
      
    • [x] This should state what the problem is:

      # ./plugins.d/python.d.plugin debug cpufreq 1
      INFO: Using python v2
      python.d INFO: reading configuration file: /etc/netdata/python.d.conf
      python.d INFO: MODULES_DIR='/root/netdata/python.d/', CONFIG_DIR='/etc/netdata/', UPDATE_EVERY=1, ONLY_MODULES=['cpufreq']
      python.d DEBUG: cpufreq: loading module configuration: '/etc/netdata/python.d/cpufreq.conf'
      python.d DEBUG: cpufreq: reading configuration
      python.d DEBUG: cpufreq: job added
      python.d INFO: Disabled cpufreq/None
      python.d ERROR: cpufreq/None: check() function reports failure.
      python.d FATAL: no more jobs
      DISABLE
      
    • [x] ~~There should be a configuration entry in python.d.conf to set the PATH to be searched for commands. By default everything in /usr/sbin/ is not found.~~ Added #695 to do this at the netdata daemon for all its plugins.

    • [x] The default retries in the code, for all modules, is 5 or 10. I suggest making them 60 for all modules. There are many services that cannot be restarted within 5 seconds.

      Made it in #695

    • [x] When a service reports failure to collect data (during update()), there should be a log entry stating the reason for the failure.

    • [x] Handling of incremental dimensions in LogService

    • [x] Better autodetection of disk count in hddtemp.chart.py

    • [ ] Move logging mechanism to utilize logging module.

    more to come...
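
    A minimal sketch of the flood-protection idea from the list above (hypothetical code for illustration, not the actual python.d implementation; the parameter names follow the bullet):

      import time

      class LogFloodProtection:
          def __init__(self, logs_per_interval=200, log_interval=3600):
              self.logs_per_interval = logs_per_interval
              self.log_interval = log_interval
              self.counter = 0
              self.this_interval = int(time.time() / log_interval)

          def allowed(self):
              # reset the counter whenever a new interval (e.g. a new hour) starts
              now_interval = int(time.time() / self.log_interval)
              if now_interval != self.this_interval:
                  self.this_interval = now_interval
                  self.counter = 0
              self.counter += 1
              return self.counter <= self.logs_per_interval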

    area/collectors collectors/python.d 
    opened by ktsaou 100
  • netdata package maintainers

    This issue has been converted to a wiki page

    For the latest info check it here: https://github.com/firehol/netdata/wiki/netdata-package-maintainers


    I think it would be useful to prepare a wiki page with information about the maintainers of netdata for the Linux distributions, automation systems, containers, etc.

    Let's see who is who:


    Official Linux Distributions

    | Linux Distribution | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Arch Linux | Release | @svenstaro | netdata @ Arch Linux |
    | Arch Linux AUR | Git | @sanskritfritz | netdata @ AUR |
    | Gentoo Linux | Release + Git | @candrews | netdata @ gentoo |
    | Debian | Release | @lhw @FedericoCeratto | netdata @ debian |
    | Slackware | Release | @willysr | netdata @ slackbuilds |
    | Ubuntu | | | |
    | Red Hat / Fedora / Centos | | | |
    | SuSe / openSuSe | | | |


    FreeBSD

    System|Initial PR|Core Developer|Package Maintainer
    :-:|:-:|:-:|:-:
    FreeBSD|#1321|@vlvkobal|@mmokhi


    MacOS

    System|URL|Core Developer|Package Maintainer
    :-:|:-:|:-:|:-:
    MacOS Homebrew Formula|link|@vlvkobal|@rickard-von-essen


    Unofficial Linux Packages

    | Linux Distribution | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Ubuntu | Release | @gslin | netdata @ gslin ppa https://github.com/firehol/netdata/issues/69#issuecomment-217458543 |


    Embedded Linux

    | Embedded Linux | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | ASUSTOR NAS | ? | William Lin | https://www.asustor.com/apps/app_detail?id=532 |
    | OpenWRT | Release | @nitroshift | openwrt package |
    | ReadyNAS | Release | @NAStools | https://github.com/nastools/netdata |
    | QNAP | Release | QNAP_Stephane | https://forum.qnap.com/viewtopic.php?t=121518 |
    | DietPi | Release | @Fourdee | https://github.com/Fourdee/DietPi |


    Linux Containers

    | Containers | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Docker | Git | @titpetric | https://github.com/titpetric/netdata |


    Automation Systems

    | Automation Systems | Netdata Version | Maintainer | Related URL |
    | :-: | :-: | :-: | :-- |
    | Ansible | git | @jffz | https://galaxy.ansible.com/jffz/netdata/ |
    | Chef | ? | @sergiopena | https://github.com/sergiopena/netdata-cookbook |


    If you know other maintainers of distributions that should be mentioned, please help me complete the list...

    cc: @mcnewton @philwhineray @alonbl @simonnagl @paulfantom

    area/packaging area/docs 
    opened by ktsaou 95
  • new prometheus format

    Based on the recent discussion on #1497 with @brian-brazil, this PR changes the format in which netdata sends metrics to prometheus.

    One of the key differences between netdata and traditional time-series solutions is that it organises metrics into hosts, each having collections of metrics called charts.

    charts

    Each chart has several properties (common to all its metrics):

    chart_id - it serves 3 purposes: it defines the chart application (e.g. mysql), the application instance (e.g. mysql_local or mysql_db2) and the chart type (e.g. mysql_local.io, mysql_db2.io). However, there is another format: disk_ops.sda (it should be disk_sda.ops). There is issue #807 to normalize these better, but until then, this is how netdata works today.

    chart_name - a more human friendly name for chart_id.

    context - this is the same as above, with the application instance removed. So it is mysql.io or disk.ops. Alarm templates use this.

    family is the submenu of the dashboard. Unfortunately, this is again used differently in several cases. For example, disks and network interfaces use the name of the disk or network interface. But mysql uses it just to group multiple charts together, and postgres uses both (it groups charts and provides different sections for different databases).

    units is the units for all the metrics attached to the chart.

    dimensions

    Then each chart contains metrics called dimensions. All the dimensions of a chart have the same units of measurement and should be contextually in the same category (i.e. the metrics for disk bandwidth are read and write, and they are both in the same chart).


    So, there are hosts (multiple netdata instances), each has its own charts, each with its own dimensions (metrics).

    The new prometheus format

    The old format netdata used for prometheus was: CHART_DIMENSION{instance="HOST"}

    The new format depends on the data source requested. netdata supports the following data sources:

    • as collected or raw, to send the raw values collected
    • average, to send averages
    • sum or volume to send sums

    The default is the one defined in netdata.conf: [backend].data source = average (changing netdata.conf changes the format for prometheus too). However, prometheus may directly ask for a specific data source by appending &source=SOURCE to the URL (SOURCE being one of the options above).

    When the data source is as collected or raw, the format of the metrics is:

    CONTEXT_DIMENSION{chart="CHART",family="FAMILY",instance="HOSTNAME"}
    

    In all other cases (average, sum, volume), it is:

    CONTEXT{chart="CHART",family="FAMILY",dimension="DIMENSION",instance="HOSTNAME"}
    

    The above format fixes #1519

    time range

    When the data source is average, sum or volume, netdata has to decide the time-range it will calculate the average or the sum.

    The first time a prometheus server hits netdata, netdata will respond with the time frame defined in [backend].update every. But for all queries after the first, netdata remembers the last time it was accessed and responds with the time range since the last time prometheus asked for metrics.

    Each netdata server can respond to multiple prometheus servers. It remembers the last time it was accessed, for each prometheus IP requesting metrics. If the IP is not good enough to distinguish prometheus servers, each prometheus may append &server=PROMETHEUS_NAME to the URL. Then netdata will remember the last time it was accessed for each PROMETHEUS_NAME given.
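
    As an illustration of the two URL parameters above, here is a quick way to fetch averaged metrics manually (a sketch; it assumes the agent's standard /api/v1/allmetrics endpoint on the default port, and my_prometheus is just an example server name):

      from urllib.request import urlopen

      # ask the agent for averaged metrics, identifying ourselves as "my_prometheus"
      url = ("http://localhost:19999/api/v1/allmetrics"
             "?format=prometheus&source=average&server=my_prometheus")
      with urlopen(url) as response:
          for line in response.read().decode().splitlines():
              # each line is a prometheus sample in the CONTEXT{chart=...,family=...,dimension=...} form shown above
              print(line)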

    instance="HOSTNAME"

    instance="HOSTNAME" is sent only if netdata is called with format=prometheus_all_hosts. If netdata is called with format=prometheus, the instance is not added to the metrics.

    host tags

    Host tags are configured in netdata.conf, like this:

    [backend]
        host tags = tag1="value1",tag2="value2",...
    

    Netdata includes this line at the top of the response:

    netdata_host_tags{tag1="value1",tag2="value2"} 1 1499541463610
    

    The tags are not processed by netdata. Anything set at the host tags config option is just copied. netdata propagates host tags to masters and proxies when streaming metrics.

    If the netdata response includes multiple hosts, netdata_host_tags also includes instance="HOSTNAME".

    opened by ktsaou 93
  • Redis python module + minor fixes

    1. Nginx is shown as nginx: local in the dashboard while using the python or bash module.
    2. NetSocketService changed name to SocketService, which now can use unix sockets as well as TCP/IP sockets
    3. changed and tested new python shebang (yes it works)
    4. fixed issue with wrong data parsing in exim.chart.py
    5. changed whitelisting method in ExecutableService. It is very probable that whitelisting is not needed, but I am not sure.
    6. Added redis.chart.py

    I have tested this and it works.

    After merging this I need to take a break from rewriting modules to python. There are only 3 modules left, but I don't have any data to create opensips.chart.py as well as nut.chart.py (so I cannot code parsers). I also need to do some more research to create ap.chart.py since using iw isn't the best method.

    opened by paulfantom 90
  • New journal disk based indexing for agent memory reduction

    Summary

    The agent requires a lot of memory to index pages and to track how they map to the actual files that store metrics.

    • Produce a new journal index file that the agent will MMAP and use that instead of creating all the entries in memory

    File structure

    The new file based index has a structure that allows quick access to the needed metadata. The file structure consists of:

    • File header
    • List of extents
    • List of unique metric identifiers (sorted)
    • Detailed page info for each metric (page @ time information)

    During agent startup, the journal replay only needs to create the necessary pages (unique metrics), which is very fast (initial tests indicate that it is ~100x faster than the current journal replay). This is aided by the fact that individual pages are not created in memory during startup but only when needed (during data queries).

    Pages that are no longer needed (evicted from the cache) are removed. They will also be removed when unused for more than 10 minutes.

    You can see the number of descriptors in memory under netdata.dbengine_long_term_page_stats, journal v2 descriptors.
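
    To illustrate why a sorted, memory-mapped index allows fast lookups without materialising every descriptor in memory, here is a toy sketch (the record layout below is invented for the example and is not the actual journal v2 on-disk format):

      import mmap
      import struct

      RECORD = struct.Struct("<16sQQ")   # hypothetical record: metric UUID, offset, page count

      def find_metric(index_path, uuid_bytes):
          with open(index_path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as idx:
              lo, hi = 0, len(idx) // RECORD.size
              while lo < hi:                               # binary search over the sorted records
                  mid = (lo + hi) // 2
                  rec_uuid, offset, count = RECORD.unpack_from(idx, mid * RECORD.size)
                  if rec_uuid == uuid_bytes:
                      return offset, count                 # page info located without a full scan
                  if rec_uuid < uuid_bytes:
                      lo = mid + 1
                  else:
                      hi = mid
          return None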

    Creation of new journal index files

    When the agent starts, it will check if a new index file exists for each journal file that needs to be processed. If it exists, it will use that instead. If the index file does not exist, it will replay the old journal file, use that information to create the new journal file, and start using it immediately. The agent will generate new index files for all journals except the last (active) one.

    New datafiles while the agent is running

    When a new datafile / journal pair is created the agent will check and create a new journal index file for the journal that was just completed.

    Known issues

    • New journal creation may not trigger index creation for the last journal file due to a race condition (pending transactions)

    Other fixes

    This PR also fixes:

    • [x] Bug in replication where overlapping time ranges were replicated unnecessarily
    • [x] Bug in streaming compression where under certain conditions corrupted data were offered for parsing
    • [x] Children connecting to a parent without compression were disabling compression globally for the host. Now compression is globally disabled only when there is a compression error.
    • [x] DBENGINE was, under certain conditions, allowing past time ranges to be injected into the database, resulting in overlapping data pages. After this PR, DBENGINE only allows future data to be stored, relative to the last data collection time.
    Test Plan
    area/packaging area/docs area/web area/health area/collectors area/daemon area/database area/streaming collectors/plugins.d 
    opened by stelfrag 86
  • How to install openvpn plugin

    Question summary

    Hi, I'm new to servers and this is the first time I have installed a Debian 9 server on a VPS. I installed openvpn with the openvpn-install script. I tried to install a few monitoring tools for my server, but they always failed. Then I found netdata and it works like a charm. The install script is wonderful ;) To monitor my openvpn server, do I have to do something with these files: python.d.plugin, ovpn_status_log.chart.py, python.d/ovpn_status_log.conf? I don't see any tutorial, so can anyone guide me on what to do?

    OS / Environment

    Debian 9 64bit

    Component Name

    openvpn

    Expected results

    see openvpn traffic

    Regards, Przemek

    question area/collectors collectors/python.d 
    opened by PrzemekSkw 83
  • Prototype: monitor disk space usage.

    This is just a prototype for discussing some questions at this point.

    This will fix issues #249 and #74 when implemented properly.

    Questions

    1. Should we really implement this in proc_diskstats.c? It does not get its values from proc. I implemented it there because the file system data is already there and it produces a graph in this section.
    2. Shall we use statvfs (only mounted filesystems) or statfs (every filesystem)? If we use statfs, we have to query mountinfo.

    TODO

    • [x] Only add charts for filesystems where disk space is available
    • [x] Do not allocate and free the statvfs buffer all the time
    • [ ] Add this feature to the wiki
    • [x] Make unit more readable (TB, GB, MB depending on filesystem size)
    • [x] Do not display disk metrics for containers, only for disks

    opened by simonnagl 80
  • python.d modules configuration documentation

    I suggest adding this header to all python.d/*.conf files:

    # netdata python.d.plugin configuration for ${MODULE}
    #
    # This file is in YaML format. Generally the format is:
    #
    # name: value
    #
    # There are 2 sections:
    #  - global variables
    #  - one or more JOBS
    #
    # JOBS allow you to collect values from multiple sources.
    # Each source will have its own set of charts.
    #
    # JOB parameters have to be indented (example below).
    #
    # ----------------------------------------------------------------------
    # Global Variables
    # These variables set the defaults for all JOBs, however each JOB
    # may define its own, overriding the defaults.
    #
    # update_every sets the default data collection frequency.
    # If unset, the python.d.plugin default is used.
    # update_every: 1
    #
    # priority controls the order of charts at the netdata dashboard.
    # Lower numbers move the charts towards the top of the page.
    # If unset, the default for python.d.plugin is used.
    # priority: 60000
    #
    # retries sets the number of retries to be made in case of failures.
    # If unset, the default for python.d.plugin is used.
    # Attempts to restore the service are made once every update_every
    # and only if the module has collected values in the past.
    # retries: 10
    #
    # ----------------------------------------------------------------------
    # JOBS (data collection sources)
    #
    # The default JOBS share the same *name*. JOBS with the same name
    # are mutually exclusive. Only one of them will be allowed running at
    # any time. This allows autodetection to try several alternatives and
    # pick the one that works.
    #
    # Any number of jobs is supported.
    #
    # All python.d.plugin JOBS (for all its modules) support a set of
    # predefined parameters. These are:
    #
    # job_name:
    #     name: myname     # the JOB's name as it will appear at the
    #                      # dashboard (by default is the job_name)
    #                      # JOBs sharing a name are mutually exclusive
    #     update_every: 1  # the JOB's data collection frequency
    #     priority: 60000  # the JOB's order on the dashboard
    #     retries: 10      # the JOB's number of restoration attempts
    #
    # Additionally to the above, ${MODULE} also supports the following.
    #
    

    where ${MODULE} is the name of each module.

    area/docs 
    opened by ktsaou 75
  • Major docker build refactor

    1. Unify Dockerfiles and move them from top-level dir to docker
    2. Add run.sh script as a container entrypoint
    3. Introduce docker builder stage (previously used only in alpine image)
    4. Removed Dockerfile parts from Makefile.am
    5. Allow passing custom options to netdata as a docker CMD parameter (bonus from using ENTRYPOINT script)
    6. Run netdata as user netdata with static UID of 201 and /usr/sbin/nologin as shell
    7. Use multiarch/alpine as a base for all images.
    8. One Dockerfile for all platforms

    Initially I got an uncompressed image size reduction from 276MB to 111MB, and also size reductions for the other images:

    $ docker image ls
    REPOSITORY    TAG       SIZE     COMPRESSED
    netdata       i386      112MB    42MB
    netdata       amd64     111MB    41MB
    netdata       armhf     104MB    39MB
    netdata       aarch64   107MB    39MB
    

    Images are built with the ./docker/build.sh command.

    Resolves #3972

    opened by paulfantom 74
  • [Feat]: Minimize resources utilization

    Problem

    With PR #14125 we made a big step forward in minimizing the memory footprint of the Netdata agent and making sane use of system resources, especially when the agent runs at scale (very busy parent nodes).

    The following items are my observations to make the agent better:

    Threads

    When the agent serves hundreds of nodes, it still wastes a lot of memory and cpu. The problem is mainly the number of threads we spawn. Today, for each child connecting to a parent we use 6 threads:

    1. receiver, to get the metrics from the children
    2. sender, to push the metrics to another parent
    3. health, to check for alerts
    4. ACLK sync, to connect to cloud
    5. ML training, to train new models
    6. ML detection

    With a few hundred nodes, each thread needs stack memory (8MB on Linux) and contributes significantly to context switches (which can grow to 100k per second with a few hundred children).

    All threads except the first should be eliminated:

    1. sender, is mainly idle just dispatching data to a socket - we could have only 1 sender for all children.
    2. health, is also mainly idle and even if there are thousands of alarms it is quite fast and the increased latency will not affect the quality of the result.
    3. ACLK sync, can also become just 1 thread for all children.
    4. ML training and ML prediction per child really kill performance. Users should be able to configure how many ML threads they want to serve all the children. This is probably the most complex to accomplish, because to get the full speed of the db with a single thread, queries need to be pipelined (like replication does in PR #14125 - replication actually cannot get better speed above 5 threads - the optimum is 5 threads, running at about 10 million points replicated per second).

    Memory Operations

    malloc() is fast. But at scale there is huge contention on memory operations. Millions of them happen per second. In PR #14125 we cached everything that could be cached (the "buffers" chart in "dbengine memory" section) and dbengine2 now performs a lot better even at scale, but still memory operations are made.

    ML and Judy are the main offenders for memory operations now.

    For ML, I have asked @vkalintiris to eliminate all allocations. For Judy I am experimenting with an alternative that could be used in dbengine2 (https://gist.github.com/ktsaou/a08bf8a06f4808b9a223ec5960337244 - I call it July for fun... - it has the same interface as Judy, its allocations can be cached, but it can only be good when items are appended to the hashtable, like dbengine2 does - otherwise it wastes a lot of cpu).

    Streaming

    Now that the replication protocol is stable, we should switch streaming to push interpolated data to parents (like replication does), instead of collected metrics. This will be a huge improvement on CPU utilization on busy parents and it does not seem a big change.

    SQLite3

    We use sqlite3 to have metadata survive across netdata restarts. But sqlite3 is extremely slow for the needs of netdata. Synchronously committing or loading data from SQL should be avoided when possible.

    When data in sqlite3 are only needed on netdata restart, it is probably better to commit SQL statements to a text file which we will run on netdata startup to update sqlite3.
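
    A minimal sketch of that idea (hypothetical file names and flow, not netdata code): append statements to a plain-text SQL log at runtime, and replay it into sqlite3 only on startup.

      import sqlite3

      SQL_LOG = "metadata-pending.sql"

      def queue_statement(stmt):
          # runtime path: no synchronous sqlite3 work, just an append to a text file
          with open(SQL_LOG, "a") as f:
              f.write(stmt.rstrip(";") + ";\n")

      def replay_on_startup(db_path="metadata.db"):
          con = sqlite3.connect(db_path)
          try:
              with open(SQL_LOG) as f:
                  con.executescript(f.read())   # apply everything queued since the last run
              con.commit()
              open(SQL_LOG, "w").close()        # truncate the log after a successful replay
          except FileNotFoundError:
              pass                              # nothing was queued
          finally:
              con.close()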

    Description

    See above.

    Importance

    must have

    Value proposition

    Described above.

    Proposed implementation

    No response

    feature request needs triage 
    opened by ktsaou 0
  • [Bug]: Unable to connect to netdata cloud

    Bug description

    Netdata agent is unable to connect to the netdata cloud

    Expected behavior

    Netdata agent should be able to connect to the netdata cloud

    Steps to reproduce

    1. Install as usual using the kickstart script
    2. claim agent
    3. check netdatacli aclk-state ...

    Installation method

    kickstart.sh

    System info

    root@test:~# uname -a; grep -HvE "^#|URL" /etc/*release
    Linux test 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
    /etc/lsb-release:DISTRIB_ID=Ubuntu
    /etc/lsb-release:DISTRIB_RELEASE=22.04
    /etc/lsb-release:DISTRIB_CODENAME=jammy
    /etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
    /etc/os-release:PRETTY_NAME="Ubuntu 22.04.1 LTS"
    /etc/os-release:NAME="Ubuntu"
    /etc/os-release:VERSION_ID="22.04"
    /etc/os-release:VERSION="22.04.1 LTS (Jammy Jellyfish)"
    /etc/os-release:VERSION_CODENAME=jammy
    /etc/os-release:ID=ubuntu
    /etc/os-release:ID_LIKE=debian
    /etc/os-release:UBUNTU_CODENAME=jammy
    

    Netdata build info

    Version: netdata v1.37.0-107-nightly
    Configure options:  '--build=x86_64-linux-gnu' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--libdir=/usr/lib' '--libexecdir=/usr/libexec' '--with-user=netdata' '--with-math' '--with-zlib' '--with-webdir=/var/lib/netdata/www' '--disable-dependency-tracking' 'build_alias=x86_64-linux-gnu' 'CFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -Wl,-z,relro' 'CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2' 'CXXFLAGS=-g -O2 -ffile-prefix-map=/usr/src/netdata=. -fstack-protector-strong -Wformat -Werror=format-security'
    Install type: binpkg-deb
        Binary architecture: x86_64
        Packaging distro:  
    Features:
        dbengine:                   YES
        Native HTTPS:               YES
        Netdata Cloud:              YES 
        ACLK:                       YES
        TLS Host Verification:      YES
        Machine Learning:           YES
        Stream Compression:         YES
    Libraries:
        protobuf:                YES (system)
        jemalloc:                NO
        JSON-C:                  YES
        libcap:                  NO
        libcrypto:               YES
        libm:                    YES
        tcalloc:                 NO
        zlib:                    YES
    Plugins:
        apps:                    YES
        cgroup Network Tracking: YES
        CUPS:                    YES
        EBPF:                    YES
        IPMI:                    YES
        NFACCT:                  YES
        perf:                    YES
        slabinfo:                YES
        Xen:                     NO
        Xen VBD Error Tracking:  NO
    Exporters:
        AWS Kinesis:             NO
        GCP PubSub:              NO
        MongoDB:                 NO
        Prometheus Remote Write: YES
    Debug/Developer Features:
        Trace Allocations:       NO
    

    Additional info

    Everything can be found on this forum post - https://community.netdata.cloud/t/netdata-agent-is-unable-to-connect-with-netdata-cloud/3588

    bug needs triage 
    opened by smcoder0707 0
  • [Feat]: update netdata infographic to show ML parts

    Problem

    We need to add the ML stuff to the infographic

    Description

    Update infographic to show ML features of agent.

    Importance

    nice to have

    Value proposition

    ...

    Proposed implementation

    No response

    area/docs area/ml 
    opened by andrewm4894 0
  • readme updates

    Summary

    Draft PR to update our main README in a few places where needed.

    discussion in here: https://github.com/netdata/website/issues/95

    Test Plan
    Additional Information
    For users: How does this change affect me?
    opened by andrewm4894 3
  • Assorted infrastructure cleanup.

    Summary

    This is a catchall for assorted cleanup of old infrastructure files in this repository. Most of these should have been removed when the relevant CI was shut down.

    Test Plan

    n/a

    area/ci area/packaging area/docs area/web area/tests no changelog area/build 
    opened by Ferroin 0
  • update ml defaults and readme

    Summary

    This PR updates the below default params for the [ml] section in netdata.conf

    • maximum num samples to train from 14400 (4 hours) to 21600 (6 hours).
    • train every from 3600 (1 hour) to 10800 (3 hours).
    • number of models per dimension from 1 to 9.

    The aim of this is to have anomaly detection in Netdata by default (in steady state, once all models are trained) covering the last 24+ hours in terms of "what it is trained on".

    Test Plan

    This PR will be created as a draft first, and results from tests and experiments from internal dogfooding will be added to the PR to provide a log of testing, results, and motivation for this change.

    area/docs area/ml 
    opened by andrewm4894 3