[Bug]: Offline node instances cause clutter and confusion #1119

@ralphm

Bug description

Every node in Netdata has one or more node instances, each the combination of the Node ID and the Claim ID of the Agent that represents the node. For nodes connected directly to Cloud (hops = 0), the Claim ID is that of the node's own Agent; for nodes connected through a Parent (hops > 0, including virtual nodes), it is the Parent Agent's Claim ID. Metrics data can be available at any of the node instances, and alerts are fired on behalf of a node instance. Cloud ensures proper query routing and alert aggregation.

The node instances with hops > 0 can currently be in the following states:

  • Online: the node is available (the child/leaf Agent is running and providing new metrics) and queryable (there is metrics data available for querying).
  • Stale: the node is not available, but there's still previous data stored that can be queried.
  • Offline: the node is not available and there's no metrics data to be queried.
  • Pruned: the node is not available, not queryable, and not included in requests for lists of node instances.

The offline state can be achieved in two ways:

  1. The Agent where the node instance resides actively signals Cloud that the node is not queryable there anymore, either because it was explicitly marked as removed or because its metrics data has expired from retention.
  2. The Agent where the node instance resides is not connected to Cloud itself (offline).

A node instance enters the pruned state when it has been offline for more than 60 days.
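As an illustration of the current model, the state definitions above can be written as a small classification function. This is a minimal sketch, not Cloud's actual implementation: the `NodeInstance` fields, the `current_state` function, and the handling of the undefined "available but not queryable" combination are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# From the description: a node instance is pruned after more than 60 days offline.
PRUNE_AFTER = timedelta(days=60)


class State(Enum):
    ONLINE = "online"
    STALE = "stale"
    OFFLINE = "offline"
    PRUNED = "pruned"


@dataclass
class NodeInstance:
    # All field names are hypothetical; they mirror the terms used in the issue.
    node_id: str
    claim_id: str
    hops: int
    available: bool          # child/leaf Agent is running and providing new metrics
    queryable: bool          # metrics data is available for querying at this instance
    agent_connected: bool    # the Agent identified by claim_id is connected to Cloud
    offline_since: Optional[datetime] = None


def current_state(ni: NodeInstance, now: datetime) -> State:
    # Offline is reached either because the Agent signalled "not queryable"
    # or because the Agent holding the instance is itself disconnected.
    # Assumption: "available but not queryable" is also treated as offline.
    if not ni.agent_connected or not ni.queryable:
        if ni.offline_since is not None and now - ni.offline_since > PRUNE_AFTER:
            return State.PRUNED
        return State.OFFLINE
    return State.ONLINE if ni.available else State.STALE
```
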

The problem is that the offline state brings no value to the user: the Agent no longer has any data to query, and the leftover node instance just confuses users, for example when a child Agent is moved to stream to a different set of parent Agents.

Additionally, the pruned state has no clear function internally.

Expected behavior

  1. The record for a node instance with hops > 0 should be removed completely when its Agent explicitly signals Cloud that the node is offline henceforth.
  2. The pruned state should not exist.

Node instances can then only be offline in two cases:

  • The node is claimed directly to Cloud, but not currently connected (hops = 0).
  • The Agent representing the node (a Parent) is not currently connected (hops > 0).
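The expected behavior could look roughly like the sketch below. It is only an illustration: the in-memory `store`, the `on_explicit_offline` handler, and `is_offline` are hypothetical names, not existing Cloud code. The point it shows is that an explicit signal deletes the record for hops > 0, so the only remaining way to be offline is that the representing Agent is not connected.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Key = Tuple[str, str]  # (node_id, claim_id) identifies a node instance


@dataclass
class NodeInstance:
    # Hypothetical record shape, for illustration only.
    node_id: str
    claim_id: str
    hops: int


def on_explicit_offline(store: Dict[Key, NodeInstance], node_id: str, claim_id: str) -> bool:
    """Handle an Agent's explicit 'node is offline henceforth' signal.

    Returns True if the node instance record was removed.
    """
    key = (node_id, claim_id)
    ni = store.get(key)
    if ni is None:
        return False
    if ni.hops > 0:
        # Proposed behavior: drop the record completely instead of
        # keeping it around in the offline state.
        del store[key]
        return True
    return False


def is_offline(ni: NodeInstance, connected_claim_ids: Set[str]) -> bool:
    # With records removed on explicit signals, a remaining instance is
    # offline only when the Agent that represents it is not connected.
    return ni.claim_id not in connected_claim_ids
```

With this shape, moving a child to a different set of parents and marking it as removed on the old Parent leaves no record behind for the old node instance.
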

Steps to reproduce

  1. Have an Agent stream to a parent.
  2. Stop streaming to that parent.
  3. Explicitly mark that node as removed on the Parent, using netdatacli.

Screenshots

No response

Error Logs

No response

Desktop

No response

Additional context

After implementing this, the database needs to be cleaned up to remove all node instances with hops > 0 in the state offline, as well as all node instances in the state pruned.
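The cleanup rule can be expressed as a simple predicate over stored records. The sketch below assumes a hypothetical row shape with `hops` and `state` fields; the real schema and the way rows are deleted will differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class NodeInstanceRow:
    # Hypothetical row shape; column names are assumptions.
    node_id: str
    claim_id: str
    hops: int
    state: str  # "online", "stale", "offline", or "pruned"


def should_delete(row: NodeInstanceRow) -> bool:
    # Remove every pruned instance, and every offline instance with hops > 0.
    return row.state == "pruned" or (row.state == "offline" and row.hops > 0)


def rows_to_delete(rows: Iterable[NodeInstanceRow]) -> List[NodeInstanceRow]:
    return [row for row in rows if should_delete(row)]
```

Offline instances with hops = 0 are deliberately kept, since a directly claimed node that is merely disconnected remains a valid offline case.
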

Existing events for node-state-offline should have event.reason filled with the reason the node instance is considered offline: either the parent Agent told us so, or Cloud decided based on the state of the parent Agent.

It may be useful to introduce events for the creation and removal of node instances (again with event.reason).
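One possible shape for such events is sketched below. Only the event type `node-state-offline` and the `reason` field come from this issue; the reason values, the other field names, and the `offline_event` helper are hypothetical.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class OfflineReason(str, Enum):
    # Hypothetical reason values covering the two cases described above.
    AGENT_SIGNALLED = "agent-signalled"          # the parent Agent told Cloud so
    CLOUD_INFERRED = "cloud-inferred-from-parent"  # Cloud decided from the parent Agent's state


@dataclass
class Event:
    type: str
    node_id: str
    claim_id: str
    reason: str


def offline_event(node_id: str, claim_id: str, reason: OfflineReason) -> dict:
    # Build a plain dict so the event can be serialized as-is.
    return asdict(Event("node-state-offline", node_id, claim_id, reason.value))
```

Events for the creation and removal of node instances could reuse the same `Event` shape with their own type and reason values.
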
