Energy Monitoring in Distributed Systems: A Practical Approach

February 24, 2025

The increasing adoption of cloud-native architectures and microservices has made robust observability frameworks essential for maintaining performance and ensuring efficient operation. However, while traditional observability tools effectively track metrics like latency, throughput and error rates, they frequently overlook energy consumption. This blog post examines how integrating energy monitoring tools (RAPL and Kepler) with OpenTelemetry provides granular energy insights from process level to system level, offering a holistic view of performance. It details an integration methodology, addresses energy observability challenges (accuracy, granularity, standardisation, overhead), and explores optimisations for energy-efficient cloud computing.

The Role of Observability in Achieving Energy Efficiency in Distributed Cloud Systems

With the rapid growth of distributed systems and cloud-native computing that combines public clouds with local or on-premises clouds, observability has become a fundamental component of modern software engineering. Energy efficiency is an equally critical concern, especially as the demand for computational resources continues to increase. The proliferation of cloud computing, IoT devices and large-scale data centres has sharply increased energy consumption. This increase raises operational costs and also presents significant environmental concerns due to the carbon footprint associated with high energy usage. As a result, organisations increasingly recognise the need for robust energy monitoring mechanisms that provide real-time information on energy consumption patterns across their distributed architectures. With this information they can identify inefficiencies and optimise resource allocation, reducing both operating costs and environmental impact.

At the same time, distributed multi-domain observability is challenged by heterogeneous tools and data across cloud providers, leading to complex management and high data volumes. Integrating public and private clouds requires a strategic approach to centralised visibility and problem response. This challenge is being addressed through unified observability architectures, composed of distinct components, each handling specific aspects of monitoring hybrid cloud infrastructures while managing large data volumes and meeting security/regulatory requirements.

Scalability is another critical aspect: as organisations grow, scaling up energy monitoring tools becomes a challenge due to performance bottlenecks and management complexity. Existing solutions, often designed for smaller deployments, can require costly upgrades or replacements for larger infrastructures, making effective energy management difficult. In addition, the increasing complexity of these tools, which include features such as real-time analytics and predictive modelling, requires specialised expertise to interpret the data and make informed decisions that improve energy efficiency, especially when automated workload optimisation solutions are lacking.

Consequently, a successful energy monitoring strategy must be scalable and must visualise, in real time, the consumption of the different infrastructure resources. Solutions that adapt seamlessly to growing infrastructures and offer configurable, comprehensive interfaces are therefore crucial to support better decision-making in energy management.

Accordingly, a centralised observability platform should ideally provide:

Mechanisms to collect, process and visualise data from various public and private cloud clusters, providing a unified view.
Lightweight agents distributed across clusters to collect and transmit metrics to the platform, ensuring consistent observability regardless of cluster type.
Data normalisation mechanisms, using standardised metrics and logs (including a defined taxonomy for essential data such as latency, resource utilisation and energy consumption), and enabling uniform data processing and enrichment with metadata (e.g., cluster origin, environment) for efficient correlation.
Multi-cluster configurable dashboards that provide real-time aggregated views of cluster status, enabling rapid error detection.
Dynamic alerting mechanisms based on behaviour that adapts to cluster variability, together with automated systems that autonomously address detected problems through actions such as service rescaling or workload migration, for optimal load balancing and energy reduction.
Distributed tracing to analyse request flows between services and clusters, which is crucial for troubleshooting problems in distributed environments.
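
As an illustration of the behaviour-based alerting item above, here is a minimal Python sketch. It assumes one simple technique, a rolling mean plus k standard deviations over recent power samples; this is only one way to make a threshold adapt to cluster variability, and the class name, window size and parameters are hypothetical rather than part of the proposed platform.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Flag a power sample that exceeds the rolling mean by k standard deviations."""

    def __init__(self, window: int = 30, k: float = 3.0, min_samples: int = 5):
        self.history = deque(maxlen=window)  # most recent samples only
        self.k = k
        self.min_samples = min_samples

    def observe(self, watts: float) -> bool:
        """Return True if the sample looks anomalous, then add it to the window."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            baseline = mean(self.history)
            spread = pstdev(self.history)
            anomalous = watts > baseline + self.k * spread
        # Simplification: anomalous samples also enter the window.
        self.history.append(watts)
        return anomalous
```

Because the threshold is derived from each cluster's own recent history, a noisy cluster tolerates wider swings than a stable one. A real deployment would more likely express such a rule in the alerting layer of the monitoring stack than in application code.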

The proposed solution is based on OpenTelemetry, a robust open-source observability framework that allows energy metrics to be integrated into distributed systems, enhancing our ability to monitor and analyse energy consumption patterns. OpenTelemetry is a project of the Cloud Native Computing Foundation (CNCF) and defines an industry standard for telemetry data management, which the solution uses as the basis for its data collection and processing capabilities. This standardised approach enables the collection, processing and export of metrics, logs and traces in a consistent, vendor-independent manner. However, conventional observability does not inherently provide insights into energy consumption, an essential factor for optimising resource efficiency and sustainability. This blog post aims to bridge that gap by presenting the integration of OpenTelemetry with energy monitoring tools.

Meanwhile, tools like RAPL and Kepler provide the energy measurements themselves, at the CPU level and at the container and node level respectively:

RAPL: Intel’s Running Average Power Limit provides an interface for measuring CPU energy consumption at fine-grained intervals.
Kepler: A Kubernetes-based energy profiler that integrates with the Linux kernel and eBPF to track energy usage across containers and nodes.
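
To make the RAPL description concrete, the following Python sketch converts two readings of a cumulative RAPL energy counter into average power. It assumes the counter is exposed through the Linux powercap sysfs tree (the `energy_uj` and `max_energy_range_uj` files); the `intel-rapl:0` domain path is an assumption that varies by machine, reading it may require elevated privileges, and the function names are hypothetical.

```python
from pathlib import Path

# Assumed location of the package-0 RAPL domain in the Linux powercap tree;
# the domain index and its availability vary by machine and kernel.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter_uj(domain: Path = RAPL_DOMAIN) -> int:
    """Read the cumulative energy counter in microjoules."""
    return int((domain / "energy_uj").read_text())


def read_max_range_uj(domain: Path = RAPL_DOMAIN) -> int:
    """Read the value at which the counter wraps back to zero."""
    return int((domain / "max_energy_range_uj").read_text())


def average_power_watts(start_uj: int, end_uj: int,
                        elapsed_s: float, max_range_uj: int) -> float:
    """Average power between two readings, tolerating a single wraparound."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped once between the two samples.
        delta_uj += max_range_uj
    return delta_uj / 1_000_000 / elapsed_s
```

In use, a collector would call `read_counter_uj()` twice, a known interval apart, and pass both values with `read_max_range_uj()` to `average_power_watts`. Sampling often enough that the counter wraps at most once per interval keeps the correction valid.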

The following image shows a high-level architecture of the proposed solution for monitoring energy consumption in distributed systems:

Integrating energy data with OpenTelemetry involves three key steps: instrumentation (adding energy data to existing telemetry), exporting data (using the OpenTelemetry Collector to aggregate and send energy data to platforms like Prometheus/Grafana), and context propagation (linking energy metrics to specific service requests for detailed analysis).
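
The third step, linking energy to specific requests, can be sketched as follows. This Python example assumes a deliberately simple attribution model: a node-level energy reading for one interval is split across requests in proportion to the CPU time each consumed, keyed by the trace identifier carried in the request context. The model, the `RequestSample` type and the field names are illustrative assumptions, not the OpenTelemetry API and not necessarily the method used by the proposed solution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    trace_id: str       # identifier carried by the request's trace context
    cpu_seconds: float  # CPU time the request consumed during the interval


def attribute_energy(node_joules: float,
                     samples: list[RequestSample]) -> dict[str, float]:
    """Split a node-level energy reading across requests by CPU-time share."""
    total_cpu = sum(s.cpu_seconds for s in samples)
    if total_cpu <= 0:
        # Nothing ran, so no energy can be attributed to any request.
        return {s.trace_id: 0.0 for s in samples}
    shares: dict[str, float] = {}
    for s in samples:
        joules = node_joules * s.cpu_seconds / total_cpu
        # A trace may appear in several samples; accumulate its share.
        shares[s.trace_id] = shares.get(s.trace_id, 0.0) + joules
    return shares
```

CPU time is only a proxy: it ignores memory, I/O and idle power, so the result is an estimate that is useful for comparing requests rather than an exact measurement.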

In the figure we can see a centralised observability platform capable of collecting, processing and visualising data from multiple clusters, regardless of whether they are in public (AWS, Azure, GCP, etc.) or private clouds, allowing a unified view.

The OpenTelemetry Collector is the central component of the framework and acts as a versatile proxy for receiving, processing and forwarding telemetry data. Lightweight monitoring agents (Telemetry Agents) are distributed across the clusters and configured according to the specific capabilities and characteristics of each node where they are deployed. They are in charge of collecting metrics (Producers) and sending them to the centralised platform (Telemetry Controller), so that every cluster, whether public or private, falls under the same observability umbrella. At the same time, data standardisation mechanisms must be in place to establish uniform standards for the metrics, logs and traces collected in each cluster.

To this end, a taxonomy should be defined that identifies essential system data (latency, CPU/memory usage, errors, etc.) as well as energy metrics such as consumption, and specifies how these will be labelled and processed. This allows data enrichment, ensuring that metrics and logs include relevant metadata, such as the originating cluster, the application or the environment, to facilitate their subsequent correlation and analysis. Finally, multi-cluster monitoring dashboards (Grafana) provide aggregated, real-time views of the status of all clusters, allowing infrastructure managers or orchestrators to make better decisions based on information about energy consumption.
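
The taxonomy and enrichment described above might look like the following Python sketch. The raw metric names, the standard name `energy_consumption_joules` and the label keys are all hypothetical placeholders chosen for illustration. The output line follows the general shape of the Prometheus text exposition format.

```python
from dataclasses import dataclass, field

# Hypothetical taxonomy: raw source name -> (standard name, divisor to joules).
TAXONOMY = {
    "rapl_energy_uj": ("energy_consumption_joules", 1_000_000),
    "container_energy_j": ("energy_consumption_joules", 1),
}


@dataclass
class Metric:
    name: str
    value: float
    labels: dict[str, str] = field(default_factory=dict)


def normalise(raw_name: str, raw_value: float, *, cluster: str,
              environment: str, application: str) -> Metric:
    """Map a raw reading onto the taxonomy and enrich it with metadata."""
    if raw_name not in TAXONOMY:
        raise ValueError(f"metric {raw_name!r} is not in the taxonomy")
    standard_name, divisor = TAXONOMY[raw_name]
    labels = {
        "cluster": cluster,          # originating cluster
        "environment": environment,  # e.g. prod or staging
        "application": application,
        "source": raw_name,          # keeps the original name for traceability
    }
    return Metric(standard_name, raw_value / divisor, labels)


def to_exposition_line(metric: Metric) -> str:
    """Render the metric as a single text line with sorted labels."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(metric.labels.items()))
    return f"{metric.name}{{{pairs}}} {metric.value}"
```

Once every source is converted to the same name and unit, readings from different clusters can be summed or compared directly, and the metadata labels make it possible to filter them by cluster, application or environment.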

Note that the Telemetry Agent supports local storage of metrics using Prometheus, while the OpenTelemetry Collector uses Thanos for long-term metric storage because of its support for Prometheus and advanced features such as unlimited retention and data compaction. Log storage is handled by Loki, selected for its compatibility and ease of use, and Grafana serves as the visualisation platform for the collected data. For trace management, the solution chosen is Jaeger, an open-source distributed tracing system.

In conclusion, the integration of OpenTelemetry with energy monitoring tools such as RAPL and Kepler enables full observability of energy consumption in distributed systems, based on open source solutions. However, some challenges remain, such as the standardisation of energy telemetry APIs and the reduction of instrumentation overhead to improve adoption in large-scale cloud environments.

Author

Atos – EVIDEN

Jesús Benedicto is a Systems Analyst/Software Architect at ATOS (EVIDEN) BDS Research & Development with more than 18 years of technological experience in software design and integration in industrial markets, specialising in Big Data and Cloud-Edge-IoT technologies. He mainly focuses on innovation in digital transformation and smart automation in the manufacturing and telco sectors.
