This workshop aims to show you how to make a Java application fully observable with:

During this workshop we will use OpenTelemetry and a Grafana stack:

We will also cover:

Application High Level Design

The Easy Pay System

API Gateway

Centralises all API calls.

Easy Pay Service

Payment microservice which accepts (or declines) payments.

This is how it validates every payment:

  1. Check the POS (Point of Sale) number
  2. Check the credit card number
  3. Check the credit card type
  4. Check the payment threshold: for this, it calls the Smart Bank Gateway for authorization

If the payment is validated, it stores it and broadcasts it to all the other microservices through Kafka.

Fraud detection Service

After fetching a message from the Kafka topic, this service searches its database to check whether the payment's card number is registered for fraud.

If it is, only a WARN log is emitted.

Merchant Back Office Service

For this lab, it only simulates the subscription to messages.

Smart Bank Gateway

This external service authorizes the payment.

Our fully observable platform

The Easy Pay observable System

Short explanation

As mentioned earlier, our observability stack is composed of:

In addition, the microservices are started with an agent to broadcast their telemetry to the collector.

Icons

You will see the following icons during the workshop:

🛠️ An action to perform,
📝 A file to modify,
👀 Something to observe,
✅ Validate something,
ℹ️ Some information.

Skills

Skill: Level

REST API: proficient
Java: proficient
Gradle: novice
Spring Framework, Boot, Cloud Config, Cloud Gateway: proficient
Docker: novice
Grafana stack: novice
Prometheus: novice
Kafka: novice

Tools

If you want to execute this workshop locally

You MUST have set up these tools first:

🛠️ Here are the commands to validate your environment:

Java

$ java -version

openjdk version "21.0.3" 2024-04-16 LTS
OpenJDK Runtime Environment Temurin-21.0.3+9 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (build 21.0.3+9-LTS, mixed mode, sharing)

Gradle

🛠️ If you use the Gradle wrapper, you won't have any trouble. Otherwise:

$ gradle -version

------------------------------------------------------------
Gradle 8.7
------------------------------------------------------------

Build time:   2024-03-22 15:52:46 UTC
Revision:     650af14d7653aa949fce5e886e685efc9cf97c10

Kotlin:       1.9.22
Groovy:       3.0.17
Ant:          Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM:          21.0.3 (Eclipse Adoptium 21.0.3+9-LTS)
OS:           Linux 5.15.146.1-microsoft-standard-WSL2 amd64

Docker Compose

$ docker compose version
    
Docker Compose version v2.24.7

If you don't want to bother with a local setup

It's strongly recommended to use Gitpod or GitHub Codespaces. You must create an account first. You can then open this project either in your local VS Code or directly in your browser:

Open in Gitpod

We will assume you will use GitPod for this workshop :)

When a message invites you to make a URL public, select and validate it.

Start the infrastructure

The "infrastructure stack" is composed of the following components:

ℹ️ If you run your application on GitPod, the following steps are automatically performed during the provisioning of your GitPod environment.

🛠️ Otherwise, to run it on your desktop, execute the following commands:

$ bash scripts/download-agent.sh
$ ./gradlew tasks
$ docker compose up -d --build --remove-orphans

✅ To check if all the services are up, you can run this command:

$ docker compose ps -a

And check the status of every service.

For instance:

❯ docker compose ps
NAME                                               IMAGE                                         COMMAND                  SERVICE                   CREATED         STATUS                        PORTS
api-gateway                                        api-gateway:latest                            "java -javaagent:/op..."   api-gateway               3 minutes ago   Up 2 minutes (healthy)        0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
config-server                                      config-server:latest                          "java -javaagent:/op..."   config-server             3 minutes ago   Up 3 minutes (healthy)        0.0.0.0:8888->8888/tcp, :::8888->8888/tcp
discovery-server                                   discovery-server:latest                       "java -javaagent:/op..."   discovery-server          3 minutes ago   Up 3 minutes (healthy)        0.0.0.0:8761->8761/tcp, :::8761->8761/tcp
easypay-service                                    easypay-service:latest                        "java -cp app:app/li..."   easypay-service           3 minutes ago   Up 2 minutes (healthy)        
fraudetect                                         fraudetect-service:latest                     "java -javaagent:/op..."   fraudetect-service        3 minutes ago   Up 2 minutes (healthy)        
kafka                                              confluentinc/cp-kafka:7.6.1                   "/etc/confluent/dock..."   kafka                     3 minutes ago   Up 3 minutes (healthy)        9092/tcp, 0.0.0.0:19092->19092/tcp, :::19092->19092/tcp
merchant-backoffice                                merchant-backoffice:latest                    "java -javaagent:/op..."   merchant-backoffice       3 minutes ago   Up 2 minutes (healthy)        
observability-workshop-grafana-1                   grafana/grafana:latest                        "sh -xeuc 'mkdir -p ..."   grafana                   3 minutes ago   Up 3 minutes                  0.0.0.0:3000->3000/tcp, :::3000->3000/tcp
observability-workshop-loki-1                      grafana/loki:latest                           "/usr/bin/loki -conf..."   loki                      3 minutes ago   Up 3 minutes                  0.0.0.0:3100->3100/tcp, :::3100->3100/tcp
observability-workshop-opentelemetry-collector-1   otel/opentelemetry-collector-contrib:latest   "/otelcol-contrib --..."   opentelemetry-collector   3 minutes ago   Up 3 minutes                  0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp, 55678-55679/tcp
observability-workshop-postgres-easypay-1          postgres:16                                   "docker-entrypoint.s..."   postgres-easypay          3 minutes ago   Up 3 minutes (healthy)        0.0.0.0:5432->5432/tcp, :::5432->5432/tcp
observability-workshop-postgres-fraudetect-1       postgres:16                                   "docker-entrypoint.s..."   postgres-fraudetect       3 minutes ago   Up 3 minutes (healthy)        0.0.0.0:5434->5432/tcp, [::]:5434->5432/tcp
observability-workshop-postgres-merchantbo-1       postgres:16                                   "docker-entrypoint.s..."   postgres-merchantbo       3 minutes ago   Up 3 minutes (healthy)        0.0.0.0:5435->5432/tcp, [::]:5435->5432/tcp
observability-workshop-postgres-smartbank-1        postgres:16                                   "docker-entrypoint.s..."   postgres-smartbank        3 minutes ago   Up 3 minutes (healthy)        0.0.0.0:5433->5432/tcp, [::]:5433->5432/tcp
observability-workshop-prometheus-1                prom/prometheus:latest                        "/bin/prometheus --c..."   prometheus                3 minutes ago   Up 3 minutes                  0.0.0.0:9090->9090/tcp, :::9090->9090/tcp
observability-workshop-pyroscope-1                 grafana/pyroscope:latest                      "/usr/bin/pyroscope ..."   pyroscope                 12 hours ago    Exited (255) 38 minutes ago   0.0.0.0:4040->4040/tcp, :::4040->4040/tcp
observability-workshop-tempo-1                     grafana/tempo:latest                          "/tempo -config.file..."   tempo                     3 minutes ago   Up 3 minutes                  0.0.0.0:3200->3200/tcp, :::3200->3200/tcp, 0.0.0.0:9095->9095/tcp, :::9095->9095/tcp, 0.0.0.0:9411->9411/tcp, :::9411->9411/tcp
smartbank-gateway                                  smartbank-gateway:latest                      "java -Xmx4g -cp app..."   smartbank-gateway         3 minutes ago   Up 2 minutes (healthy)

Validation

✅ Open the Eureka website started during the infrastructure setup. The following instances should be registered with Eureka:

✅ All services should be registered before continuing...

🛠️ You can now access our platform to initiate a payment:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=25000

✅ You should get the following content:

HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:42:12 GMT
Location: http://172.19.25.95:44523/payments/3cd8df14-8c39-460b-a429-dc113d003aed
transfer-encoding: chunked

{
    "amount": 25000,
    "authorId": "5d364f1a-569c-4c1d-9735-619947ccbea6",
    "authorized": true,
    "bankCalled": true,
    "cardNumber": "5555567898780008",
    "cardType": "MASTERCARD",
    "expiryDate": "789456123",
    "paymentId": "3cd8df14-8c39-460b-a429-dc113d003aed",
    "posId": "POS-01",
    "processingMode": "STANDARD",
    "responseCode": "ACCEPTED",
    "responseTime": 414
}

Some functional issues

🛠️ It's time to have a look at our easypay-service logs!
The service is started in a Docker container. To get its output, you can use the following command:

$ docker compose logs -f easypay-service

One of our customers raised an issue:

« When I reach your API, I usually get either an AMOUNT_EXCEEDED or INVALID_CARD_NUMBER error. »

Normally, the first thing to do is to check the logs. But before that, let's reproduce this behavior.

🛠️ You can exercise the API as follows:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=51000

HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:45:40 GMT
Location: http://172.19.25.95:44523/payments/5459b20a-ac91-458f-9578-019c05483bb3
transfer-encoding: chunked

{
    "amount": 51000,
    "authorId": "6ace318f-b669-4e4a-b366-3f09048becb7",
    "authorized": false,
    "bankCalled": true,
    "cardNumber": "5555567898780008",
    "cardType": "MASTERCARD",
    "expiryDate": "789456123",
    "paymentId": "5459b20a-ac91-458f-9578-019c05483bb3",
    "posId": "POS-01",
    "processingMode": "STANDARD",
    "responseCode": "AUTHORIZATION_DENIED",
    "responseTime": 25
}
$  http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780007 expiryDate=789456123 amount:=25000

HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:46:09 GMT
Location: http://172.19.25.95:44523/payments/2dbf3823-fb11-4c63-a540-ab43ac663e68
transfer-encoding: chunked

{
    "amount": 51000,
    "authorId": null,
    "authorized": false,
    "bankCalled": false,
    "cardNumber": "5555567898780007",
    "cardType": null,
    "expiryDate": "789456123",
    "paymentId": "2dbf3823-fb11-4c63-a540-ab43ac663e68",
    "posId": "POS-01",
    "processingMode": "STANDARD",
    "responseCode": "INVALID_CARD_NUMBER",
    "responseTime": 5
}

👀 Look into the console logs to pinpoint these issues.

Additional logs

It's time to add more contextual information into our code!

In this workshop we will use SLF4J. It is a logging facade that provides a simple API to log messages and can be bound to different logging frameworks (Logback, Log4j, etc.).

The logger can be created by adding a static class variable such as:

  private static final Logger log = LoggerFactory.getLogger(PaymentResource.class);

Remember to use the enclosing class when instantiating the logger!

What about log levels?

Use the most appropriate log level

The log level is a fundamental concept in logging. Whatever logging framework you use, it allows you to tag log records according to their severity or importance. For instance, SLF4J offers the following log levels by default: TRACE, DEBUG, INFO, WARN and ERROR.

We can also log payment requests and responses to provide even more context, which could be helpful for following requests in this workshop.
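As a reminder of how SLF4J is used, here is a minimal, illustrative sketch (the method and messages are hypothetical, not the actual workshop code):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PaymentResource {

    private static final Logger LOG = LoggerFactory.getLogger(PaymentResource.class);

    void handlePayment(String posId, int amount) {
        // DEBUG: verbose details, usually disabled in production
        LOG.debug("Validating payment request for POS {}", posId);
        // INFO: normal business events
        LOG.info("Payment request received: posId={}, amount={}", posId, amount);
        // WARN: unexpected but recoverable situations
        LOG.warn("Amount {} exceeds the authorized threshold", amount);
        // ERROR: failures that require attention
        LOG.error("Unable to process payment for POS {}", posId);
    }
}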

Add logs

📝 Modify the easypay-service/src/main/java/com/worldline/easypay/payment/boundary/PaymentResource.java class by uncommenting all // LOG.... lines (keep MDC lines for later 😉).

Ze technical issue

Another issue was raised for the POS (Point of Sale) POS-02 (but we didn't know it yet!).

🛠️ When you reach the API using this command:

http POST :8080/api/easypay/payments posId=POS-02 cardNumber=5555567898780008 expiryDate=789456123 amount:=25000

👀 You should get the following output:

HTTP/1.1 500
[...]

{
  "error":"Internal Server Error",
  "path": "/payments",
  "status": 500,
  [...]
}

👀 You should then get the following log message:

2024-06-05T15:45:35.215+02:00 ERROR 135386 --- [easypay-service] [o-auto-1-exec-7] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed: java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "java.util.List.get(int).active" is null] with root cause

java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "java.util.List.get(int).active" is null
        at com.worldline.easypay.payment.control.PosValidator.isActive(PosValidator.java:34) ~[main/:na]
        at com.worldline.easypay.payment.control.PaymentService.process(PaymentService.java:46) ~[main/:na]
        at com.worldline.easypay.payment.control.PaymentService.accept(PaymentService.java:108) ~[main/:na]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
        at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:354) ~[spring-aop-6.1.6.jar:6.1.6]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196) ~[spring-aop-6.1.6.jar:6.1.6]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-6.1.6.jar:6.1.6]
    [...]

Let's fix it!

📝 To find the root cause, first add a log entry in the easypay-service/src/main/java/com/worldline/easypay/payment/control/PosValidator.java class. In the isActive() method, catch the exception and log the error:

public boolean isActive(String posId) {
    PosRef probe = new PosRef();
    probe.posId = posId;
    try {
        List<PosRef> posList = posRefRepository.findAll(Example.of(probe));

        if (posList.isEmpty()) {
            LOG.warn("checkPosStatus NOK, unknown posId {}", posId);
            return false;
        }

        boolean result = posList.get(0).active;

        if (!result) {
            LOG.warn("checkPosStatus NOK, inactive posId {}", posId);
        }
        return result;
    } catch (NullPointerException e) {
        LOG.warn("Invalid value for this POS: {}", posId);
        throw e;
    }
}

🛠️ You can redeploy the easypay-service with the following command:

$ docker compose up -d --build easypay-service

🛠️ Now you can run the same commands as earlier (http POST ...) and check the logs again.

This is most likely an issue with some data in the database... Thanks to the logs, we may quickly get the idea that the problem comes from the point of sale with ID POS-02.

It is easy to identify the issue if we don't have traffic against our application, but what if we have more realistic traffic?

🛠️ Generate some load with k6 (a Grafana tool for load testing):

$ k6 run -u 5 -d 30s k6/01-payment-only.js

👀 Check the logs again:

$ docker compose logs -f easypay-service

Logs are now interleaved due to several requests being executed concurrently. It becomes harder to identify which point of sale is causing this error...

👀 If you look into the SQL import file (easypay-service/src/main/resources/db/postgresql/data.sql), you'll notice a NULL value instead of a boolean for the active column.

Using Mapped Diagnostic Context (MDC) to get more insights

Mapped Diagnostic Context (MDC) will help us add more context to every log output. For more information, refer to this web page: https://logback.qos.ch/manual/mdc.html. It is a kind of map attached to the thread context, maintained by the logging framework, in which you can put values and use them in your log layout.

As it is attached to a thread, we can put values in it at the beginning of a request and print them in all the logs related to this request. That gives you a lot of information about the context of the request, without having to add it to every log statement manually.

A good sketch is better than a long speech. In the next section, we will use MDC to add the card number and the POS ID to all the logs related to a request.

📝 Go to the easypay-service/src/main/java/com/worldline/easypay/payment/boundary/PaymentResource class and modify the processPayment() method to populate the MDC:

public ResponseEntity<PaymentResponse> processPayment(PaymentRequest paymentRequest) {
    // Add cardNumber to SLF4J MDC 
    MDC.put("cardNumber",paymentRequest.cardNumber());
    // Add Point Of Sale identifier to SLF4J MDC
    MDC.put("pos",paymentRequest.posId());
            
    try { // Add a try-finally construct and wrap the initial code here 
        //...
        return httpResponse;
    } finally {
        // Clear MDC at the end
        MDC.clear();
    }
}

Now, we want to print these values when a log line is printed in the console.

📝 Modify the Spring configuration file (easypay-service/src/main/resources/application.yaml): update the logging.pattern.level property to add both the cardNumber and pos fields to all logs:

logging:
  pattern:
    level: "%5p [%mdc]"
# Alternative to print specific fields instead
logging:
  pattern:
    level: "%5p [%X{cardNumber} - %X{pos}]"

🛠️ Rebuild and redeploy the easypay-service:

$ docker compose up -d --build easypay-service

Adding more content in our logs

🛠️ To have more logs, we will run several HTTP requests using k6. Run the following command:

$ k6 run -u 5 -d 5s k6/01-payment-only.js

👀 Then check the logs to pinpoint some exceptions.

Logs Correlation

Let's dive into our logs on Grafana!

There are several ways to export logs to a log collector such as Loki or an ElasticSearch instance:

ℹ️ We target Loki as our log storage backend.

ℹ️ Exporting content of a log file

This approach consists of letting your logging framework write your logs into a file, which is then read by another process that sends them to a log collector.

It is a common way to export logs. You just have to configure a log file processor such as Filebeat or Promtail to read the file and send its content to the corresponding log backend.
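For illustration only, a minimal Promtail configuration could look like the following sketch (the file path and labels are hypothetical and not part of this workshop's setup):

clients:
  # Loki push API endpoint
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: easypay-logs
    static_configs:
      - targets: [ localhost ]
        labels:
          job: easypay-service
          # Path of the log file written by the application (hypothetical)
          __path__: /var/log/easypay/*.log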

It also requires formatting your logs in a structured way, such as JSON, to ease their parsing and ingestion in the backend. This is known as Structured Logging.

With Spring Boot, you can write your logs in Elastic Common Schema (ECS), Graylog Extended Log Format (GELF) or Logstash JSON formats: Structured Logging with Spring Boot Logging.

📝 OPTIONAL: You can try to change the console log format to one of ecs, gelf or logstash in easypay-service, by adding to the application.yaml file:

# Structured Logging: could be either ecs, gelf or logstash
logging:
  # ...
  structured:
    format:
      console: ecs

🛠️ Rebuild and redeploy the easypay-service:

$ docker compose up -d --build easypay-service

👀 Check logs in the console to see the new format.

Our choice: sending logs directly to the log collector

It is also possible to send logs directly to a log collector by configuring an appender in your logging framework.

This has the advantage of being more real-time than the previous approach.

Loki can ingest logs using its own API or using the OpenTelemetry protocol. So we have several options:

We will use the latter, as we focus on what OpenTelemetry can bring to us for the observability of our Java applications 😉, and not only Spring Boot ones!

2 ways of instrumentation: Pros & Cons

There are two ways to instrument the bytecode and broadcast telemetry: using a library or a Java agent. Here is a short summary of the pros and cons.

Java Agent

Library / Starter

If you want to know more about this topic, you can check out this documentation.

Target Architecture

Logs Architecture

The OpenTelemetry Collector is an important component in our architecture: it acts as an ETL (Extract, Transform, Load) process for telemetry data (logs, metrics and traces). It will receive logs from the application, transform them into a format that can be ingested by the log storage backend, and send them to the backend.

A common practice is to install a collector on each host (or Kubernetes node) where your applications are running. It will then collect logs from all the applications running on that host.

In the next steps, we will attach the OpenTelemetry Agent to the easypay-service, and configure it to send logs to the OpenTelemetry Collector.

OpenTelemetry Java Agent

The OpenTelemetry Java Agent is a Java agent that can be attached to a JVM to automatically instrument your application.

It is able to collect and send all your application telemetry: logs, metrics and traces, for most of the frameworks and libraries you may use in your application.

👀 You can have a look at all the libraries, frameworks, application servers and JVMs supported by the agent.

ℹ️ To attach an agent to a JVM, you just have to add the -javaagent option to the JVM command line.
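For example, a generic invocation looks like this (the jar names and paths are just illustrative):

$ java -javaagent:/path/to/opentelemetry-javaagent.jar -jar my-application.jar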

🛠️ Check if the opentelemetry-javaagent.jar is already downloaded in the instrumentation directory. If the file is missing, invoke the following script to download it:

$ bash scripts/download-agent.sh

ℹ️ opentelemetry-javaagent.jar is available as /opentelemetry-javaagent.jar in the container.

📝 Modify the entrypoint definition in the compose.yml file to attach the OpenTelemetry Java Agent to the easypay-service:

services:
  easypay-service:
    # ...
    volumes:
      - ./instrumentation/opentelemetry-javaagent.jar:/opentelemetry-javaagent.jar
    # ...
    entrypoint:
      - java
      - -javaagent:/opentelemetry-javaagent.jar # < Add this line 
      - -cp
      - app:app/lib/*
      - com.worldline.easypay.EasypayServiceApplication

By default, the OpenTelemetry Agent target endpoint is configured to localhost:4317. It is overridable by setting the system property otel.exporter.otlp.endpoint or by using the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

ℹ️ Our collector is listening on http://opentelemetry-collector:4317.

📝 Add the following environment variables to the easypay-service service in the compose.yml file such as:

services:
  easypay-service:
    # ...
    environment:
      # ...
      OTEL_RESOURCE_ATTRIBUTES: "service.name=easypay-service,deployment.environment=dev,service.namespace=service,service.version=1.0.0,service.instance.id=easypay-service:8080" # (1)
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc # (2)
      OTEL_EXPORTER_OTLP_ENDPOINT: http://opentelemetry-collector:4317 # (3)
    # ...
  1. The OTEL_RESOURCE_ATTRIBUTES environment variable is used to define the service name, the deployment environment, the service namespace, the service version and the service instance id,
  2. The OTEL_EXPORTER_OTLP_PROTOCOL environment variable is used to define the protocol used to send telemetry data to the collector.
  3. The OTEL_EXPORTER_OTLP_ENDPOINT environment variable is used to define the endpoint of the collector.

MDC support

By default, exporting MDC values with the OpenTelemetry Java Agent is experimental and requires an opt-in configuration.

But that does not prevent us from using it in our workshop! We should set the following properties:

The agent can be configured using either:

📝 Modify the entrypoint definition in the compose.yml file to add the following system properties:

services:
  easypay-service:
    # ...
    entrypoint:
      - java
      - -javaagent:/opentelemetry-javaagent.jar
      - -Dotel.instrumentation.logback-appender.experimental-log-attributes=true       # < Add this line
      - -Dotel.instrumentation.logback-appender.experimental.capture-mdc-attributes=*  # < Add this line
      - -cp
      - app:app/lib/*
      - com.worldline.easypay.EasypayServiceApplication

OpenTelemetry Collector

ℹ️ The OpenTelemetry Collector is already configured to receive logs and forward them to the Loki backend.

👀 You can check the collector configuration located in the docker/otelcol/otelcol.yaml file.

receivers:
  # Listen for telemetry data via OpenTelemetry protocol
  otlp: # (1)
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: # (2)

exporters:
  # Configure an exporter using the OpenTelemetry protocol over HTTP to send logs to Loki
  otlphttp/loki: # (3)
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    # Declare the pipeline for logs processing:
    # 1. Receive logs via the OpenTelemetry protocol
    # 2. Optimize data by batching them (optional but recommended)
    # 3. Export logs to Loki
    logs: # (4)
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ otlphttp/loki ]
  1. Declare an input receiver to listen for telemetry data via the OpenTelemetry protocol on ports 4317 (GRPC) and 4318 (HTTP/PROTOBUF),
  2. Declare a processor of type batch to optimize data by batching them,
  3. Declare an exporter to send logs to Loki using the OpenTelemetry protocol over HTTP,
  4. Configure the logs pipeline by defining the receivers, processors, and exporters to use.

🛠️ Redeploy the easypay-service:

$ docker compose up -d easypay-service

✅ To ensure easypay-service has started up correctly, check its logs with:

$ docker compose logs -f easypay-service

✅ If the Java agent was correctly taken into account, the logs should start with:

easypay-service  | OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
easypay-service  | [otel.javaagent 2025-01-16 15:41:46:550 +0000] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 2.11.0

Explore logs with Grafana

🛠️ Go to Grafana (port 3000):

Grafana offers a form to help you build your queries:

You can also use a dedicated query language to write your queries directly: it is named LogQL.

🛠️ Let's get logs from the easypay-service:
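If you prefer writing LogQL directly, a query along these lines should return them (assuming the OTLP ingestion created a service_name label from the service.name resource attribute; adjust the label if yours differ):

{service_name="easypay-service"}

You can append a line filter such as |= "INVALID_CARD_NUMBER" to narrow down the results.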

🛠️ Do not hesitate to hit the easypay payment endpoint with curl/httpie or k6 to generate some logs (whichever you prefer):

http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000
# OR
k6 run -u 1 -d 2m k6/01-payment-only.js

👀 You can also view logs for the other services (e.g., api-gateway).

Maybe another issue? Do you see the card numbers? 😨

Personal Identifiable Information (PII) obfuscation

For compliance and to prevent personal data loss, we will obfuscate the card number in the logs.

The OpenTelemetry collector in its contrib flavor provides a redaction processor we can use to obfuscate sensitive data. The processor can be declared and attached to the log pipeline in order to mask all attributes containing a sensitive value.

📝 Let's add the redaction processor to the OpenTelemetry collector configuration:

(...)

processors:
  batch:

  redaction/card-numbers: # (1)
    allow_all_keys: true
    blocked_values:
      - "4[0-9]{12}(?:[0-9]{3})?" ## VISA
      - "(5[1-5][0-9]{14}|2(22[1-9][0-9]{12}|2[3-9][0-9]{13}|[3-6][0-9]{14}|7[0-1][0-9]{13}|720[0-9]{12}))" ## MasterCard
      - "3(?:0[0-5]|[68][0-9])[0-9]{11}" ## Diners Club
      - "3[47][0-9]{13}" ## American Express
      - "65[4-9][0-9]{13}|64[4-9][0-9]{13}|6011[0-9]{12}|(622(?:12[6-9]|1[3-9][0-9]|[2-8][0-9][0-9]|9[01][0-9]|92[0-5])[0-9]{10})" ## Discover
      - "(?:2131|1800|35[0-9]{3})[0-9]{11}" ## JCB
      - "62[0-9]{14,17}" ## UnionPay
    summary: debug

(...)

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch,redaction/card-numbers] # (2)
      exporters: [otlphttp/loki]

(...)
  1. We declare a new processor named redaction/card-numbers that will obfuscate all attributes containing a card number,
  2. We attach the processor to the logs pipeline.

🛠️ Restart the collector to take into account the new configuration:

docker compose restart opentelemetry-collector

🛠️ Generate some logs with curl/httpie or k6.

✅ Check that the card numbers are now obfuscated with ****.

Let's take control of our application's metrics!

Our goal is to collect metrics from our services and forward them to the Prometheus time-series database. As with logs, we will use the OpenTelemetry collector as a gateway to collect and forward these metrics.

Metrics architecture

Expose metrics of our Java application

With Spring Boot, there are several ways to expose application metrics:

We will continue to use the OpenTelemetry Agent, as it is a straightforward way to collect metrics, and we already configured it!

Configure the OpenTelemetry Agent

The Agent should already collect metrics of the application, sending data every 60s. That's a bit long for our workshop, so we will reduce this frequency to 5s.

📝 Modify the entrypoint definition in the compose.yml file to add the following system properties:

services:
  easypay-service:
    # ...
    entrypoint:
      - java
      - -javaagent:/opentelemetry-javaagent.jar
      - -Dotel.instrumentation.logback-appender.experimental-log-attributes=true
      - -Dotel.instrumentation.logback-appender.experimental.capture-mdc-attributes=*
      - -Dotel.metric.export.interval=5000 # < Add this line
      - -cp
      - app:app/lib/*
      - com.worldline.easypay.EasypayServiceApplication

ℹ️ The otel.metric.export.interval system property defines the frequency, in milliseconds, at which metrics are sent to the target endpoint. As a reminder, you can also use environment variables to configure the agent (otel.metric.export.interval becomes the OTEL_METRIC_EXPORT_INTERVAL environment variable).
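As a sketch, the equivalent environment-variable based configuration in compose.yml would be:

services:
  easypay-service:
    # ...
    environment:
      # ...
      OTEL_METRIC_EXPORT_INTERVAL: 5000 # export metrics every 5 seconds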

Export metrics to Prometheus

Prometheus is a well-known time-series database and monitoring system that scrapes metrics from instrumented applications. It even supports the OTLP protocol to ingest metrics.

Instead of sending them directly, we will keep using the OpenTelemetry collector to collect metrics and forward them to the target database.

👀 For this workshop, the OpenTelemetry collector is already configured to receive metrics and forward them to Prometheus:

# Input
receivers:
  # OpenTelemetry Protocol: logs, metrics and traces
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

#...
# Output
exporters:
  #...
  # Export to Prometheus via HTTP using the OpenTelemetry Protocol
  otlphttp/prometheus:
    endpoint: http://prometheus:9090/api/v1/otlp

# Telemetry processing pipelines
service:
  pipelines:
    # ...
    # Receive metrics using the OpenTelemetry Protocol and export to Prometheus
    metrics:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ otlphttp/prometheus ]

Let's explore the metrics

🛠️ Go to Grafana and start again an Explore dashboard.

🛠️ Select the Prometheus datasource instead of the Loki one.

In this section you will get hands-on with Grafana's metrics query builder.

The Metric field lists all the metrics available in the Prometheus server: take time to explore them.

🛠️ For example, you can select the metric named jvm_memory_used_bytes, and click on the Run query button to plot the memory usage of all your services by memory area.

🛠️ If you want to plot the total memory usage of your services (see the PromQL sketch below):

🛠️ You can also filter metrics to be displayed using Label filters: try to create a filter to display only the metrics related to the application named easypay-service.
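For reference, queries in the spirit of the following should do the trick (the exact label to filter on, e.g. job or service_name, depends on how the OTLP metrics are ingested into Prometheus):

# Total memory usage per service
sum by (job) (jvm_memory_used_bytes)

# Memory usage restricted to easypay-service
sum(jvm_memory_used_bytes{job=~".*easypay-service.*"})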

Dashboards

With Grafana, you can either create your own dashboards or import some provided by the community from Grafana's dashboards collection.

We will choose the second solution right now and import the following dashboards:

🛠️ To import these dashboards:

👀 Explore the JMX Overview dashboard: it works almost out of the box.
It contains a lot of useful information about the JVMs running our services.

The job filter (top of the dashboard) lets you select the service whose metrics you want to explore.

Incident!

🛠️ Now let's simulate some traffic using Grafana K6. Run the following command:

$ k6 run -u 5 -d 2m k6/02-payment-smartbank.js

👀 Go back to the Grafana dashboard, click on Dashboards and select JVM Micrometer:

We were talking about an incident, weren't we?

👀 Let's go back to the Explore view of Grafana, select Loki as a data source and see what happens!

🛠️ Create a query with the following parameters to get error logs of the smartbank-gateway service:

🛠️ Click on Run query and check out the logs.

Normally you would get a java.lang.OutOfMemoryError due to a saturated Java heap space.

👀 To get additional insights, you can go back to the JMX Overview dashboard and select the smartbank-gateway application.

Normally you will see the used JVM Heap reaching the maximum allowed.

Business metrics

Observability is not only about incidents. You can also define your own metrics.

The OpenTelemetry API lets you create your own metrics (counters, gauges, histograms, and more).

Let's go back to our code!

Objectives

We want to add new metrics to the easypay service to measure the payment processing and storing times.
So we target a metric of Histogram type.

In order to achieve this goal, we will measure the time spent in the two methods process and store of the com.worldline.easypay.payment.control.PaymentService class of the easypay-service module.
This class is the central component responsible for processing payments: it provides the accept public method, which delegates its responsibility to two private ones:

We also want to count the number of payment requests processed by our system. We will use a metric of Counter type.

1. Add the opentelemetry-api dependency

We need to add the opentelemetry-api dependency to easypay-service in order to use the OpenTelemetry API to create custom metrics.

📝 Add the following dependency to the easypay-service build.gradle.kts file:

dependencies {
    // ...
    implementation("io.opentelemetry:opentelemetry-api")
}

2. Declare the histogram

We need to declare two histograms in our code:

📝 Let's modify the com.worldline.easypay.payment.control.PaymentService class to declare them:

// ...

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.metrics.LongHistogram;

@Service
public class PaymentService {
    // ...
    private LongHistogram processHistogram;  // (1)
    private LongHistogram storeHistogram;

    public PaymentService(/* ... */) {
        // ...

        OpenTelemetry openTelemetry = GlobalOpenTelemetry.get(); // (2)

        processHistogram = openTelemetry.getMeter(EasypayServiceApplication.class.getName())  //(3)
                .histogramBuilder("snowcamp.payment.process")  // (4)
                .setDescription("Payment processing time") // (5)
                .setUnit("ms") // (6)
                .ofLongs() // (7)
                .build();
        storeHistogram = openTelemetry.getMeter(EasypayServiceApplication.class.getName())
                .histogramBuilder("snowcamp.payment.store")
                .setDescription("Payment storing time")
                .setUnit("ms")
                .ofLongs()
                .build();
    }
}
  1. Declare the two histograms,
  2. Inject the OpenTelemetry instance to get the Meter object to create the histograms,
  3. Initialize the two histograms by giving them a name (4), a description (5), a unit (6) and setting the type of the values (7).

3. Record time spent in the methods

📝 Let's modify our process and store methods to record our latency with the new metrics.
We can simply wrap our original code in a try-finally construct such as:

    // ...
private void process(PaymentProcessingContext context) {
    long startTime = System.currentTimeMillis(); // (1)
    try { // (2)
        if (!posValidator.isActive(context.posId)) {
            context.responseCode = PaymentResponseCode.INACTIVE_POS;
            return;
        }
        // ...
    } finally {
        long duration = System.currentTimeMillis() - startTime; // (3)
        processHistogram.record(duration); // (4)
    }
}

private void store(PaymentProcessingContext context) {
    long startTime = System.currentTimeMillis(); // (5)
    try {
        Payment payment = new Payment();
        // ...
    } finally {
        long duration = System.currentTimeMillis() - startTime;
        storeHistogram.record(duration);
    }
}
  1. Get the start time,
  2. Wrap the original code in a try-finally construct,
  3. Compute the duration,
  4. Record the duration in the processHistogram histogram,
  5. Do the same for the store method.

4. Add counter

📝 Let's do the same for the counter:

// ...

import io.opentelemetry.api.metrics.LongCounter;

@Service
public class PaymentService {
    //...
    private LongCounter requestCounter; // (1)

    public PaymentService(/* ... */) {
        // ...
        requestCounter = openTelemetry.getMeter(EasypayServiceApplication.class.getName()) // (2)
                .counterBuilder("snowcamp.payment.requests")
                .setDescription("Payment requests counter")
                .build();
    }
}
  1. Declares the counter,
  2. Initializes the counter.

📝 The accept method of the PaymentService class is invoked for each payment request, so it is a good candidate for incrementing our counter:

@Transactional(Transactional.TxType.REQUIRED)
public void accept(PaymentProcessingContext paymentContext) {
    requestCounter.add(1); // < Add this (1)
    process(paymentContext);
    store(paymentContext);
    paymentTracker.track(paymentContext);
}
  1. Increment the counter each time the method is invoked.

5. Redeploy easypay

🛠️ Rebuild the easypay-service:

$ docker compose build easypay-service

🛠️ Redeploy easypay:

$ docker compose up -d easypay-service

🛠️ Once easypay is started (you can check logs with the docker compose logs -f easypay-service command and wait for an output like Started EasypayServiceApplication in 32.271 seconds):

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000

🛠️ Then go to Grafana and explore Metrics to find your newly created metrics:

👀 Explore them, especially the _bucket one.

When using a Histogram you get several metrics by default, suffixed with:

Especially:

Finally, our Counter becomes a metric suffixed with _total: snowcamp_payment_requests_total.
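For instance, a simple PromQL query like the following should show the payment request rate per second, averaged over the last minute:

rate(snowcamp_payment_requests_total[1m])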

6. Compute percentiles

Let's compute percentiles for the process and store methods.

As we have seen, the Histogram metric provides the necessary data to compute percentiles, so we can query Prometheus to display the percentiles of our application:

🛠️ Go to Grafana, to explore Metrics again.

🛠️ To compute the percentiles for the snowcamp_payment_process histogram we have created:
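If you want a head start, a query in the spirit of the following computes the 95th percentile of the processing time (the exact metric name may carry a unit suffix such as _milliseconds, depending on how the OTLP histogram is ingested):

histogram_quantile(0.95, sum by (le) (rate(snowcamp_payment_process_milliseconds_bucket[1m])))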

7. Visualization

🛠️ Go back to Grafana (port 3000), and go into the Dashboards section.

🛠️ We will import the dashboard defined in the docker/grafana/dashboards/easypay-monitoring.json file:

You should be redirected to the Easypay Monitoring dashboard.

It provides some panels we have created from the new metrics you exposed in your application:

🛠️ You can generate some load to view your dashboards evolving live:

$ k6 run -u 2 -d 2m k6/01-payment-only.js

In this section, we'll explore distributed tracing, the third pillar of application observability.

Distributed tracing is an essential tool for monitoring and analyzing the performance of complex applications. It tracks the flow of requests across multiple services and components, helping to identify bottlenecks and improve efficiency, which is particularly useful for intricate systems like Easypay.

With Spring Boot, there are a couple of approaches to incorporate distributed tracing into your application:

For this workshop, we'll keep using the Java agent approach, as it's the most straightforward way to instrument our application and is independent of the libraries we use.

The OpenTelemetry Collector will be used once again, tasked with receiving traces and forwarding them to the Tempo backend.

Architecture with traces

Lastly, we will use Grafana to examine and interpret these traces, allowing us to better understand and optimize our application's performance.

Enable distributed tracing

To capture the entire transaction across all services in a trace, it's essential to instrument all the services in our application.

We have already configured the OpenTelemetry Java Agent to transmit telemetry data directly to the collector, so our application already sends traces to the collector.

OpenTelemetry Collector

OpenTelemetry Collector is already configured to accept traces through the OpenTelemetry GRPC protocol (OTLP) on port 4317, and then forward them to Grafana Tempo, which listens on the host tempo on the same port 4317 (this setup specifically handles OTLP traces).

👀 You can have a look at the collector configuration located in the docker/otelcol/otelcol.yaml file:

# Input
receivers:
  # OpenTelemetry Protocol: logs, metrics and traces
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

#...
# Output
exporters:
  # ...
  # Export to Tempo via GRPC using the OpenTelemetry Protocol
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

# Telemetry processing pipelines
service:
  pipelines:
    # ...
    # Receive traces using the OpenTelemetry Protocol and export to Tempo
    traces:
        receivers: [ otlp ]
        processors: [ batch ]
        exporters: [ otlp/tempo ]

Explore Traces with Grafana

🛠️ Generate some load on the application to produce traces:

$ k6 run -u 1 -d 5m k6/01-payment-only.js

🛠️ Let's explore your first traces in Grafana:

👀 Click on Service Graph and explore the Node graph: this view is extremely helpful for visualizing and understanding how our services communicate with each other.

👀 Go back to Search and click on Run query. You should see a table named Table - Traces. By default, this view provides the most recent traces available in Tempo.

🛠️ Let's find an interesting trace using the query builder:

You should see the full stack of the corresponding transaction.

👀 Grafana should open a new view (you can enlarge it by clicking on the three vertical dots and selecting Widen pane):

🛠️ Grafana also allows you to display the spans as a graph of interconnected nodes:

🛠️ Continue your exploration in the Search pane:

Sampling

When we instrument our services using the agent, every interaction, including Prometheus calls to the actuator/prometheus endpoint, is recorded.

In the Service Graph, you may have seen links between the User and services other than the api-gateway. That does not seem right: we only created payments through the api-gateway!

👀 If you click on such a link and select View traces, you should see a lot of traces regarding actuator/health.

To avoid storing unnecessary data in Tempo, we can sample the data in two ways:

In this workshop, we will implement Tail Sampling, using the tail_sampling processor.

📝 Modify the OpenTelemetry Collector configuration file (docker/otelcol/otelcol.yaml) as follows:

# ...
# Transform
processors:
  # ...
  tail_sampling/actuator: # Add a tail sampling definition to filter out actuator traces
      policies:
        [
          {
            name: "filter-http-url",
            type: "string_attribute",
            string_attribute: {
              key: "http.url",
              values: [ "/actuator/health" ],
              enabled_regex_matching: true,
              invert_match: true
            }
          },
          {
            name: "filter-url-path",
            type: "string_attribute",
            string_attribute: {
              key: "url.path",
              values: [ "/actuator/health" ],
              enabled_regex_matching: true,
              invert_match: true
            }
          }
        ]

# ...
# Telemetry processing pipelines
service:
  pipelines:
    # ...
    # Receive traces using the OpenTelemetry Protocol and export to Tempo
    traces:
      receivers: [ otlp ]
      processors: [ batch, tail_sampling/actuator ] # < add the tail_sampling processor here
      exporters: [ otlp/tempo ]

🛠️ Restart the collector:

$ docker compose restart opentelemetry-collector

Starting from this moment, you should no longer see traces related to actuator/health endpoints.

Custom Traces

Just like metrics, it is also possible to add your own spans on arbitrary methods to provide more business value to the observability of your application.

Let's return to our code!

Objectives

We want to add new spans to the traces generated in the easypay-service application to track payment processing and store events.

To achieve this goal, we will create new spans when the process and store methods of the com.worldline.easypay.payment.control.PaymentService class in the easypay-service module are invoked.

As a reminder, this class is the central component responsible for processing payments. It provides the public method accept, which delegates its responsibilities to two private methods:

1. Add Required Dependencies

We need to add the io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations dependency to our module to access some useful annotations.

👀 This has already been done in advance for this workshop. The following dependencies were added to the Gradle build file (build.gradle.kts) of the easypay-service module:

dependencies {
    //...
    // Add opentelemetry Annotations support
    implementation("io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations")

    // ...
}

2. Add Custom Spans

📝 To add new spans based on methods, we can simply use the @WithSpan Java annotation. When a traced transaction invokes the annotated method, a new span will be created. Here's how to do it:

// ...

import io.opentelemetry.instrumentation.annotations.WithSpan;

@Service
public class PaymentService {
    // ...

    @WithSpan("Snowcamp: Payment processing method")
    private void process(PaymentProcessingContext context) {
        //...
    }

    @WithSpan("Snowcamp: Payment store method")
    private void store(PaymentProcessingContext context) {
        //...
    }
    
    // ...
}

📝 We can also provide additional information to the span, such as method parameters, using the @SpanAttribute annotation:

// ...

import io.opentelemetry.instrumentation.annotations.SpanAttribute;

@Service
public class PaymentService {
    // ...

    @WithSpan("Snowcamp: Payment processing method")
    private void process(@SpanAttribute("context") PaymentProcessingContext context) { // <-- HERE
        // ...
    }

    @WithSpan("Snowcamp: Payment store method")
    private void store(@SpanAttribute("context") PaymentProcessingContext context) { // <-- HERE
        // ...
    }
    // ...
}

This will include the whole PaymentProcessingContext in the trace.

3. Build and redeploy

🛠️ As we did before:

$ docker compose up -d --build easypay-service

4. Test it!

🛠️ Generate some payments:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000

👀 Go back to Grafana and try to find your new traces using what you've learned previously. Observe the spans you added.

Grafana allows correlation between all our telemetry data:

When discussing observability, correlation is essential. It enables you to diagnose the root cause of an issue quickly by using all the telemetry data involved.

Logs and traces

🛠️ Let's go back to the Grafana Explore dashboard:

👀 Now check whether the logs contain trace_id (https://www.w3.org/TR/trace-context/#trace-id) and span_id attributes. These will help us correlate our different request logs and traces.

Enable correlation

🛠️ In Grafana, go to Connections > Data sources:

🛠️ Go back to the Grafana Explore dashboard and try to find the same kind of log message:

👀 Grafana should open a pane with the corresponding trace from Tempo!

Now you can correlate logs and traces!
If you encounter any exceptions in your error logs, you can now see where it happens and get the bigger picture of the transaction from the customer's point of view.

How was it done?

First of all, logs should contain the trace_id information.
Most frameworks handle this for you. Whenever a request generates a trace or span, the value is placed in the MDC (Mapped Diagnostic Context) and can be printed in the logs.
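As an illustration, the OpenTelemetry Java agent injects trace_id and span_id keys into the Logback MDC, so a pattern along these lines should print them in the console logs (just a sketch, not required for this workshop, and key names may vary with agent versions):

logging:
  pattern:
    level: "%5p [trace_id=%X{trace_id} span_id=%X{span_id}]"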

On the other hand, Grafana has the ability to parse logs to extract certain values for the Loki data source. This is the purpose of Derived fields.

When configuring the Loki data source, we provided Grafana with the trace ID label to use from logs and linked it to the Tempo data source. Behind the scenes, Grafana creates the bridge between the two telemetry data sources. And that's all ๐Ÿ˜Ž

Metrics and Traces (Exemplars)

Exemplars are annotations used in metrics that link specific occurrences, like logs or traces, to data points within a metric time series. They provide direct insights into system behaviors at moments captured by the metric, aiding quick diagnostics by showing related trace data where anomalies occur. This feature is valuable for debugging, offering a clearer understanding of metrics in relation to system events.

🛠️ Generate some load towards the easypay-service:

$ k6 run -u 1 -d 2m k6/01-payment-only.js

ℹ️ Exemplars are tied to a metric and contain a trace ID and a span ID. They are notably available in the OpenMetrics format, and OpenTelemetry supports them as well. In OpenMetrics format, they appear as follows:

http_server_requests_seconds_bucket{application="easypay-service",error="none",exception="none",instance="easypay-service:39a9ae31-f73a-4a63-abe5-33049b8272ca",method="GET",namespace="local",outcome="SUCCESS",status="200",uri="/actuator/prometheus",le="0.027962026"} 1121 # {span_id="d0cf53bcde7b60be",trace_id="969873d828346bb616dca9547f0d9fc9"} 0.023276118 1719913187.631

The interesting part starts after the # character: this is the so-called exemplar:

               SPAN ID                           TRACE ID                     VALUE      TIMESTAMP
# {span_id="d0cf53bcde7b60be",trace_id="969873d828346bb616dca9547f0d9fc9"} 0.023276118 1719913187.631

That could be translated by:

👀 Exemplars can be analyzed in Grafana:

http_server_request_duration_seconds_bucket{http_route="/payments",service_name="easypay-service"}

👀 In addition to the line graph, you should see square dots at the bottom of the graph:

Enable correlation

🛠️ In Grafana, go to the Connections > Data sources section:

🛠️ Go back to the Grafana Explore dashboard and try to find the same exemplar as before:

👀 Grafana should open a new pane with the corresponding trace from Tempo!

We have added a new correlation dimension to our system between metrics and traces!

How was it done?

Meters such as histograms are updated in the context of a request, and thus of a recorded trace.

When a histogram is updated, the OpenTelemetry instrumentation library gets the current trace and span ids from the context and adds them to the corresponding bucket as an exemplar.

Exemplars are then propagated with the metric to the time series database, which should support them. Fortunately, Prometheus supports them: we just had to start the server with the --enable-feature=exemplar-storage flag (see compose.infrastructure.yml) to enable their storage.

Traces to Logs

We are able to link logs to traces thanks to Loki's data source Derived fields, but what about traces to logs?

Fortunately, the Tempo data source can be configured the same way!

Enable correlation

🛠️ In Grafana, go to Connections > Data sources:

ℹ️ Just to give you some context about this configuration:

Test correlation

🛠️ Hit the easypay payment endpoint with curl or k6 to generate some traces (whichever you prefer):

🛠️ Open the Grafana Explore dashboard to find a trace:

👀 A new LOG icon should appear for each line of the trace (middle of the screen):

Yeah, we have added a new dimension to the correlation of our telemetry data, further improving our observability.

Traces to Metrics

The last correlation we will explore in today's workshop is between traces and metrics.

Sometimes, we are interested in knowing the state of our application when inspecting traces, such as JVM heap or system CPU usage.

In this section, we will configure the Tempo data source to link our traces to these metrics.

Enable correlation

🛠️ In Grafana, go to Connections > Data sources:

🛠️ Now, we will add some metric queries (click on + Add query for each query):

🛠️ Finally click on Save & test and go back to the Grafana Explore dashboard to test our new setup.

Test correlation

🛠️ Hit the easypay payment endpoint with curl or k6 to generate some traces (whichever you prefer):

🛠️ Open the Grafana Explore dashboard to find a trace:

👀 The previous LOG icon has been replaced by a link icon (middle of the screen):

This was the last correlation dimension we wanted to show you!

The OpenTelemetry project standardizes telemetry signals, particularly the logs, metrics, and traces we have seen so far.
However, last year they announced their work on a fourth signal: profiling.

Profiling involves measuring the performance characteristics of your application, such as execution time, CPU utilization, or memory usage. It helps identify bottlenecks and optimize resource usage in your application.

If you're familiar with Java, you may already know async_profiler for HotSpot JVMs. It defines itself as a "low overhead sampling profiler for Java."

You may have also heard about eBPF, a technology embedded in the Linux kernel that allows running code in a sandbox within the kernel space. This technology is gaining traction in service meshes and in continuous profiling.

Continuous profiling is an ongoing area of interest in the observability field, aimed at finding additional performance improvements.

(Grafana) Pyroscope

Pyroscope was an open-source project for continuous profiling. It consists of a server that receives profiling samples, which can then be analyzed and displayed as a flamegraph.

In the Java landscape, it offers a Java Agent based on async_profiler, compatible with other agents such as the OpenTelemetry agent. Phew!

In 2023, Grafana acquired Pyroscope and merged it with its own solution, Phlare. Welcome to Grafana Pyroscope!

If you want to know more about Continuous Profiling and what it can bring to you, you may want to check out the Grafana Pyroscope documentation.

Objectives

In this section, we aim to show you:

Enable Continuous Profiling

Start the Pyroscope server

🛠️ We will use the grafana/pyroscope container image: we already defined a pyroscope service in our compose.yml file, but it is not yet enabled. You can start it by enabling the profiling profile:

$ docker compose --profile=profiling up -d

✅ It should start a new service on port 4040.

🛠️ Go to the Grafana Pyroscope dashboard UI on port 4040:

Setup easypay-service for Continuous Profiling

Let's use an agent again to profile our application, together with the Pyroscope extension for the OpenTelemetry agent to match spans with profiling data.

🛠️ First, download the agent. You can use the provided script to download it as instrumentation/pyroscope.jar:

$ bash ./scripts/download-pyroscope-agent.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 9782k  100 9782k    0     0  10.2M      0 --:--:-- --:--:-- --:--:-- 13.9M
Grafana Pyroscope agent downloaded successfully in ./scripts/../instrumentation

✅ It should have downloaded both pyroscope.jar and pyroscope-otel.jar in the instrumentation directory.

📝 Just like for logs and metrics, we should modify the compose.yml deployment file for the easypay-service to enable and configure profiling with Pyroscope:

services:
  easypay-service:
    # ...
    volumes:
      - ./instrumentation/opentelemetry-javaagent.jar:/opentelemetry-javaagent.jar
      - ./instrumentation/pyroscope.jar:/pyroscope.jar # < Add
      - ./instrumentation/pyroscope-otel.jar:/pyroscope-otel.jar # < Add  
    # ...
    environment:
      # ...
      # Pyroscope agent configuration --vv
      PYROSCOPE_APPLICATION_NAME: easypay-service # (1)
      PYROSCOPE_FORMAT: jfr                       # (2)
      PYROSCOPE_PROFILING_INTERVAL: 10ms    
      PYROSCOPE_PROFILER_EVENT: itimer            # (3)
      PYROSCOPE_PROFILER_LOCK: 10ms               # (4)
      PYROSCOPE_PROFILER_ALLOC: 512k              # (5)
      PYROSCOPE_UPLOAD_INTERVAL: 5s
      OTEL_JAVAAGENT_EXTENSIONS: /pyroscope-otel.jar # (6)
      OTEL_PYROSCOPE_ADD_PROFILE_URL: false
      OTEL_PYROSCOPE_ADD_PROFILE_BASELINE_URL: false
      OTEL_PYROSCOPE_START_PROFILING: true
      PYROSCOPE_SERVER_ADDRESS: http://pyroscope:4040 # (7)
    # ...
    entrypoint:
      - java
      - -javaagent:/pyroscope.jar # < Add
      - -javaagent:/opentelemetry-javaagent.jar
      - -Dotel.instrumentation.logback-appender.experimental-log-attributes=true
      - -Dotel.instrumentation.logback-appender.experimental.capture-mdc-attributes=*
      - -Dotel.metric.export.interval=5000
      - -cp
      - app:app/lib/*
      - com.worldline.easypay.EasypayServiceApplication
  1. Define an application name (this will create the service_name label),
  2. Set the format: JFR allows multiple events to be recorded,
  3. Type of event to profile: wall allows recording the time spent in methods; other valid values are itimer and cpu,
  4. Threshold to record lock events,
  5. Threshold to record memory events,
  6. Declare the Pyroscope extension for the OpenTelemetry agent,
  7. Server address.

🛠️ Redeploy easypay-service:

$ docker compose up -d easypay-service

✅ Check logs for correct startup:

docker compose logs -f easypay-service

You should see additional logs related to Pyroscope.

👀 Go back to the Pyroscope dashboard (port 4040):

Grafana setup

👀 Let's go to the Grafana Explore dashboard:

🛠️ Generate some load with k6:

$ k6 run -u 1 -d 5m k6/02-payment-smartbank.js

Traces to Profiles correlation

It is possible to link traces to profiles in Grafana, thanks to the Pyroscope extension for the OpenTelemetry agent. This extension attaches span context to profiles, making it possible to correlate traces with profiles.

Configure correlation

🛠️ In Grafana, go to Connections > Data sources:

🛠️ Finally click on Save & test and go back to the Grafana Explore dashboard to test our new setup.

Test correlation

🛠️ Hit the easypay payment endpoint with curl or k6 to generate some traces (whichever you prefer):

🛠️ Open the Grafana Explore dashboard to find a trace:

👀 In the link icon, at the root of a service, select Related Profile.