This workshop aims to introduce how to make a Java application fully observable with logs, metrics and traces, and how to correlate all this telemetry data.

During this workshop we will use Prometheus for storing metrics, along with the Grafana stack: Grafana for dashboards and exploration, Loki for storing logs, and Tempo for storing traces.

We will also cover the OpenTelemetry Collector, which gathers the data coming from our microservices and then forwards it to these backends. And as we are talking about Grafana, you will use Grafana Alloy, which is one of its implementations (and replaces the deprecated Grafana Agent).

Application High Level Design

The Easy Pay System

API Gateway

Centralises all API calls.

Easy Pay Service

Payment microservice which accepts (or declines) payments.

This is how it validates every payment:

  1. Check the POS (Point of Sale) number
  2. Check the credit card number
  3. Check the credit card type
  4. Check the payment threshold: for this, it calls the Smart Bank Gateway for authorization

If the payment is validated, it stores it and broadcasts it to all the other microservices through Kafka.

Fraud detection Service

After fetching a message from the Kafka topic, this service searches its database to check whether the payment's card number is registered for fraud.

If it is, only a WARN log is emitted.

Merchant Back Office Service

For this lab, it only simulates the subscription of messages.

Smart Bank Gateway

This external service authorizes the payment.

Our fully observable platform

The Easy Pay observable System

Short explanation

As mentioned earlier, our observability stack is composed of Prometheus (metrics), Loki (logs), Tempo (traces), Grafana (visualization) and Grafana Alloy (the collector).

In addition, the microservices are started with an agent to broadcast the traces to the collector.

Icons

You will see the following icons during the workshop:

πŸ› οΈ An action to perform,
πŸ“ A file to modify,
πŸ‘€ Something to observe,
βœ… Validate something,
ℹ️ Some information.

Skills

Skill                                                Level
REST API                                             proficient
Java                                                 proficient
Gradle                                               novice
Spring Framework, Boot, Cloud Config, Cloud Gateway  proficient
Docker                                               novice
Grafana stack                                        novice
Prometheus                                           novice
Kafka                                                novice

Tools

If you want to execute this workshop locally

You MUST have set up these tools first: Java 21, Gradle 8.7+ (or simply the provided Gradle wrapper), Docker with Docker Compose, as well as httpie and k6, which are used later to interact with the services.

πŸ› οΈ Here are commands to validate your environment:

Java

$ java -version

openjdk version "21.0.3" 2024-04-16 LTS
OpenJDK Runtime Environment Temurin-21.0.3+9 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (build 21.0.3+9-LTS, mixed mode, sharing)

Gradle

πŸ› οΈ If you use the wrapper, you won't have troubles. Otherwise...:

$ gradle -version

------------------------------------------------------------
Gradle 8.7
------------------------------------------------------------

Build time:   2024-03-22 15:52:46 UTC
Revision:     650af14d7653aa949fce5e886e685efc9cf97c10

Kotlin:       1.9.22
Groovy:       3.0.17
Ant:          Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM:          21.0.3 (Eclipse Adoptium 21.0.3+9-LTS)
OS:           Linux 5.15.146.1-microsoft-standard-WSL2 amd64

Docker Compose

$ docker compose version
    
Docker Compose version v2.24.7

If you don't want to bother with a local setup

It's strongly recommended to use Gitpod. You must create an account first. You can then open this project either in your local VS Code or directly in your browser:

Open in Gitpod

Open Gitpod

We will assume you will use Gitpod for this workshop :)

Open in Gitpod

When a message invites you to make a URL public, select and validate it.

Start the infrastructure

The "infrastructure stack" is composed of the following components:

ℹ️ If you run your application on Gitpod, the following steps are automatically executed during the provisioning of your Gitpod environment.

πŸ› οΈ Otherwise, to run it on your desktop, execute the following commands

$ bash scripts/download-agent.sh
$ ./gradlew tasks
$ docker compose up -d --build --remove-orphans

βœ… To check if all the services are up, you can run this command:

$ docker compose ps -a

And check the status of every service.

For instance:

❯ docker compose ps
NAME                                           IMAGE                         COMMAND                  SERVICE               CREATED         STATUS                     PORTS
api-gateway                                    api-gateway:latest            "java -javaagent:/ap..."   api-gateway           8 minutes ago   Up 7 minutes (healthy)     0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
config-server                                  config-server:latest          "java -javaagent:/ap..."   config-server         8 minutes ago   Up 7 minutes (healthy)     0.0.0.0:8888->8888/tcp, :::8888->8888/tcp
discovery-server                               discovery-server:latest       "java -javaagent:/ap..."   discovery-server      8 minutes ago   Up 7 minutes (healthy)     0.0.0.0:8761->8761/tcp, :::8761->8761/tcp
easypay-service                                easypay-service:latest        "java -javaagent:/ap..."   easypay-service       8 minutes ago   Up 7 minutes (healthy)     
fraudetect                                     fraudetect-service:latest     "java -javaagent:/ap..."   fraudetect-service    8 minutes ago   Up 7 minutes (healthy)     
kafka                                          confluentinc/cp-kafka:7.6.1   "/etc/confluent/dock..."   kafka                 8 minutes ago   Up 8 minutes (healthy)     9092/tcp, 0.0.0.0:19092->19092/tcp, :::19092->19092/tcp
merchant-backoffice                            merchant-backoffice:latest    "java -javaagent:/ap..."   merchant-backoffice   8 minutes ago   Up 7 minutes (healthy)     
observability-workshop-collector-1             grafana/alloy:latest          "/bin/alloy run --se..."   collector             8 minutes ago   Up 8 minutes               0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp, 0.0.0.0:12345->12345/tcp, :::12345->12345/tcp
observability-workshop-grafana-1               grafana/grafana:latest        "sh -xeuc 'mkdir -p ..."   grafana               8 minutes ago   Up 7 minutes               0.0.0.0:3000->3000/tcp, :::3000->3000/tcp
observability-workshop-loki-1                  grafana/loki:latest           "/usr/bin/loki -conf..."   loki                  8 minutes ago   Up 7 minutes               0.0.0.0:3100->3100/tcp, :::3100->3100/tcp
observability-workshop-postgres-easypay-1      postgres:16                   "docker-entrypoint.s..."   postgres-easypay      8 minutes ago   Up 8 minutes (healthy)     0.0.0.0:5432->5432/tcp, :::5432->5432/tcp
observability-workshop-postgres-fraudetect-1   postgres:16                   "docker-entrypoint.s..."   postgres-fraudetect   8 minutes ago   Up 7 minutes (healthy)     0.0.0.0:5434->5432/tcp, :::5434->5432/tcp
observability-workshop-postgres-merchantbo-1   postgres:16                   "docker-entrypoint.s..."   postgres-merchantbo   8 minutes ago   Up 8 minutes (healthy)     0.0.0.0:5435->5432/tcp, :::5435->5432/tcp
observability-workshop-postgres-smartbank-1    postgres:16                   "docker-entrypoint.s..."   postgres-smartbank    8 minutes ago   Up 7 minutes (healthy)     0.0.0.0:5433->5432/tcp, :::5433->5432/tcp
observability-workshop-prometheus-1            prom/prometheus:v2.52.0       "/bin/prometheus --c..."   prometheus            8 minutes ago   Up 8 minutes               0.0.0.0:9090->9090/tcp, :::9090->9090/tcp
observability-workshop-tempo-1                 grafana/tempo:latest          "/tempo -config.file..."   tempo                 8 minutes ago   Up 7 minutes               0.0.0.0:3200->3200/tcp, :::3200->3200/tcp, 0.0.0.0:9095->9095/tcp, :::9095->9095/tcp, 0.0.0.0:9411->9411/tcp, :::9411->9411/tcp, 0.0.0.0:14268->14268/tcp, :::14268->14268/tcp
smartbank-gateway                              smartbank-gateway:latest      "java -Xmx4g -javaag..."   smartbank-gateway     8 minutes ago   Up 7 minutes (unhealthy)   

Validation

βœ… Open the Eureka website started during the infrastructure setup. The following instances should be registered with Eureka: api-gateway, easypay-service, fraudetect-service, merchant-backoffice and smartbank-gateway.

βœ… All services should be registered before continuing...

πŸ› οΈ You can now access our platform to initiate a payment:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=25000

βœ… You should get the following content:

HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:42:12 GMT
Location: http://172.19.25.95:44523/payments/3cd8df14-8c39-460b-a429-dc113d003aed
transfer-encoding: chunked

{
    "amount": 25000,
    "authorId": "5d364f1a-569c-4c1d-9735-619947ccbea6",
    "authorized": true,
    "bankCalled": true,
    "cardNumber": "5555567898780008",
    "cardType": "MASTERCARD",
    "expiryDate": "789456123",
    "paymentId": "3cd8df14-8c39-460b-a429-dc113d003aed",
    "posId": "POS-01",
    "processingMode": "STANDARD",
    "responseCode": "ACCEPTED",
    "responseTime": 414
}

Some functional issues

One of our customers raised an issue:

Β« When I reach your API, I usually get either an AMOUNT_EXCEEDED or an INVALID_CARD_NUMBER error. Β»

Normally, the first thing to do would be to check the logs, but before that, let's reproduce these issues.

πŸ› οΈ You can check the API as following:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=51000

HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:45:40 GMT
Location: http://172.19.25.95:44523/payments/5459b20a-ac91-458f-9578-019c05483bb3
transfer-encoding: chunked

{
    "amount": 51000,
    "authorId": "6ace318f-b669-4e4a-b366-3f09048becb7",
    "authorized": false,
    "bankCalled": true,
    "cardNumber": "5555567898780008",
    "cardType": "MASTERCARD",
    "expiryDate": "789456123",
    "paymentId": "5459b20a-ac91-458f-9578-019c05483bb3",
    "posId": "POS-01",
    "processingMode": "STANDARD",
    "responseCode": "AUTHORIZATION_DENIED",
    "responseTime": 25
}
$  http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780007 expiryDate=789456123 amount:=25000

HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:46:09 GMT
Location: http://172.19.25.95:44523/payments/2dbf3823-fb11-4c63-a540-ab43ac663e68
transfer-encoding: chunked

{
    "amount": 51000,
    "authorId": null,
    "authorized": false,
    "bankCalled": false,
    "cardNumber": "5555567898780007",
    "cardType": null,
    "expiryDate": "789456123",
    "paymentId": "2dbf3823-fb11-4c63-a540-ab43ac663e68",
    "posId": "POS-01",
    "processingMode": "STANDARD",
    "responseCode": "INVALID_CARD_NUMBER",
    "responseTime": 5
}

πŸ› οΈ It's time to have a look to our easypay-service logs!
Service is started in a Docker container. To get its output you can use the following command:

$ docker compose logs -f easypay-service

πŸ‘€ Look into the console logs to pinpoint these issues.

Let's fix it!

It's time to add more contextual information into our code!

In this workshop, we will use SLF4J.

The logger can be created by adding a class variable such as:

  private static final Logger log = LoggerFactory.getLogger(BankAuthorService.class);

Remember to use the enclosing class when instantiating the logger!

What about log levels?

Use the most appropriate log level

The log level is a fundamental concept in logging. Whatever logging framework you use, it allows you to tag log records according to their severity or importance. For instance, SLF4J offers the following log levels by default: TRACE, DEBUG, INFO, WARN and ERROR.
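
To illustrate (this snippet is not part of the workshop code, just a sketch of typical SLF4J usage):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogLevelDemo {
    private static final Logger LOG = LoggerFactory.getLogger(LogLevelDemo.class);

    public void demo() {
        LOG.trace("Very fine-grained diagnostic details");
        LOG.debug("Debugging information, usually disabled in production");
        LOG.info("Normal business events, e.g. a payment was accepted");
        LOG.warn("Something unexpected happened, but the application can continue");
        LOG.error("A failure that requires attention");
    }
}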

AMOUNT_EXCEEDED issue

πŸ“ Go the easypay-service/src/main/java/com/worldline/easypay/payment/control/bank/BankAuthorService.java class and modify the following code block

@Retry(name = "BankAuthorService", fallbackMethod = "acceptByDelegation")
public boolean authorize(PaymentProcessingContext context) {
 LOG.info("Authorize payment for {}", context);
 try {
  var response = client.authorize(initRequest(context));
  context.bankCalled = true;
  context.authorId = Optional.of(response.authorId());
  context.authorized = response.authorized();
  return context.authorized;
 } catch (Exception e) {
  LOG.warn("Should retry or fallback: {}", e.getMessage());
  throw e;
 }
}

Add a log before the return context.authorized; instruction to provide contextual information such as the authorId and the result of the call (authorized or not).

For the lazy, you can write something like this:

[...]
context.bankCalled = true;
context.authorId = Optional.of(response.authorId());
context.authorized = response.authorized();
LOG.info("Bank answered payment authorization with authorId={}, authorized={}", response.authorId(), response.authorized());
return context.authorized;

INVALID_CARD_NUMBER issue

πŸ“ Go to the easypay-service/src/main/java/com/worldline/easypay/payment/control/PaymentService.java class and modify the following block code in the process method in the same way:

private void process(PaymentProcessingContext context) {
[...]

 if (!cardValidator.checkCardNumber(context.cardNumber)) {
  context.responseCode = PaymentResponseCode.INVALID_CARD_NUMBER;
  return;
 }

[...]

For this error, you can log a message containing the invalid card number as contextual information.
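
For instance (a suggestion only; adapt the level and message to your needs, context.cardNumber being the field already used elsewhere in this class):

LOG.warn("Payment request rejected, invalid card number: {}", context.cardNumber);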

πŸ“ You can also add more logs, especially in the easypay-service/src/main/java/com/worldline/easypay/payment/control/CardValidator.java class.
You may simply uncomment all the commented logs lines (// LOG....), such as:

LOG.warn("Check card number Luhn key does not pass: {}", cardNumber);
LOG.warn("Check card expiry date does not pass: bad format: {}", expiryDate);

You can go further and add as many logs as you think would help in production 😎

Additional logs

We can also log payment requests and responses to provide even more context, which could be helpful for following requests in this workshop.

πŸ“ Modify the easypay-service/src/main/java/com/worldline/easypay/payment/boundary/PaymentResource.java class by uncommenting all // LOG.... lines (keep MDC lines for later πŸ˜‰).

Check your code

πŸ› οΈ You can redeploy the easypay-service with the following commands:

$ docker compose build easypay-service
$ docker compose up -d easypay-service

πŸ› οΈ Now you can run the same commands ran earlier and check again the logs (http POST...).

A technical issue

Another issue was raised for the POS (Point of Sale) POS-02 (but we didn't know it yet!).

πŸ› οΈ When you reach the API using this command:

http POST :8080/api/easypay/payments posId=POS-02 cardNumber=5555567898780008 expiryDate=789456123 amount:=25000

πŸ‘€ You should get the following output:

HTTP/1.1 500
[...]

{
  "error":"Internal Server Error",
  "path": "/payments",
  "status": 500,
  [...]
}

πŸ‘€ You then get the following log message:

2024-06-05T15:45:35.215+02:00 ERROR 135386 --- [easypay-service] [o-auto-1-exec-7] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed: java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "java.util.List.get(int).active" is null] with root cause

java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "java.util.List.get(int).active" is null
        at com.worldline.easypay.payment.control.PosValidator.isActive(PosValidator.java:34) ~[main/:na]
        at com.worldline.easypay.payment.control.PaymentService.process(PaymentService.java:46) ~[main/:na]
        at com.worldline.easypay.payment.control.PaymentService.accept(PaymentService.java:108) ~[main/:na]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
        at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:354) ~[spring-aop-6.1.6.jar:6.1.6]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196) ~[spring-aop-6.1.6.jar:6.1.6]
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-6.1.6.jar:6.1.6]
    [...]

πŸ“ To find the root cause, add first a smart log entry in the easypay-service/src/main/java/com/worldline/easypay/payment/control/PosValidator.java class. In the isActive() method, catch the exception and trace the error:

public boolean isActive(String posId) {
    PosRef probe = new PosRef();
    probe.posId = posId;
    try {
        List<PosRef> posList = posRefRepository.findAll(Example.of(probe));

        if (posList.isEmpty()) {
            LOG.warn("checkPosStatus NOK, unknown posId {}", posId);
            return false;
        }

        boolean result = posList.get(0).active;

        if (!result) {
            LOG.warn("checkPosStatus NOK, inactive posId {}", posId);
        }
        return result;
    } catch (NullPointerException e) {
        LOG.warn("Invalid value for this POS: {}", posId);
        throw e;
    }
}

This is most likely an issue with some data in the database... Thanks to logs, we may quickly get the idea that the problem comes from the point of sale with ID POS-02.

It is easy to identify the issue when there is no traffic against our application, but what if we have more realistic traffic?

πŸ› οΈ Generate some load with k6 (a Grafana tool for load testing):

$ k6 run -u 5 -d 30s k6/01-payment-only.js

πŸ‘€ Check again logs:

$ docker compose logs -f easypay-service

Logs are now interleaved due to several requests being executed concurrently. It becomes harder to identify which point of sale is causing this error...

πŸ‘€ If you look into the SQL import file (easypay-service/src/main/resources/db/postgresql/data.sql), you'll notice a NULL value instead of a boolean for the active column.

Using Mapped Diagnostic Context (MDC) to get more insights

πŸ“ Go to the PaymentResource class and modify the method processPayment() to instantiate the MDC:

public ResponseEntity<PaymentResponse> processPayment(PaymentRequest paymentRequest) {
    MDC.put("CardNumber", paymentRequest.cardNumber());
    MDC.put("POS", paymentRequest.posId());

    [...]

    MDC.clear();
    return httpResponse;
}
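
ℹ️ Note that if an exception is thrown before MDC.clear() is reached, the MDC entries would leak into the next request handled by the same thread. A more defensive sketch (same method, same placeholder as above) wraps the processing in a try/finally block:

public ResponseEntity<PaymentResponse> processPayment(PaymentRequest paymentRequest) {
    MDC.put("CardNumber", paymentRequest.cardNumber());
    MDC.put("POS", paymentRequest.posId());
    try {
        [...]
        return httpResponse;
    } finally {
        MDC.clear(); // always executed, even when an exception is thrown
    }
}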

πŸ‘€ Go to the MDC Spring profile configuration file (easypay-service/src/main/resources/application-mdc.properties) and check the configuration of both the CardNumber and POS fields.

[...] %clr(CardNumber=){faint}%clr(%X{CardNumber:-null}) %clr(POS=){faint}%clr(%X{POS:-null}) [...] 

πŸ“ Activate the mdc profile in the compose.yml file:

  easypay-service:
    image: easypay-service:latest
    [...]
      SPRING_PROFILES_ACTIVE: default,docker,mdc

πŸ› οΈ Rebuild and redeploy the easypay-service:

$ docker compose build easypay-service
$ docker compose up -d easypay-service

Adding more content in our logs

πŸ› οΈ To have more logs, we will run several HTTP requests using K6. Run the following command:

$ k6 run -u 5 -d 5s k6/01-payment-only.js

πŸ‘€ Then check the logs to pinpoint some exceptions.

Logs Correlation

Let's dive into our logs on Grafana!

Logs are stored in the logs folder. You should find a .log.json file for each service of our application: we have configured our Spring Boot applications to output logs to files in JSON format.

πŸ‘€ You can take a look at log files such as easypay-service.log.json to see their structure: a lot of information is available!

In this section we will use Grafana Alloy (our OpenTelemetry Collector distribution) to read these files, in the same way Promtail would, and ship them to Loki.
Loki, our storage backend, is listening on host loki and port 3100.

Configure Grafana Alloy collector

πŸ“ Add the following Logging configuration to the docker/alloy/config.alloy file:

////////////////////
// LOGS
////////////////////

// JSON LOG FILES (1)
local.file_match "jsonlogs" {
    path_targets = [{"__path__" = "/logs/*.json", "exporter" = "JSONFILE"}]
}

// (2)
loki.source.file "jsonlogfiles" {
    targets    = local.file_match.jsonlogs.targets
    forward_to = [loki.process.jsonlogs.receiver]
}

// (3)
loki.process "jsonlogs" {
    forward_to = [loki.write.endpoint.receiver]

    // (4)
    stage.json {
        expressions = {
            timestamp   = "timestamp",
            application = "context.properties.applicationName",
            instance    = "context.properties.instance",
            level       = "level",
        }
    }

    // (5)
    stage.labels {
        values = {
            application = "application",
            instance    = "instance",
            level       = "level",
        }
    }

    // (6)
    stage.timestamp {
        source = "timestamp"
        format = "RFC3339"
        fallback_formats = ["UnixMs",]
    }
}

// (7)
// EXPORTER (LOKI)
loki.write "endpoint" {
    endpoint {
        url = "http://loki:3100/loki/api/v1/push"
    }
}

Thanks to this configuration, the JSON log files are automatically tailed and shipped to Loki. Here is a short summary of each numbered step:

  1. Lists the input files we are interested in (all .json files in the logs directory)
  2. Reads log entries from the files defined in (1) and forwards them to Alloy's Loki components
  3. Processes log entries to parse them, extract data, transform it, and filter log entries
  4. Extracts contextual information from the JSON log entries, such as the application name
  5. Applies the extracted values as labels, which will be available in Loki
  6. Extracts the timestamp from the log entry itself instead of using the scrape time
  7. Defines the output to Loki

πŸ› οΈ Restart the Grafana Alloy collector:

$ docker compose restart collector

πŸ‘€ Open a browser page to Alloy UI (port 12345):

Explore logs with Grafana

πŸ› οΈ Go to Grafana (port 3000):

Grafana offers a form to help you build your queries:

You can also use a dedicated query language to write your queries directly: it is named LogQL.

πŸ› οΈ Let's get logs from the easypay-service:

Now we want to display logs in a table format with the timestamp, the log level and the message.
In order to do that, we should extract the message from the JSON as it is not done automatically.

πŸ› οΈ Modify the query such as:

πŸ› οΈ Time to format:

πŸ› οΈ Do not hesitate to hit the easypay payment endpoint with curl/httpie or k6 to generate some logs (whichever you prefer):

http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000
# OR
k6 run -u 1 -d 2m k6/01-payment-only.js

πŸ‘€ You can also view logs for the other services (e.g., api-gateway).

πŸ‘€ You can also search logs based on the correlation ID.

Maybe another issue? Do you see the card numbers? 😨

Personal Identifiable Information (PII) obfuscation

For compliance and to prevent personal data loss, we will obfuscate the card number in the logs.

πŸ› οΈ In the Alloy configuration file (docker/alloy/config.alloy), add a luhn stage at the end of the loki.process "jsonlogs" block definition:

loki.process "jsonlogs" {
  //...
  stage.luhn {
    min_length  = 13
    replacement = "**MASKED**"
  }
}

πŸ› οΈ Restart the collector:

$ docker compose restart collector

πŸ› οΈ Generate some logs with curl/httpie or k6.

βœ… Check the card numbers are now obfuscated with the **MASKED** content.

Let's take control of our application's metrics!

Metrics exposed by the application

πŸ› οΈ Check out the easypay-service metrics definitions exposed by Spring Boot Actuator first:

$ http :8080/actuator/metrics

πŸ‘€ Explore the output.

πŸ› οΈ Now get the Prometheus metrics using this command:

$ http :8080/actuator/prometheus

This is an endpoint exposed by Actuator to let the Prometheus server get your application metrics.

How are metrics scraped?

πŸ‘€ Check out the Prometheus (docker/prometheus/prometheus.yml) configuration file. All the scraper's definitions are configured here.

For instance, here is the configuration of the config-server:

  - job_name: prometheus-config-server
    scrape_interval: 5s
    scrape_timeout: 5s
    metrics_path: /actuator/prometheus
    static_configs:
      - targets:
          - config-server:8890

You can see it uses, under the hood, the endpoint we looked into earlier. But this is a static configuration: we would have to tell Prometheus explicitly where to look for metrics for every single service...

Fortunately, Prometheus is also able to query our discovery-server (Eureka service discovery) to discover all the services plugged into our application. It will then scrape them in the same way:

  # Discover targets from Eureka and scrape metrics from them
  - job_name: eureka-discovery
    scrape_interval: 5s
    scrape_timeout: 5s
    eureka_sd_configs:
      - server: http://discovery-server:8761/eureka (1)
        refresh_interval: 5s
    relabel_configs: (2)
      - source_labels: [__meta_eureka_app_instance_metadata_metrics_path]
        target_label: __metrics_path__
  1. We plug Prometheus into our Eureka discovery-server to explore all the metrics of the underlying systems,
  2. The configuration allows additional operations, such as relabelling the final metric before storing it into Prometheus.

πŸ‘€ You can have an overview of all the scraped applications on the Prometheus dashboard:

Add scrape configuration for our easypay service

πŸ“ Modify the docker/prometheus/prometheus.yml file to add a new configuration to scrape the easypay service. You can use the prometheus-config-server configuration as a model:

That should give you the following yaml configuration:

  - job_name: prometheus-easypay-service
    scrape_interval: 5s
    scrape_timeout: 5s
    metrics_path: /actuator/prometheus
    static_configs:
      - targets:
          - easypay-service:8080

πŸ› οΈ Restart the Prometheus server to take into account this new configuration:

$ docker compose restart prometheus

πŸ‘€ Now explore again the targets (Status > Targets) on the Prometheus dashboard (port 9090).

Let's explore the metrics

πŸ› οΈ Go to Grafana and start again an Explore dashboard.

πŸ› οΈ Select the Prometheus datasource instead of the Loki one.

In this section you will get hands-on experience with Grafana's metrics query builder.

The Metric field lists all the metrics available in the Prometheus server: take time to explore them.

πŸ› οΈ For example, you can select the metric named jvm_memory_used_bytes, and click on the Run query button to plot the memory usage of all your services by memory area,

πŸ› οΈ If you want to plot the total memory usage of your services:

πŸ› οΈ You can also filter metrics to be displayed using Label filters: try to create a filter to display only the metric related to the application named easypay-service.

Dashboards

With Grafana, you can either create your own dashboards or import some provided by the community from Grafana's website.

We will choose the second solution right now and import some community dashboards, such as the JVM Micrometer dashboard.

πŸ› οΈ To import these dashboards:

πŸ‘€ Explore the JVM Micrometer dashboard: it works almost out of the box.
It contains a lot of useful information about the JVMs running our services.

The application filter (top of the dashboard) lets you select the service whose metrics you want to explore.

Incident!

πŸ› οΈ Now let's simulate some traffic using Grafana K6. Run the following command:

$ k6 run -u 5 -d 2m k6/02-payment-smartbank.js

πŸ‘€ Go back to the Grafana dashboard, click on Dashboards and select JVM Micrometer:

We were talking about an incident, weren't we?

πŸ‘€ Let's go back to the Explore view of Grafana, select Loki as a data source and see what happens!

πŸ› οΈ Create a query with the following parameters to get error logs of the smartbank-gateway service:

πŸ› οΈ Click on Run query and check out the logs.

Normally you would get a java.lang.OutOfMemoryError due to a saturated Java heap space.

πŸ‘€ To get additional insights, you can go back to the JVM dashboard and select the smartbank-gateway application.

Normally you will see the used JVM Heap reaching the maximum allowed.

Business metrics

Observability is not only about incidents. You can also define your own metrics.

Micrometer, the framework used by Spring Boot Actuator to expose metrics, provides an API to create your own metrics quite easily.

Let's go back to our code!

Objectives

We want to add new metrics to the easypay service to measure the payment processing and store times. The Timer metric type is a good fit for this.

In order to achieve this goal, we will measure the time spent in the two methods process and store of the com.worldline.easypay.payment.control.PaymentService class of the easypay-service module.
This class is the central component responsible for processing payments: it provides the public accept method, which delegates its responsibility to the two private methods process and store.

We also want to count the number of payment requests processed by our system. We will use a metric of Counter type.

You can take a look at the Micrometer's documentation about Timers.

1. Declare the timers

We need to declare two timers in our code: one for the process method and another for the store method.

πŸ“ Let's modify the com.worldline.easypay.payment.control.PaymentService class to declare them:

// ...
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

@Service
public class PaymentService {
    // ...
    private Timer processTimer; (1)
    private Timer storeTimer;

    public PaymentService(
            // ...,
            MeterRegistry meterRegistry) { (2)
        // ...

        processTimer = Timer (3)
                .builder("rivieradev.payment.process") (4)
                .description("Payment processing time") (5)
                .register(meterRegistry); (6)
        storeTimer = Timer
                .builder("rivieradev.payment.store")
                .description("Payment store time")
                .register(meterRegistry);
    }
  1. Declare the two timers,
  2. Inject the MeterRegistry provided by Spring Boot Actuator into the class constructor, as it is required to initialize the timers,
  3. Initialize the two timers by giving them a name (4) and a description (5), and registering them in the meter registry (6).

2. Record time spent in the methods

The Timer API can record blocks of code. This is the approach we will use:

timer.record(() -> {
  // some code
});

πŸ“ Let's modify our process and store methods to record our latency with the new metrics.
We can simply wrap our original code in a Java Runnable functional interface:

    // ...
    private void process(PaymentProcessingContext context) {
        processTimer.record(() -> { (1)
            if (!posValidator.isActive(context.posId)) {
                context.responseCode = PaymentResponseCode.INACTIVE_POS;
                return;
            }
            // ...
        });
    }

    private void store(PaymentProcessingContext context) {
        storeTimer.record(() -> { (2)
            Payment payment = new Payment();
            // ...
        });
    }
  1. Modify the process method to wrap all its content into a Runnable consumed by the record method of our processTimer timer,
  2. Do the same for the store method.

3. Add counter

πŸ“ Let's do the same for the counter:

// ...
import io.micrometer.core.instrument.Counter;

@Service
public class PaymentService {
    //...
    private Counter requestCounter; (1)

    public PaymentService(
            //...
            ) {
        // ...
        requestCounter = Counter (2)
                .builder("rivieradev.payment.requests")
                .description("Payment requests counter")
                .register(meterRegistry);
    }
  1. Declares the counter,
  2. Initializes the counter.

πŸ“ The method accept of the PaymentService class is invoked for each payment request, it is a good candidate to increment our counter:

    @Transactional(Transactional.TxType.REQUIRED)
    public void accept(PaymentProcessingContext paymentContext) {
        requestCounter.increment(); (1)
        process(paymentContext);
        store(paymentContext);
        paymentTracker.track(paymentContext);
    }
  1. Increment the counter each time the method is invoked.

4. Redeploy easypay

πŸ› οΈ Rebuild the easypay-service:

$ docker compose build easypay-service

πŸ› οΈ Redeploy easypay:

$ docker compose up -d easypay-service

πŸ› οΈ Once easypay is started (you can check logs with the docker compose logs -f easypay-service command and wait for an output like Started EasypayServiceApplication in 32.271 seconds):

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000
$ docker compose exec -it easypay-service sh
/ $ curl http://localhost:8080/actuator/prometheus | grep riviera

You should get this kind of output:

# HELP rivieradev_payment_process_seconds Payment processing time
# TYPE rivieradev_payment_process_seconds summary
rivieradev_payment_process_seconds_count{application="easypay-service",...} 4
rivieradev_payment_process_seconds_sum{application="easypay-service",...} 1.984019362
# HELP rivieradev_payment_process_seconds_max Payment processing time
# TYPE rivieradev_payment_process_seconds_max gauge
rivieradev_payment_process_seconds_max{application="easypay-service",...} 1.927278528
# HELP rivieradev_payment_store_seconds Payment store time
# TYPE rivieradev_payment_store_seconds summary
rivieradev_payment_store_seconds_count{application="easypay-service",...} 4
rivieradev_payment_store_seconds_sum{application="easypay-service",...} 0.299205989
# HELP rivieradev_payment_store_seconds_max Payment store time
# TYPE rivieradev_payment_store_seconds_max gauge
rivieradev_payment_store_seconds_max{application="easypay-service",...} 0.291785969
# HELP rivieradev_payment_requests_total Payment requests counter
# TYPE rivieradev_payment_requests_total counter
rivieradev_payment_requests_total{application="easypay-service",...} 4.0

When using a Timer you get three metrics by default, suffixed with _count (the number of recorded events), _sum (the cumulated time) and _max (the maximum recorded time), as shown in the output above.

In particular, we can get the average time spent in the method by dividing the sum by the count.
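
For instance, the average processing time over the last minute could be computed in PromQL like this (a sketch using the metric names defined above):

rate(rivieradev_payment_process_seconds_sum[1m]) / rate(rivieradev_payment_process_seconds_count[1m])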

Finally, our Counter is the last metric suffixed with _total.

5. Add histograms and percentiles

As we are talking about latencies, you may also be interested in histograms, to get the distribution of events per bucket, or in percentile values (the famous 0.99, 0.999...). Fortunately, Timers can publish both histograms and percentiles!

πŸ“ Modify the two timers as follows:

// ...
@Service
public class PaymentService {
    // ...
    public PaymentService(
            // ...
            ) {
        // ...

        processTimer = Timer
                .builder("rivieradev.payment.process")
                .description("Payment processing time")
                .publishPercentileHistogram() (1)
                .publishPercentiles(0.5, 0.90, 0.95, 0.99, 0.999) (2)
                .register(meterRegistry);
        storeTimer = Timer
                .builder("rivieradev.payment.store")
                .description("Payment store time")
                .publishPercentileHistogram()
                .publishPercentiles(0.5, 0.90, 0.95, 0.99, 0.999)
                .register(meterRegistry);
    }
  1. Configures the Timer to publish a histogram, allowing aggregable percentiles to be computed server-side,
  2. Exposes percentile values computed by the application: these values are not aggregable!

πŸ› οΈ Repeat the step 4 (redeploy easypay-service and get Prometheus metrics in-container).

You should get many more metrics, especially a new type suffixed with _bucket:

# HELP rivieradev_payment_process_seconds Payment processing time
# TYPE rivieradev_payment_process_seconds histogram
rivieradev_payment_process_seconds_bucket{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local",le="0.001"} 0
// ...
rivieradev_payment_process_seconds_bucket{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local",le="30.0"} 0
rivieradev_payment_process_seconds_bucket{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local",le="+Inf"} 0
rivieradev_payment_process_seconds_count{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local"} 0
rivieradev_payment_process_seconds_sum{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local"} 0.0
# HELP rivieradev_payment_process_seconds_max Payment processing time
# TYPE rivieradev_payment_process_seconds_max gauge
rivieradev_payment_process_seconds_max{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local"} 0.0

ℹ️ Each bucket contains the number of events which lasted less than the value defined in its le tag.
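
With such buckets, aggregable percentiles can be computed server-side with the histogram_quantile PromQL function, for example the 99th percentile (a sketch):

histogram_quantile(0.99, sum by (le) (rate(rivieradev_payment_process_seconds_bucket[1m])))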

6. Visualization

πŸ› οΈ Go back to Grafana (port 3000), and go into the Dashboards section.

πŸ› οΈ We will import the dashboard defined in the docker/grafana/dashboards/easypay-monitoring.json file:

You should be redirected to the Easypay Monitoring dashboard.

It provides some panels we created from the new metrics you exposed in your application.

πŸ› οΈ You can generate some load to view your dashboards evolving live:

$ k6 run -u 2 -d 2m k6/01-payment-only.js

In this section, we'll explore distributed tracing, the third pillar of application observability.

Distributed tracing is an essential tool for monitoring and analyzing the performance of complex applications. It tracks the flow of requests across multiple services and components, helping to identify bottlenecks and improve efficiency β€” particularly useful for intricate systems like Easypay.

With Spring Boot, there are a couple of approaches to incorporate distributed tracing into your application, notably Micrometer Tracing or an OpenTelemetry Java Agent attached to the JVM.

For this workshop, we'll use the Java Agent method and, with a focus on Grafana, we will employ their version of the OpenTelemetry Java Agent.

The Grafana Alloy collector will be used once again, tasked with receiving traces and forwarding them to the Tempo backend.

Lastly, we will use Grafana to examine and interpret these traces, allowing us to better understand and optimize our application's performance.

Enable distributed tracing

To capture the entire transaction across all services in a trace, it's essential to instrument all the services in our application.

Download the Grafana OpenTelemetry Java Agent

If you're using Gitpod, the Java Agent should already be available as instrumentation/grafana-opentelemetry-java.jar.

πŸ› οΈ If you are participating in this workshop on your workstation, or if the file is missing, you can run the following script to download it:

$ bash -x scripts/download-agent.sh

Enable Java Agent

πŸ“ Since we are deploying the easypay-service using Docker, we need to modify the last lines of the easypay-service/src/main/docker/Dockerfile:

# ...
USER javauser

# Copy Java Agent into the container
COPY instrumentation/grafana-opentelemetry-java.jar /app/grafana-opentelemetry-java.jar

# Add the -javaagent flag to set up the JVM to start with our Java Agent
ENTRYPOINT ["java", "-javaagent:/app/grafana-opentelemetry-java.jar", "-cp","app:app/lib/*","com.worldline.easypay.EasypayServiceApplication"]

The ENTRYPOINT instruction specifies the default command that will be executed when the container starts.

πŸ› οΈ You can now build the updated easypay-service container image:

$ docker compose build easypay-service

Configure Grafana Alloy

It's time to set up Grafana Alloy for handling telemetry data. We will configure it to accept traces through the OpenTelemetry GRPC protocol (OTLP) on port 4317, and then forward them to Grafana Tempo, which listens on the host tempo on the same port 4317 (this setup specifically handles OTLP traces).

πŸ“ Please add the following configuration to the docker/alloy/config.alloy file:

// ...

// RECEIVER SETUP (OTLP GRPC) (1)
otelcol.receiver.otlp "default" {
	grpc {
		endpoint = "0.0.0.0:4317"
	}

	output {
		traces  = [otelcol.processor.batch.default.input]
	}
}

// BATCH PROCESSING FOR OPTIMIZATION (2)
otelcol.processor.batch "default" {
	output {
		traces  = [otelcol.exporter.otlp.tempo.input]
	}
}

// TRACE EXPORTING TO TEMPO (OTLP) (3)
otelcol.exporter.otlp "tempo" {
	client {
		endpoint = "tempo:4317"

		tls {
			insecure = true
		}
	}
}
  1. Setting up the otelcol.receiver.otlp receiver to accept telemetry data over the OTEL protocol via GRPC, listening on port 4317,
  2. Configuring the processor to batch traces efficiently, reducing resource usage,
  3. Establishing the otelcol.exporter.otlp exporter to send collected telemetry data to the Grafana Tempo service.

ℹ️ The Grafana OpenTelemetry Java Agent is pre-configured to transmit telemetry data directly to the collector. This setup is facilitated through environment variables specified in the compose.yml file:

services:
  # ...
  easypay-service:
    # ..
    environment:
      # ...
      OTEL_SERVICE_NAME: easypay-service (1)
      OTEL_EXPORTER_OTLP_ENDPOINT: http://collector:4317 (2)
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc (3)
    # ...
  1. OTEL_SERVICE_NAME defines a service name which will be attached to traces to identify the instrumented service,
  2. OTEL_EXPORTER_OTLP_ENDPOINT environment variable configures where the telemetry data should be sent,
  3. OTEL_EXPORTER_OTLP_PROTOCOL sets the OTLP transport protocol, here gRPC (HTTP is also possible).

πŸ› οΈ To apply the new settings, restart Grafana Alloy with the following command:

$ docker compose restart collector

βœ… After restarting, verify that Grafana Alloy is up and running with the updated configuration by accessing the Alloy dashboard on port 12345.

πŸ› οΈ Redeploy the updated easypay-service:

$ docker compose up -d easypay-service

βœ… To ensure easypay-service has started up correctly, check its logs with:

$ docker compose logs -f easypay-service

βœ… If the Java agent was correctly loaded, the logs should start with:

easypay-service  | OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
easypay-service  | [otel.javaagent 2024-07-04 15:50:40:219 +0000] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 2.4.0
easypay-service  | [otel.javaagent 2024-07-04 15:50:40:599 +0000] [main] INFO com.grafana.extensions.instrumentations.TestedInstrumentationsContext - Grafana OpenTelemetry Javaagent: version=2.4.0-beta.1, includeAllInstrumentations=true, useTestedInstrumentations=false, includedUntestedInstrumentations=[], excludedInstrumentations=[]
easypay-service  | [otel.javaagent 2024-07-04 15:50:41:114 +0000] [main] INFO io.opentelemetry.sdk.resources.Resource - Attempting to merge Resources with different schemaUrls. The resulting Resource will have no schemaUrl assigned. Schema 1: https://opentelemetry.io/schemas/1.24.0 Schema 2: https://opentelemetry.io/schemas/1.23.1

Explore Traces with Grafana

πŸ› οΈ Generate some load on the application to produce traces:

$ k6 run -u 1 -d 5m k6/01-payment-only.js

πŸ› οΈ Let's explore your first traces in Grafana:

πŸ‘€ Click on Service Graph and explore the Node graph: this view is extremely helpful for visualizing and understanding how our services communicate with each other.

πŸ‘€ Go back to Search and click on Run query. You should see a table named Table - Traces. By default, this view provides the most recent traces available in Tempo.

πŸ› οΈ Let's find an interesting trace using the query builder:

You should see the full stack of the corresponding transaction.

πŸ‘€ Grafana should open a new view (you can enlarge it by clicking on the three vertical dots and selecting Widen pane):

πŸ› οΈ Grafana allows to display a graph of spans as interconnected nodes:

πŸ› οΈ Continue your exploration in the Search pane:

Sampling

When we instrument our services using the agent, every interaction, including Prometheus calls to the actuator/prometheus endpoint, is recorded.

In the Service Graph, you should have seen links between the User and services other than the api-gateway. This seems abnormal: we only created payments through the api-gateway!

πŸ‘€ If you click on such a link and select View traces, you should see a lot of traces regarding actuator/prometheus.

To avoid storing unnecessary data in Tempo, we can sample the data in two ways: when the trace starts (Head Sampling) or once the complete trace is available (Tail Sampling).

In this workshop, we will implement Tail Sampling.

πŸ“ Modify the Alloy configuration file (docker/alloy/config.alloy) as follows:

// ...
// RECEIVER (OTLP)
otelcol.receiver.otlp "default" {
	grpc {
		endpoint = "0.0.0.0:4317"
	}

	output {
		traces  = [otelcol.processor.tail_sampling.actuator.input] // (1)
	}
}

// TAIL SAMPLING (2)
otelcol.processor.tail_sampling "actuator" {
  // Filter on http.url attribute (3)
	policy {
		name = "filter_http_url"
		type = "string_attribute"
		string_attribute {
			key = "http.url"
			values = ["/actuator/health", "/actuator/prometheus"]
			enabled_regex_matching = true
			invert_match = true
		}
	}

  // Filter on url.path attribute (3)
	policy {
		name = "filter_url_path"
		type = "string_attribute"
		string_attribute {
			key = "url.path"
			values = ["/actuator/health", "/actuator/prometheus"]
			enabled_regex_matching = true
			invert_match = true
		}
	}

	output {
		traces = [otelcol.processor.batch.default.input] // (4)
	}
}
  1. Modify the output of the otelcol.receiver.otlp to export traces to the otelcol.processor.tail_sampling component defined just after.
  2. Create a new otelcol.processor.tail_sampling component.
  3. Configure it with two policies based on span attributes.
  4. Export non-filtered spans to the otelcol.processor.batch processor we defined previously.

This configuration will filter out the spans created by /actuator API calls.

πŸ› οΈ Restart the Alloy collector:

$ docker compose restart collector

Starting from this moment, you should no longer see traces related to actuator/health or actuator/prometheus endpoints.

Custom Traces

Just like metrics, it is also possible to add your own spans on arbitrary methods to provide more business value to the observability of your application.

Let's return to our code!

Objectives

We want to add new spans to the traces generated in the easypay-service application to track payment processing and store events.

To achieve this goal, we will create new spans when the process and store methods of the com.worldline.easypay.payment.control.PaymentService class in the easypay-service module are invoked.

As a reminder, this class is the central component responsible for processing payments. It provides the public method accept, which delegates its responsibilities to the two private methods process and store.

1. Add Required Dependencies

We need to add the io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations dependency to our module to access some useful annotations.

πŸ‘€ This has already been done in advance for this workshop. The following dependencies were added to the Gradle build file (build.gradle.kts) of the easypay-service module:

dependencies {
  //...

	// Add opentelemetry support
	implementation(platform("io.opentelemetry:opentelemetry-bom:1.38.0"))
	implementation("io.opentelemetry:opentelemetry-api")
	implementation("io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations:2.5.0")

  // ...
}

2. Add Custom Spans

πŸ“ To add new spans based on methods, we can simply use the @WithSpan Java annotation. When a traced transaction invokes the annotated method, a new span will be created. Here's how to do it:

// ...
import io.opentelemetry.instrumentation.annotations.WithSpan;

@Service
public class PaymentService {
    // ...

    @WithSpan("RivieraDev: Payment processing method")
    private void process(PaymentProcessingContext context) {
        //...
    }

    @WithSpan("RivieraDev: Payment store method")
    private void store(PaymentProcessingContext context) {
        //...
    }

πŸ“ We can also provide additional information to the span, such as method parameters using the @SpanAttribute annotation:

// ...
import io.opentelemetry.instrumentation.annotations.SpanAttribute;

@Service
public class PaymentService {
    // ...
    
    @WithSpan("RivieraDev: Payment processing method")
    private void process(@SpanAttribute("context") PaymentProcessingContext context) { // <-- HERE
        // ...
    }

    @WithSpan("RivieraDev: Payment store method")
    private void store(@SpanAttribute("context") PaymentProcessingContext context) { // <-- HERE
        // ...
    }

This will attach the whole PaymentProcessingContext to the span as an attribute.

3. Build and redeploy

πŸ› οΈ As we did before:

$ docker compose build easypay-service
$ docker compose up -d easypay-service

4. Test it!

πŸ› οΈ Generate some payments:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000

πŸ‘€ Go back to Grafana and try to find your new traces using what you've learned previously. Observe the spans you added.

Grafana allows correlation between all our telemetry data: logs with traces, metrics with traces (through exemplars), traces with logs, and traces with metrics.

When discussing observability, correlation is essential. It enables you to diagnose the root cause of an issue quickly by using all the telemetry data involved.

Logs and traces

πŸ› οΈ Let's go back to the Grafana Explore dashboard:

πŸ‘€ Now check whether there is an mdc JSON element that includes both the trace_id and span_id fields. These will help us correlate our different request logs and traces.

Enable correlation

πŸ› οΈ In Grafana, go to Connections > Data sources:

βœ… To validate the configuration, you can put an example log message in this view:

{"sequenceNumber":0,"timestamp":1719910676210,"nanoseconds":210150314,"level":"INFO","threadName":"kafka-binder-health-1","loggerName":"org.apache.kafka.clients.NetworkClient","context":{"name":"default","birthdate":1719907122576,"properties":{"applicationName":"easypay-service","instance":"easypay-service:9a2ac3f0-c41e-4fcd-8688-123993f1d5db"}},"mdc": {"trace_id":"8b277041692baa8167de5c67977d6571","trace_flags":"01","span_id":"13ff9e44be450b8e"},"message":"[Consumer clientId=consumer-null-1, groupId=null] Node -1 disconnected.","throwable":null}

It should display a table:

πŸ› οΈ Go back to the Grafana Explore dashboard and try to find the same kind of log message:

πŸ‘€ Grafana should open a pane with the corresponding trace from Tempo!

Now you can correlate logs and traces!
If you encounter any exceptions in your error logs, you can now see where they happen and get the bigger picture of the transaction from the customer's point of view.

How was it done?

First of all, logs should contain the trace_id information.
Most frameworks handle this for you. Whenever a request generates a trace or span, the value is placed in the MDC (Mapped Diagnostic Context) and can be printed in the logs.

On the other hand, Grafana has the ability to parse logs to extract certain values for the Loki data source. This is the purpose of Derived fields.

When configuring the Loki data source, we provided Grafana with the regex to extract the trace ID from logs and linked it to the Tempo data source. Behind the scenes, Grafana creates the bridge between the two telemetry data sources. And that's all 😎
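
As an illustration, such a derived field relies on a regular expression applied to the log line, similar to this one (an assumption; the exact expression depends on your log format):

"trace_id":"(\w+)"

The first capture group becomes the value of the derived field, which is then linked to the Tempo data source.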

Metrics and Traces (Exemplars)

Exemplars are annotations used in metrics that link specific occurrences, like logs or traces, to data points within a metric time series. They provide direct insights into system behaviors at moments captured by the metric, aiding quick diagnostics by showing related trace data where anomalies occur. This feature is valuable for debugging, offering a clearer understanding of metrics in relation to system events.

πŸ› οΈ Generate some load towards the easypay-service:

$ k6 run -u 1 -d 2m k6/01-payment-only.js

πŸ‘€ Now, let's see how exemplars are exported by our service:

$ docker compose exec -it easypay-service sh
$ curl http://localhost:8080/actuator/prometheus -H 'Accept: application/openmetrics-text' | grep 'trace_id'

You should obtain metrics with the following format:

http_server_requests_seconds_bucket{application="easypay-service",error="none",exception="none",instance="easypay-service:39a9ae31-f73a-4a63-abe5-33049b8272ca",method="GET",namespace="local",outcome="SUCCESS",status="200",uri="/actuator/prometheus",le="0.027962026"} 1121 # {span_id="d0cf53bcde7b60be",trace_id="969873d828346bb616dca9547f0d9fc9"} 0.023276118 1719913187.631

The interesting part starts after the # character: this is the so-called exemplar:

               SPAN ID                           TRACE ID                     VALUE      TIMESTAMP
# {span_id="d0cf53bcde7b60be",trace_id="969873d828346bb616dca9547f0d9fc9"} 0.023276118 1719913187.631

That could be translated as: an observation of 0.023276118 seconds was recorded at timestamp 1719913187.631, while serving the request traced with trace ID 969873d828346bb616dca9547f0d9fc9 (span ID d0cf53bcde7b60be).

πŸ‘€ Exemplars can be analyzed in Grafana:

http_server_requests_seconds_count{application="easypay-service", uri="/payments"}

πŸ‘€ In addition to the line graph, you should see square dots at the bottom of the graph:

Enable correlation

πŸ› οΈ In Grafana, go to the Connections > Data sources section:

πŸ› οΈ Go back to the Grafana Explore dashboard and try to find the same exemplar as before:

πŸ‘€ Grafana should open a new pane with the corresponding trace from Tempo!

We have added a new correlation dimension to our system between metrics and traces!

How was it done?

Beyond the integration in Grafana and the setup of the Prometheus data source, we had to configure our application to make Micrometer expose exemplars with trace identifiers.

Firstly, we added the following dependencies to our application (easypay-service/build.gradle.kts), especially the io.prometheus:prometheus-metrics-tracer-otel-agent which provides the necessary classes:

dependencies {
  // ...
  implementation(platform("io.opentelemetry:opentelemetry-bom:1.38.0"))
  implementation("io.opentelemetry:opentelemetry-api")
  implementation("io.prometheus:prometheus-metrics-tracer-otel-agent:1.3.1")
}

Then, in our Spring Boot application, we added a new class, annotated with @Configuration, that provides an OpenTelemetryAgentSpanContext bean. This bean is able to retrieve the current trace ID from distributed tracing. Here is the implementation of com.worldline.easypay.config.PrometheusRegistryConfiguration.java:

package com.worldline.easypay.config;

import org.springframework.boot.autoconfigure.condition.ConditionalOnClass;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import io.prometheus.metrics.tracer.otel_agent.OpenTelemetryAgentSpanContext;

@Configuration // (1)
public class PrometheusRegistryConfiguration {
    
    @Bean
    @ConditionalOnClass(name="io.opentelemetry.javaagent.shaded.io.opentelemetry.api.trace.Span") // (2)
    public OpenTelemetryAgentSpanContext exemplarConfigSupplier() {
        return new OpenTelemetryAgentSpanContext(); // (3)
    }
}
  1. Declare the class as a configuration provider for Spring,
  2. The bean injection is enabled only if the class is present (i.e., when the Java agent is attached); otherwise, the application would not start due to missing classes,
  3. Inject the OpenTelemetryAgentSpanContext, which will be used by the Micrometer Prometheus registry to export exemplars based on the trace ID.

At this point, your application is ready to export exemplars with metrics once an OpenTelemetry Java agent is attached to the JVM!

Traces to Logs

We are able to link logs to traces thanks to Loki's data source Derived fields, but what about traces to logs?

Fortunately, the Tempo data source can be configured the same way!

Enable correlation

πŸ› οΈ In Grafana, go to Connections > Data sources:

ℹ️ Just to give you some context about this configuration:

Test correlation

πŸ› οΈ Hit the easypay payment endpoint with curl or k6 to generate some traces (whichever you prefer):

πŸ› οΈ Open the Grafana Explore dashboard to find a trace:

πŸ‘€ A new LOG icon should appear for each line of the trace (middle of the screen):

Yeah, we have added a new dimension to the correlation of our telemetry data, further improving our observability.

Traces to Metrics

The last correlation we will explore in today's workshop is between traces and metrics.

Sometimes, we are interested in knowing the state of our application when inspecting traces, such as JVM heap or system CPU usage.

In this section, we will configure the Tempo data source to link our traces to these metrics.

Enable correlation

πŸ› οΈ In Grafana, go to Connections > Data sources:

πŸ› οΈ Now, we will add some metric queries (click on + Add query for each query):

πŸ› οΈ Finally click on Save & test and go back to Grafana Explore dashboard to test our new setup.

Test correlation

πŸ› οΈ Hit the easypay payment endpoint with curl or k6 to generate some traces (whichever you prefer):

πŸ› οΈ Open the Grafana Explore dashboard to find a trace:

πŸ‘€ The previous LOG icon has been replaced by a link icon (middle of the screen):

This was the last correlation dimension we wanted to show you!

The OpenTelemetry project standardizes telemetry signals, particularly the logs, metrics, and traces we have seen so far.
However, just a few months ago, they announced their work on a fourth signal: profiling.

Profiling involves measuring the performance characteristics of your application, such as execution time, CPU utilization, or memory usage. It helps identify bottlenecks and optimize resource usage in your application.

If you're familiar with Java, you may already know async_profiler for HotSpot JVMs. It defines itself as a "low overhead sampling profiler for Java."

You may have also heard about eBPF, a technology embedded in the Linux kernel that allows running code in a sandbox within the kernel space. This technology is gaining traction in service meshes and in continuous profiling.

Continuous profiling is an ongoing area of interest in the observability field, aimed at finding additional performance improvements.

(Grafana) Pyroscope

Pyroscope started as an open-source project for continuous profiling. It consists of a server that receives profiling samples, which can then be analyzed and displayed as flamegraphs.

In the Java landscape, it offers a Java Agent based on async_profiler, compatible with other agents such as the OpenTelemetry agent. Phew!

In 2023, Grafana acquired Pyroscope and merged it with its own solution, Phlare. Welcome to Grafana Pyroscope!

If you want to know more about Continuous Profiling and what it can bring to you, you may want to check out the Grafana Pyroscope documentation.

Objectives

In this section, we aim to show you:

Enable Continuous Profiling

Start the Pyroscope server

πŸ› οΈ We will use the grafana/pyroscope container image: we already defined a pyroscope service in our compose.yml file, but it is not yet enabled. You can start it by enabling the continuous-profiling profile:

$ docker compose --profile=continuous-profiling up -d

βœ… It should start a new service on port 4040.

πŸ› οΈ Go to Grafana Pyroscope dashboard UI on port 4040:

Setup easypay-service for Continuous Profiling

Let's use an agent again to profile our application.

πŸ› οΈ First, download the agent. You can use the provided script to download it as instrumentation/pyroscope.jar:

$ bash ./scripts/download-pyroscope-agent.sh

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 9782k  100 9782k    0     0  10.2M      0 --:--:-- --:--:-- --:--:-- 13.9M
Grafana Pyroscope agent downloaded successfully in ./scripts/../instrumentation

πŸ“ Just like for traces, we should modify the easypay-service/src/main/docker/Dockerfile file:

# ...
# Copy Java Agent into the container
COPY instrumentation/grafana-opentelemetry-java.jar /app/grafana-opentelemetry-java.jar

# Copy pyroscope Java Agent
COPY instrumentation/pyroscope.jar /app/pyroscope.jar

# Add the -javaagent flags to set up the JVM to start with both Java Agents
ENTRYPOINT ["java", "-javaagent:/app/grafana-opentelemetry-java.jar", "-javaagent:/app/pyroscope.jar", "-cp","app:app/lib/*","com.worldline.easypay.EasypayServiceApplication"]

πŸ“ Pyroscope can be configured with environment variables. Let's do it directly in the compose.yml file:

  easypay-service:
    # ...
    environment:
      # ...
      PYROSCOPE_APPLICATION_NAME: easypay-service # (1)
      PYROSCOPE_FORMAT: jfr # (2)
      PYROSCOPE_PROFILER_EVENT: wall # (3)
      PYROSCOPE_PROFILER_LOCK: 10ms # (4)
      PYROSCOPE_PROFILER_ALLOC: 512k # (5)
      PYROSCOPE_LOG_LEVEL: debug
      PYROSCOPE_SERVER_ADDRESS: http://pyroscope:4040 # (6)
    # ...
  1. Define an application name (this will create the service_name label),
  2. Set the format: JFR allows several event types to be recorded,
  3. Type of event to profile: wall records the time spent in methods. Other valid values are itimer and cpu,
  4. Threshold to record lock events,
  5. Threshold to record memory allocation events,
  6. Pyroscope server address.

πŸ› οΈ Rebuild and redeploy easypay-service:

$ docker compose build easypay-service
$ docker compose up -d easypay-service

βœ… Check logs for correct startup:

$ docker compose logs -f easypay-service

You should see additional logs related to Pyroscope.

πŸ‘€ Go back to the Pyroscope dashboard (port 4040):

Grafana setup

πŸ‘€ Let's go to the Grafana Explore dashboard:

πŸ› οΈ Generate some load with k6:

$ k6 run -u 1 -d 5m k6/02-payment-smartbank.js

πŸ‘€ Try to find what is taking the most time in the ProcessPayment.processPayment method: