This workshop aims to introduce how to make a Java application fully observable with logs, metrics, and distributed traces.
During this workshop we will use the Grafana stack and Prometheus:
We will also cover the OpenTelemetry Collector, which gathers and then forwards the data coming from our microservices.
And as we are talking about Grafana, you will use Grafana Alloy, one of its implementations (it replaces the deprecated Grafana Agent).
Centralises all API calls.
Payment microservice which accepts (or declines) payments.
This is how it validates every payment:
If the payment is validated, it stores it and broadcasts it to all the other microservices through Kafka.
After fetching a message from the Kafka topic, this service searches its database to check whether the payment's card number is registered for fraud. In this case, only a WARN log is emitted.
For this lab, it only simulates the subscription of messages.
This external service authorizes the payment.
As mentioned earlier, our observability stack is composed of:
In addition, the microservices are started with an agent to broadcast the traces to the collector.
You will see the following icons during the workshop:

- 🛠️ An action to perform,
- 📝 A file to modify,
- 👀 Something to observe,
- ✅ Validate something,
- ℹ️ Some information.
| Skill | Level |
|-------|-------|
|  | proficient |
|  | proficient |
|  | novice |
|  | proficient |
|  | novice |
|  | novice |
|  | novice |
|  | novice |
You MUST have set up these tools first:
🛠️ Here are commands to validate your environment:
Java
$ java -version
openjdk version "21.0.3" 2024-04-16 LTS
OpenJDK Runtime Environment Temurin-21.0.3+9 (build 21.0.3+9-LTS)
OpenJDK 64-Bit Server VM Temurin-21.0.3+9 (build 21.0.3+9-LTS, mixed mode, sharing)
Gradle
🛠️ If you use the wrapper, you won't have any trouble. Otherwise:
$ gradle -version
------------------------------------------------------------
Gradle 8.7
------------------------------------------------------------
Build time: 2024-03-22 15:52:46 UTC
Revision: 650af14d7653aa949fce5e886e685efc9cf97c10
Kotlin: 1.9.22
Groovy: 3.0.17
Ant: Apache Ant(TM) version 1.10.13 compiled on January 4 2023
JVM: 21.0.3 (Eclipse Adoptium 21.0.3+9-LTS)
OS: Linux 5.15.146.1-microsoft-standard-WSL2 amd64
Docker Compose
$ docker compose version
Docker Compose version v2.24.7
It's strongly recommended to use Gitpod. You must create an account first. You then can open this project in either your local VS Code or directly in your browser:
We will assume you will use GitPod for this workshop :)
When a message invites you to make a URL public, accept it.
The "infrastructure stack" is composed of the following components:
ℹ️ If you run your application on GitPod, the following steps are automatically performed during the provisioning of your GitPod environment.
🛠️ Otherwise, to run it on your desktop, execute the following commands:
$ bash scripts/download-agent.sh
$ ./gradlew tasks
$ docker compose up -d --build --remove-orphans
✅ To check if all the services are up, you can run this command:
$ docker compose ps -a
And check the status of every service.
For instance:
❯ docker compose ps
NAME IMAGE COMMAND SERVICE CREATED STATUS PORTS
api-gateway api-gateway:latest "java -javaagent:/ap..." api-gateway 8 minutes ago Up 7 minutes (healthy) 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp
config-server config-server:latest "java -javaagent:/ap..." config-server 8 minutes ago Up 7 minutes (healthy) 0.0.0.0:8888->8888/tcp, :::8888->8888/tcp
discovery-server discovery-server:latest "java -javaagent:/ap..." discovery-server 8 minutes ago Up 7 minutes (healthy) 0.0.0.0:8761->8761/tcp, :::8761->8761/tcp
easypay-service easypay-service:latest "java -javaagent:/ap..." easypay-service 8 minutes ago Up 7 minutes (healthy)
fraudetect fraudetect-service:latest "java -javaagent:/ap..." fraudetect-service 8 minutes ago Up 7 minutes (healthy)
kafka confluentinc/cp-kafka:7.6.1 "/etc/confluent/dock..." kafka 8 minutes ago Up 8 minutes (healthy) 9092/tcp, 0.0.0.0:19092->19092/tcp, :::19092->19092/tcp
merchant-backoffice merchant-backoffice:latest "java -javaagent:/ap..." merchant-backoffice 8 minutes ago Up 7 minutes (healthy)
observability-workshop-collector-1 grafana/alloy:latest "/bin/alloy run --se..." collector 8 minutes ago Up 8 minutes 0.0.0.0:4317-4318->4317-4318/tcp, :::4317-4318->4317-4318/tcp, 0.0.0.0:12345->12345/tcp, :::12345->12345/tcp
observability-workshop-grafana-1 grafana/grafana:latest "sh -xeuc 'mkdir -p ..." grafana 8 minutes ago Up 7 minutes 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp
observability-workshop-loki-1 grafana/loki:latest "/usr/bin/loki -conf..." loki 8 minutes ago Up 7 minutes 0.0.0.0:3100->3100/tcp, :::3100->3100/tcp
observability-workshop-postgres-easypay-1 postgres:16 "docker-entrypoint.s..." postgres-easypay 8 minutes ago Up 8 minutes (healthy) 0.0.0.0:5432->5432/tcp, :::5432->5432/tcp
observability-workshop-postgres-fraudetect-1 postgres:16 "docker-entrypoint.s..." postgres-fraudetect 8 minutes ago Up 7 minutes (healthy) 0.0.0.0:5434->5432/tcp, :::5434->5432/tcp
observability-workshop-postgres-merchantbo-1 postgres:16 "docker-entrypoint.s..." postgres-merchantbo 8 minutes ago Up 8 minutes (healthy) 0.0.0.0:5435->5432/tcp, :::5435->5432/tcp
observability-workshop-postgres-smartbank-1 postgres:16 "docker-entrypoint.s..." postgres-smartbank 8 minutes ago Up 7 minutes (healthy) 0.0.0.0:5433->5432/tcp, :::5433->5432/tcp
observability-workshop-prometheus-1 prom/prometheus:v2.52.0 "/bin/prometheus --c..." prometheus 8 minutes ago Up 8 minutes 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp
observability-workshop-tempo-1 grafana/tempo:latest "/tempo -config.file..." tempo 8 minutes ago Up 7 minutes 0.0.0.0:3200->3200/tcp, :::3200->3200/tcp, 0.0.0.0:9095->9095/tcp, :::9095->9095/tcp, 0.0.0.0:9411->9411/tcp, :::9411->9411/tcp, 0.0.0.0:14268->14268/tcp, :::14268->14268/tcp
smartbank-gateway smartbank-gateway:latest "java -Xmx4g -javaag..." smartbank-gateway 8 minutes ago Up 7 minutes (unhealthy)
✅ Open the Eureka website started during the infrastructure setup. The following instances should be registered with Eureka:
✅ All services should be registered before continuing...
🛠️ You can now access our platform to initiate a payment:
$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=25000
✅ You should get the following content:
HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:42:12 GMT
Location: http://172.19.25.95:44523/payments/3cd8df14-8c39-460b-a429-dc113d003aed
transfer-encoding: chunked
{
"amount": 25000,
"authorId": "5d364f1a-569c-4c1d-9735-619947ccbea6",
"authorized": true,
"bankCalled": true,
"cardNumber": "5555567898780008",
"cardType": "MASTERCARD",
"expiryDate": "789456123",
"paymentId": "3cd8df14-8c39-460b-a429-dc113d003aed",
"posId": "POS-01",
"processingMode": "STANDARD",
"responseCode": "ACCEPTED",
"responseTime": 414
}
One of our customers raised an issue:
« When I reach your API, I usually get either an AMOUNT_EXCEEDED or INVALID_CARD_NUMBER error. »
Normally, the first thing to do is to check the logs. But before that, let's reproduce these issues.
🛠️ You can exercise the API as follows:
AMOUNT_EXCEEDED error:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=51000
HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:45:40 GMT
Location: http://172.19.25.95:44523/payments/5459b20a-ac91-458f-9578-019c05483bb3
transfer-encoding: chunked
{
"amount": 51000,
"authorId": "6ace318f-b669-4e4a-b366-3f09048becb7",
"authorized": false,
"bankCalled": true,
"cardNumber": "5555567898780008",
"cardType": "MASTERCARD",
"expiryDate": "789456123",
"paymentId": "5459b20a-ac91-458f-9578-019c05483bb3",
"posId": "POS-01",
"processingMode": "STANDARD",
"responseCode": "AUTHORIZATION_DENIED",
"responseTime": 25
}
INVALID_CARD_NUMBER error:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780007 expiryDate=789456123 amount:=25000
HTTP/1.1 201 Created
Content-Type: application/json
Date: Wed, 05 Jun 2024 13:46:09 GMT
Location: http://172.19.25.95:44523/payments/2dbf3823-fb11-4c63-a540-ab43ac663e68
transfer-encoding: chunked
{
"amount": 51000,
"authorId": null,
"authorized": false,
"bankCalled": false,
"cardNumber": "5555567898780007",
"cardType": null,
"expiryDate": "789456123",
"paymentId": "2dbf3823-fb11-4c63-a540-ab43ac663e68",
"posId": "POS-01",
"processingMode": "STANDARD",
"responseCode": "INVALID_CARD_NUMBER",
"responseTime": 5
}
🛠️ It's time to have a look at our easypay-service logs!

The service runs in a Docker container. To see its output, use the following command:
$ docker compose logs -f easypay-service
👀 Look into the console logs to pinpoint these issues.
It's time to add more contextual information into our code!
In this workshop, we will use SLF4J.
The logger can be created by adding a class variable such as:
private static final Logger log = LoggerFactory.getLogger(BankAuthorService.class);
Remember to use the enclosing class when instantiating it!
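For reference, here is a minimal self-contained sketch with the required SLF4J imports:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BankAuthorService {
    // Name the logger after the enclosing class so log records are attributed to it
    private static final Logger log = LoggerFactory.getLogger(BankAuthorService.class);
}
```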
Use the most appropriate log level

The log level is a fundamental concept in logging. Whatever logging framework you use, it allows you to tag log records according to their severity or importance. For instance, SLF4J offers the following log levels by default:

- TRACE: typically used to provide detailed diagnostic information that can be used for troubleshooting and debugging. Compared to DEBUG messages, TRACE messages are more fine-grained and verbose.
- DEBUG: used to provide information that can be used to diagnose issues, especially those related to program state.
- INFO: used to record events that indicate that the program is functioning normally.
- WARN: used to record potential issues in your application. They may not be critical but should be investigated.
- ERROR: records unexpected errors that occur during the operation of your application. In most cases, the error should be addressed as soon as possible to prevent further problems or outages.

The AMOUNT_EXCEEDED issue

📝 Go to the easypay-service/src/main/java/com/worldline/easypay/payment/control/bank/BankAuthorService.java class and modify the following code block:
@Retry(name = "BankAuthorService", fallbackMethod = "acceptByDelegation")
public boolean authorize(PaymentProcessingContext context) {
LOG.info("Authorize payment for {}", context);
try {
var response = client.authorize(initRequest(context));
context.bankCalled = true;
context.authorId = Optional.of(response.authorId());
context.authorized = response.authorized();
return context.authorized;
} catch (Exception e) {
LOG.warn("Should retry or fallback: {}", e.getMessage());
throw e;
}
}
Add a log before the return context.authorized; instruction to provide contextual information such as the authorId and the result of the call (authorized or not).
If you feel lazy, you can write something like this:
[...]
context.bankCalled = true;
context.authorId = Optional.of(response.authorId());
context.authorized = response.authorized();
LOG.info("Bank answered payment authorization with authorId={}, authorized={}", response.authorId(), response.authorized());
return context.authorized;
The INVALID_CARD_NUMBER issue

📝 Go to the easypay-service/src/main/java/com/worldline/easypay/payment/control/PaymentService.java class and modify the following code block in the process method in the same way:
private void process(PaymentProcessingContext context) {
[...]
if (!cardValidator.checkCardNumber(context.cardNumber)) {
context.responseCode = PaymentResponseCode.INVALID_CARD_NUMBER;
return;
}
[...]
For this error, you can log a message containing the whole PaymentProcessingContext, as shown in the sketch below.
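For instance, a minimal sketch (assuming a LOG field declared in PaymentService as shown earlier; the exact message wording is up to you):

```java
if (!cardValidator.checkCardNumber(context.cardNumber)) {
    // Log the full processing context so the failing request can be identified
    LOG.warn("Invalid card number for payment request: {}", context);
    context.responseCode = PaymentResponseCode.INVALID_CARD_NUMBER;
    return;
}
```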
📝 You can also add more logs, especially in the easypay-service/src/main/java/com/worldline/easypay/payment/control/CardValidator.java class.

You may simply uncomment all the commented log lines (// LOG....), such as:

- In the CardValidator.checkLunKey() method, add a warning message when the key is not valid. For instance: LOG.warn("Check card number Luhn key does not pass: {}", cardNumber);
- In the CardValidator.checkExpiryDate() method, add a warning message when a DateTimeParseException is thrown. For instance: LOG.warn("Check card expiry date does not pass: bad format: {}", expiryDate);
You can go further and add as many logs as you think would help in production 🙂
We can also log payment requests and responses to provide even more context, which could be helpful for following requests in this workshop.
📝 Modify the easypay-service/src/main/java/com/worldline/easypay/payment/boundary/PaymentResource.java class by uncommenting all // LOG.... lines (keep the MDC lines for later 🙂).
🛠️ You can redeploy the easypay-service with the following commands:
$ docker compose build easypay-service
$ docker compose up -d easypay-service
🛠️ Now you can run the same commands as earlier (http POST...) and check the logs again.
Another issue was raised for the POS (Point of Sale) POS-02 (but we didn't know it yet!).
🛠️ When you reach the API using this command:
http POST :8080/api/easypay/payments posId=POS-02 cardNumber=5555567898780008 expiryDate=789456123 amount:=25000
👀 You should get the following output:
HTTP/1.1 500
[...]
{
"error":"Internal Server Error",
"path": "/payments",
"status": 500,
[...]
}
👀 You then get the following log message:
2024-06-05T15:45:35.215+02:00 ERROR 135386 --- [easypay-service] [o-auto-1-exec-7] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed: java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "java.util.List.get(int).active" is null] with root cause
java.lang.NullPointerException: Cannot invoke "java.lang.Boolean.booleanValue()" because "java.util.List.get(int).active" is null
at com.worldline.easypay.payment.control.PosValidator.isActive(PosValidator.java:34) ~[main/:na]
at com.worldline.easypay.payment.control.PaymentService.process(PaymentService.java:46) ~[main/:na]
at com.worldline.easypay.payment.control.PaymentService.accept(PaymentService.java:108) ~[main/:na]
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:354) ~[spring-aop-6.1.6.jar:6.1.6]
at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:196) ~[spring-aop-6.1.6.jar:6.1.6]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-6.1.6.jar:6.1.6]
[...]
📝 To find the root cause, first add a log entry in the easypay-service/src/main/java/com/worldline/easypay/payment/control/PosValidator.java class. In the isActive() method, catch the exception and trace the error:
public boolean isActive(String posId) {
PosRef probe = new PosRef();
probe.posId = posId;
try {
List<PosRef> posList = posRefRepository.findAll(Example.of(probe));
if (posList.isEmpty()) {
LOG.warn("checkPosStatus NOK, unknown posId {}", posId);
return false;
}
boolean result = posList.get(0).active;
if (!result) {
LOG.warn("checkPosStatus NOK, inactive posId {}", posId);
}
return result;
} catch (NullPointerException e) {
LOG.warn("Invalid value for this POS: {}", posId);
throw e;
}
}
This is most likely an issue with some data in the database... Thanks to the logs, we quickly get the idea that the problem comes from the point of sale with ID POS-02.
It is easy to identify the issue if we don't have traffic against our application, but what if we have more realistic traffic?
🛠️ Generate some load with k6 (a Grafana tool for load testing):
$ k6 run -u 5 -d 30s k6/01-payment-only.js
👀 Check the logs again:
$ docker compose logs -f easypay-service
Logs are now interleaved due to several requests being executed concurrently. It becomes harder to identify which point of sale is causing this error...
👀 If you look into the SQL import file (easypay-service/src/main/resources/db/postgresql/data.sql), you'll notice a NULL value instead of a boolean for the active column.
📝 Go to the PaymentResource class and modify the processPayment() method to instantiate the MDC:
public ResponseEntity<PaymentResponse> processPayment(PaymentRequest paymentRequest) {
    MDC.put("CardNumber", paymentRequest.cardNumber());
    MDC.put("POS", paymentRequest.posId());
    [...]
    MDC.clear();
    return httpResponse;
}
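Note that if the method can throw between put and clear, the MDC entries would leak to the next request handled by the same thread. A defensive sketch (same fields as above, requires import org.slf4j.MDC):

```java
public ResponseEntity<PaymentResponse> processPayment(PaymentRequest paymentRequest) {
    MDC.put("CardNumber", paymentRequest.cardNumber());
    MDC.put("POS", paymentRequest.posId());
    try {
        // ... original processing ...
        return httpResponse;
    } finally {
        MDC.clear(); // always clean up, even if an exception is thrown
    }
}
```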
👀 Go to the MDC Spring profile configuration file (easypay-service/src/main/resources/application-mdc.properties) and check the configuration of both the CardNumber & POS fields.
[...] %clr(CardNumber=){faint}%clr(%X{CardNumber:-null}) %clr(POS=){faint}%clr(%X{POS:-null}) [...]
📝 Activate the mdc profile in the compose.yml file:
easypay-service:
image: easypay-service:latest
[...]
SPRING_PROFILES_ACTIVE: default,docker,mdc
🛠️ Rebuild and redeploy the easypay-service:
$ docker compose build easypay-service
$ docker compose up -d easypay-service
🛠️ To get more logs, we will run several HTTP requests using k6. Run the following command:
$ k6 run -u 5 -d 5s k6/01-payment-only.js
👀 Then check the logs to pinpoint some exceptions.
Logs are stored in the logs
folder. You should find a .log.json
file for each service in our application.
We have configured our Spring Boot applications to output logs:
- to the console (you can view them with docker compose logs -f),
- to a JSON file in the logs folder.

👀 You can take a look at log files such as easypay-service.log.json to see their structure: a lot of information is available!
In this section we will configure Grafana Alloy (our OpenTelemetry collector) to read these log files and forward them to Loki.
Loki, our storage backend, is listening on host loki
and port 3100
.
📝 Add the following Logging configuration to the docker/alloy/config.alloy file:
////////////////////
// LOGS
////////////////////
// JSON LOG FILES (1)
local.file_match "jsonlogs" {
path_targets = [{"__path__" = "/logs/*.json", "exporter" = "JSONFILE"}]
}
// (2)
loki.source.file "jsonlogfiles" {
targets = local.file_match.jsonlogs.targets
forward_to = [loki.process.jsonlogs.receiver]
}
// (3)
loki.process "jsonlogs" {
forward_to = [loki.write.endpoint.receiver]
// (4)
stage.json {
expressions = {
timestamp = "timestamp",
application = "context.properties.applicationName",
instance = "context.properties.instance",
level = "level",
}
}
// (5)
stage.labels {
values = {
application = "application",
instance = "instance",
level = "level",
}
}
// (6)
stage.timestamp {
source = "timestamp"
format = "RFC3339"
fallback_formats = ["UnixMs",]
}
}
// (7)
// EXPORTER (LOKI)
loki.write "endpoint" {
endpoint {
url = "http://loki:3100/loki/api/v1/push"
}
}
Thanks to this configuration, the JSON files are automatically grabbed and forwarded to Loki. Here is a short summary of the annotated steps:

1. local.file_match discovers the files to read (the .json files in the logs directory),
2. loki.source.file reads them and forwards each entry to the processing pipeline,
3. loki.process defines the processing stages and forwards the result to the Loki writer,
4. stage.json extracts the timestamp, application, instance and level fields from the JSON payload,
5. stage.labels promotes these fields as Loki labels,
6. stage.timestamp sets the log entry timestamp (RFC3339 format, falling back to Unix milliseconds),
7. loki.write pushes the processed logs to the Loki endpoint.

🛠️ Restart the Grafana Alloy collector:
$ docker compose restart collector
👀 Open a browser page to the Alloy UI (port 12345):
- Go to http://localhost:12345/graph,
- Look at the loki.source.file component named jsonlogfiles.
🛠️ Go to Grafana (port 3000) and open an Explore dashboard.

Grafana offers a form to help you build your queries:
You can also use a dedicated query language to make your queries directly: this is named LogQL.
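For instance, a minimal LogQL sketch using the application label we configured in Alloy (the query text is an illustration, not part of the workshop files); the stream selector filters on labels, and the json stage parses the payload fields:

```logql
{application="easypay-service", level="ERROR"} | json
```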
🛠️ Let's get logs from the easypay-service:
- In Label filter, select the application easypay-service,
- Click on Run Query.

Now we want to display logs in a table format with the timestamp, the log level and the message.
In order to do that, we should extract the message from the JSON as it is not done automatically.
🛠️ Modify the query such as:
- Keep the Label filters for the application easypay-service,
- Click on + Operations and select Formats > Json: this will let us parse the log as JSON,
- Click on + Expression to extract the JSON message field,
- Type message="message" to extract the value in a field with the same name,
- Click on Run query and unfold a log: you should see a new field named message in the Fields group.

🛠️ Time to format:
- Select Table just below the Logs volume graph, to the right,
- Select the Time, level and message fields.

🛠️ Do not hesitate to hit the easypay payment endpoint with curl/httpie or k6 to generate some logs (whichever you prefer):
http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000
# OR
k6 run -u 1 -d 2m k6/01-payment-only.js
👀 You can also view logs for the other services (e.g., api-gateway).
👀 You can also search logs based on the correlation ID.
Maybe another issue? Do you see the card numbers? 🚨
For compliance and to prevent personal data loss, we will obfuscate the card number in the logs.
🛠️ In the Alloy configuration file (docker/alloy/config.alloy), add a luhn stage at the end of the loki.process "jsonlogs" block definition:
loki.process "jsonlogs" {
//...
stage.luhn {
min_length = 13
replacement = "**MASKED**"
}
}
🛠️ Restart the collector:
$ docker compose restart collector
🛠️ Generate some logs with curl/httpie or k6.
✅ Check that the card numbers are now obfuscated with the **MASKED** content.
Let's take control of our application's metrics!
🛠️ Check out the easypay-service metrics definitions exposed by Spring Boot Actuator first:
$ http :8080/actuator/metrics
👀 Explore the output.
🛠️ Now get the Prometheus metrics using this command:
$ http :8080/actuator/prometheus
This is an endpoint exposed by Actuator to let the Prometheus server get your application metrics.
👀 Check out the Prometheus configuration file (docker/prometheus/prometheus.yml). All the scrapers' definitions are configured here.
For instance, here is the configuration of the config-server
:
- job_name: prometheus-config-server
scrape_interval: 5s
scrape_timeout: 5s
metrics_path: /actuator/prometheus
static_configs:
- targets:
- config-server:8890
You can see it uses the endpoint we looked into earlier under the hood. But it is a static configuration: we should tell Prometheus where to look for metrics...
Fortunately, Prometheus is also able to query our discovery-server (Eureka service discovery) to discover all the services plugged into our application. It will then scrape them in the same way:
# Discover targets from Eureka and scrape metrics from them
- job_name: eureka-discovery
scrape_interval: 5s
scrape_timeout: 5s
eureka_sd_configs:
- server: http://discovery-server:8761/eureka (1)
refresh_interval: 5s
relabel_configs: (2)
- source_labels: [__meta_eureka_app_instance_metadata_metrics_path]
target_label: __metrics_path__
1. Prometheus queries the discovery-server to explore all the metrics of the underlying systems,
2. The relabel configuration rewrites the metrics path based on the Eureka instance metadata.

👀 You can have an overview of all the scraped applications on the Prometheus dashboard:
- Open http://localhost:9090 if you started the stack locally, or use the link provided by GitPod in the PORTS view for port 9090,
- Go to Status > Targets,
- Look at the eureka-discovery section: easypay-service...

📝 Modify the docker/prometheus/prometheus.yml file to add a new configuration to scrape the easypay service. You can use the prometheus-config-server configuration as a model:
- Job name: prometheus-easypay-service,
- Scrape interval and timeout: 5s,
- Metrics path: /actuator/prometheus,
- Target: easypay-service:8080.
That should give you the following yaml configuration:
- job_name: prometheus-easypay-service
scrape_interval: 5s
scrape_timeout: 5s
metrics_path: /actuator/prometheus
static_configs:
- targets:
- easypay-service:8080
🛠️ Restart the Prometheus server to take this new configuration into account:
$ docker compose restart prometheus
👀 Now explore the targets again (Status > Targets) on the Prometheus dashboard (port 9090).
🛠️ Go to Grafana and open an Explore dashboard again.
🛠️ Select the Prometheus data source instead of the Loki one.
In this section you will get hands-on with Grafana's metrics query builder.
The Metric
field lists all the metrics available in the Prometheus server: take time to explore them.
🛠️ For example, you can select the metric named jvm_memory_used_bytes, and click on the Run query button to plot the memory usage of all your services by memory area.
🛠️ If you want to plot the total memory usage of your services:
- Click on Operations and select Aggregations > Sum, then Run query: you obtain the whole memory consumption of all your JVMs,
- Click on the By label button and select the label named application (do not forget to click on Run query after that).

🛠️ You can also filter the metrics to display using Label filters: try to create a filter to display only the metrics related to the application named easypay-service.
With Grafana, you can either create your own dashboards or import some provided by the community from Grafana's website.
We will choose the second solution right now and import the following dashboards:
- 12271,
- 20729.

🛠️ To import these dashboards:
- Go to Grafana (port 3000), and select the Dashboards section on the left,
- Click on New (top right), and click on Import,
- In the Find and import... field, just paste the ID of the dashboard and click on Load,
- In the Select a Prometheus data source field, select Prometheus and click on Import.

👀 Explore the JVM Micrometer dashboard: it works almost out of the box.
It contains a lot of useful information about JVMs running our services.
The application filter (top of the dashboard) lets you select the service whose metrics you want to explore.
🛠️ Now let's simulate some traffic using Grafana k6. Run the following command:
$ k6 run -u 5 -d 2m k6/02-payment-smartbank.js
👀 Go back to the Grafana dashboards, click on Dashboards and select JVM Micrometer:
- Look at the metrics of easypay-service, especially the Garbage collector and CPU statistics,
- Then open the Spring Boot JDBC & HikariCP dashboard and see what happens on the database connection pool for easypay-service.

We were talking about an incident, weren't we?
👀 Let's go back to the Explore view of Grafana, select Loki as a data source and see what happens!
🛠️ Create a query with the following parameters to get error logs of the smartbank-gateway service:
- Label filter: application = smartbank-gateway,
- Json parser expression: level="level",
- Label filter expression: label = level; operator = =~; value = WARN|ERROR.
and check out the logs.
Normally you would get a java.lang.OutOfMemoryError
due to a saturated Java heap space.
👀 To get additional insights, you can go back to the JVM dashboard and select the smartbank-gateway application.
Normally you will see the used JVM Heap reaching the maximum allowed.
Observability is not only about incidents. You can also define your own metrics.
Micrometer, the framework used by Spring Boot Actuator to expose metrics, provides an API to easily create your own metrics, such as counters, timers, and gauges.
Let's go back to our code!
We want to add new metrics to the easypay service to measure the payment processing and store times.
So we target a metric of Timer type.
In order to achieve this goal, we will measure the time spent in the two methods process
and store
of the com.worldline.easypay.payment.control.PaymentService
class of the easypay-service
module.
This class is the central component responsible for processing payments: it provides the public accept method, which delegates its responsibility to two private ones:
- process: does all the processing of the payment: validation, calling third parties...
- store: saves the processing result in database.

We also want to count the number of payment requests processed by our system. We will use a metric of Counter type.
You can take a look at Micrometer's documentation about Timers.

We need to declare two timers in our code:
- processTimer to record the rivieradev.payment.process metric: it represents the payment processing time by recording the time spent in the process method,
- storeTimer to record the rivieradev.payment.store metric: it represents the time required to store a payment in database by recording the time spent in the store method.

📝 Let's modify the com.worldline.easypay.payment.control.PaymentService class to declare them:
// ...
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
@Service
public class PaymentService {
// ...
private Timer processTimer; (1)
private Timer storeTimer;
public PaymentService(
// ...,
MeterRegistry meterRegistry) { (2)
// ...
processTimer = Timer (3)
.builder("rivieradev.payment.process") (4)
.description("Payment processing time") (5)
.register(meterRegistry); (6)
storeTimer = Timer
.builder("rivieradev.payment.store")
.description("Payment store time")
.register(meterRegistry);
}
1. Declare the two timers as fields of the class,
2. Inject the MeterRegistry provided by Spring Boot Actuator in the class constructor, as it is required to initialize the timers,
3. Create each timer with its builder,
4. Give it a name (this will be the metric name),
5. Add a description,
6. Register it against the meter registry.

The Timer API allows recording blocks of code. That's the approach we'll take:
timer.record(() -> {
// some code
});
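If the timed code must return a value, record also accepts a Supplier; a minimal sketch (doProcess is a hypothetical helper, not part of the workshop code):

```java
// record() returns whatever the supplied lambda returns
boolean authorized = processTimer.record(() -> doProcess(context));
```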
📝 Let's modify our process and store methods to record our latency with the new metrics. We can simply wrap our original code in a Java Runnable functional interface:
// ...
private void process(PaymentProcessingContext context) {
processTimer.record(() -> { (1)
if (!posValidator.isActive(context.posId)) {
context.responseCode = PaymentResponseCode.INACTIVE_POS;
return;
}
// ...
});
}
private void store(PaymentProcessingContext context) {
storeTimer.record(() -> { (2)
Payment payment = new Payment();
// ...
});
}
1. Modify the process method to wrap all its content into a Runnable consumed by the record method of our processTimer timer,
2. Do the same in the store method with the storeTimer timer.

📝 Let's do the same for the counter:
// ...
import io.micrometer.core.instrument.Counter;
@Service
public class PaymentService {
//...
private Counter requestCounter; (1)
public PaymentService(
//...
) {
// ...
requestCounter = Counter (2)
.builder("rivieradev.payment.requests")
.description("Payment requests counter")
.register(meterRegistry);
}
👀 The accept method of the PaymentService class is invoked for each payment request; it is a good candidate for incrementing our counter:
@Transactional(Transactional.TxType.REQUIRED)
public void accept(PaymentProcessingContext paymentContext) {
requestCounter.increment(); (1)
process(paymentContext);
store(paymentContext);
paymentTracker.track(paymentContext);
}
🛠️ Rebuild the easypay-service:
$ docker compose build easypay-service
🛠️ Redeploy easypay:
$ docker compose up -d easypay-service
🛠️ Once easypay is started (you can check the logs with the docker compose logs -f easypay-service command and wait for an output like Started EasypayServiceApplication in 32.271 seconds):

- Generate some payments:

$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000

- Get into the container and query the actuator/prometheus endpoint to look at our new metrics:

$ docker compose exec -it easypay-service sh
/ $ curl http://localhost:8080/actuator/prometheus | grep riviera
You should get this kind of output:
# HELP rivieradev_payment_process_seconds Payment processing time
# TYPE rivieradev_payment_process_seconds summary
rivieradev_payment_process_seconds_count{application="easypay-service",...} 4
rivieradev_payment_process_seconds_sum{application="easypay-service",...} 1.984019362
# HELP rivieradev_payment_process_seconds_max Payment processing time
# TYPE rivieradev_payment_process_seconds_max gauge
rivieradev_payment_process_seconds_max{application="easypay-service",...} 1.927278528
# HELP rivieradev_payment_store_seconds Payment store time
# TYPE rivieradev_payment_store_seconds summary
rivieradev_payment_store_seconds_count{application="easypay-service",...} 4
rivieradev_payment_store_seconds_sum{application="easypay-service",...} 0.299205989
# HELP rivieradev_payment_store_seconds_max Payment store time
# TYPE rivieradev_payment_store_seconds_max gauge
rivieradev_payment_store_seconds_max{application="easypay-service",...} 0.291785969
# HELP rivieradev_payment_requests_total Payment requests counter
# TYPE rivieradev_payment_requests_total counter
rivieradev_payment_requests_total{application="easypay-service",...} 4.0
When using a Timer, you get three metrics by default, suffixed with:
- _count: the number of hits,
- _sum: the sum of time spent in the method,
- _max: the maximum time spent in the method.

In particular, we can get the average time spent in the method by dividing the sum by the count.
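For instance, a PromQL sketch (metric names taken from the output above) computing the average processing time over the last minute:

```promql
rate(rivieradev_payment_process_seconds_sum[1m])
/
rate(rivieradev_payment_process_seconds_count[1m])
```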
Finally, our Counter
is the last metric suffixed with _total
.
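To turn this ever-increasing counter into a hits-per-second rate (as the Easypay Monitoring dashboard does later in this workshop), a PromQL sketch:

```promql
rate(rivieradev_payment_requests_total[1m])
```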
As we are talking about latencies, you may also be interested in histograms to get the distribution of events per bucket, or in percentile values (the famous 0.99, 0.999...). Fortunately, Timers can compute histograms and percentiles!
📝 Modify the two timers as follows:
// ...
@Service
public class PaymentService {
// ...
public PaymentService(
// ...
) {
// ...
processTimer = Timer
.builder("rivieradev.payment.process")
.description("Payment processing time")
.publishPercentileHistogram() (1)
.publishPercentiles(0.5, 0.90, 0.95, 0.99, 0.999) (2)
.register(meterRegistry);
storeTimer = Timer
.builder("rivieradev.payment.store")
.description("Payment store time")
.publishPercentileHistogram()
.publishPercentiles(0.5, 0.90, 0.95, 0.99, 0.999)
.register(meterRegistry);
}
1. Configure the Timer to publish a histogram, allowing the computation of aggregable percentiles server-side,
2. Publish additional percentiles (0.5, 0.90, 0.95, 0.99, 0.999) computed client-side.

🛠️ Repeat step 4 (redeploy easypay-service and get the Prometheus metrics in-container).
You should get way more metrics, especially a new type suffixed with _bucket:
# HELP rivieradev_payment_process_seconds Payment processing time
# TYPE rivieradev_payment_process_seconds histogram
rivieradev_payment_process_seconds_bucket{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local",le="0.001"} 0
// ...
rivieradev_payment_process_seconds_bucket{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local",le="30.0"} 0
rivieradev_payment_process_seconds_bucket{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local",le="+Inf"} 0
rivieradev_payment_process_seconds_count{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local"} 0
rivieradev_payment_process_seconds_sum{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local"} 0.0
# HELP rivieradev_payment_process_seconds_max Payment processing time
# TYPE rivieradev_payment_process_seconds_max gauge
rivieradev_payment_process_seconds_max{application="easypay-service",instance="easypay-service:a44149cd-937a-4e96-abc2-0770343e49bc",namespace="local"} 0.0
ℹ️ Each bucket contains the number of events that last less than the value defined in the le tag.
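With these buckets, percentiles can be computed server-side; for example, a PromQL sketch for the 95th percentile of the processing time over the last 5 minutes:

```promql
histogram_quantile(0.95, sum by (le) (rate(rivieradev_payment_process_seconds_bucket[5m])))
```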
🛠️ Go back to Grafana (port 3000), and go into the Dashboards section.
🛠️ We will import the dashboard defined in the docker/grafana/dashboards/easypay-monitoring.json file:
- Click on New (top right), and select Import,
- In the Import via dashboard JSON model field, paste the content of the easypay-monitoring.json file and click on Load.

You should be redirected to the Easypay Monitoring dashboard.
It provides some panels we have created from the new metrics you exposed in your application:
- Payment request count total (rated): represents the number of hits per second in our application, computed from our counter,
- Payment Duration distribution: represents the various percentiles of our application, computed from the rivieradev_payment_process timer and its histogram,
- Requests process performance and Requests store performance: visualizations of the buckets of the two timers we created previously.

🛠️ You can generate some load to view your dashboards evolving live:
$ k6 run -u 2 -d 2m k6/01-payment-only.js
In this section, we'll explore distributed tracing, the third pillar of application observability.
Distributed tracing is an essential tool for monitoring and analyzing the performance of complex applications. It tracks the flow of requests across multiple services and components, helping to identify bottlenecks and improve efficiency, which is particularly useful for intricate systems like Easypay.
With Spring Boot, there are a couple of approaches to incorporate distributed tracing into your application:
For this workshop, we'll use the Java Agent method and, with a focus on Grafana, we will employ their version of the OpenTelemetry Java Agent.
The Grafana Alloy collector will be used once again, tasked with receiving traces and forwarding them to the Tempo backend.
Lastly, we will use Grafana to examine and interpret these traces, allowing us to better understand and optimize our application's performance.
To capture the entire transaction across all services in a trace, it's essential to instrument all the services in our application.
If you're using GitPod, the Java Agent should already be available as the instrumentation/grafana-opentelemetry-java.jar file.
🛠️ If you are participating in this workshop on your workstation, or if the file is missing, you can run the following script to download it:
$ bash -x scripts/download-agent.sh
📝 Since we are deploying the easypay-service using Docker, we need to modify the last lines of the easypay-service/src/main/docker/Dockerfile:
# ...
USER javauser
# Copy Java Agent into the container
COPY instrumentation/grafana-opentelemetry-java.jar /app/grafana-opentelemetry-java.jar
# Add the -javaagent flag to set up the JVM to start with our Java Agent
ENTRYPOINT ["java", "-javaagent:/app/grafana-opentelemetry-java.jar", "-cp","app:app/lib/*","com.worldline.easypay.EasypayServiceApplication"] # (2)
The ENTRYPOINT
instruction specifies the default command that will be executed when the container starts.
🛠️ You can now build the updated easypay-service container image:
$ docker compose build easypay-service
It's time to set up Grafana Alloy for handling telemetry data. We will configure it to accept traces through the OpenTelemetry GRPC protocol (OTLP) on port 4317
, and then forward them to Grafana Tempo, which listens on the host tempo
on the same port 4317
(this setup specifically handles OTLP traces).
📝 Please add the following configuration to the docker/alloy/config.alloy file:
// ...
// RECEIVER SETUP (OTLP GRPC) (1)
otelcol.receiver.otlp "default" {
grpc {
endpoint = "0.0.0.0:4317"
}
output {
traces = [otelcol.processor.batch.default.input]
}
}
// BATCH PROCESSING FOR OPTIMIZATION (2)
otelcol.processor.batch "default" {
output {
traces = [otelcol.exporter.otlp.tempo.input]
}
}
// TRACE EXPORTING TO TEMPO (OTLP) (3)
otelcol.exporter.otlp "tempo" {
client {
endpoint = "tempo:4317"
tls {
insecure = true
}
}
}
1. Set up the otelcol.receiver.otlp receiver to accept telemetry data over the OTLP protocol via gRPC, listening on port 4317,
2. Set up a batch processor to optimize the telemetry data before export,
3. Set up the otelcol.exporter.otlp exporter to send the collected telemetry data to the Grafana Tempo service.

ℹ️ The Grafana OpenTelemetry Java Agent is pre-configured to transmit telemetry data directly to the collector. This setup is facilitated through environment variables specified in the compose.yml file:
services:
# ...
easypay-service:
# ..
environment:
# ...
OTEL_SERVICE_NAME: easypay-service (1)
OTEL_EXPORTER_OTLP_ENDPOINT: http://collector:4317 (2)
OTEL_EXPORTER_OTLP_PROTOCOL: grpc (3)
# ...
1. OTEL_SERVICE_NAME defines a service name which will be attached to traces to identify the instrumented service,
2. The OTEL_EXPORTER_OTLP_ENDPOINT environment variable configures where the telemetry data should be sent,
3. OTEL_EXPORTER_OTLP_PROTOCOL sets the underlying OTLP transport, here gRPC (HTTP is also possible).

🛠️ To apply the new settings, restart Grafana Alloy with the following command:
$ docker compose restart collector
✅ After restarting, verify that Grafana Alloy is up and running with the updated configuration by accessing the Alloy dashboard on port 12345.
🛠️ Redeploy the updated easypay-service:
$ docker compose up -d easypay-service
✅ To ensure easypay-service has started up correctly, check its logs with:
$ docker compose logs -f easypay-service
✅ If the Java agent was correctly taken into account, the logs should start with:
easypay-service | OpenJDK 64-Bit Server VM warning: Sharing is only supported for boot loader classes because bootstrap classpath has been appended
easypay-service | [otel.javaagent 2024-07-04 15:50:40:219 +0000] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 2.4.0
easypay-service | [otel.javaagent 2024-07-04 15:50:40:599 +0000] [main] INFO com.grafana.extensions.instrumentations.TestedInstrumentationsContext - Grafana OpenTelemetry Javaagent: version=2.4.0-beta.1, includeAllInstrumentations=true, useTestedInstrumentations=false, includedUntestedInstrumentations=[], excludedInstrumentations=[]
easypay-service | [otel.javaagent 2024-07-04 15:50:41:114 +0000] [main] INFO io.opentelemetry.sdk.resources.Resource - Attempting to merge Resources with different schemaUrls. The resulting Resource will have no schemaUrl assigned. Schema 1: https://opentelemetry.io/schemas/1.24.0 Schema 2: https://opentelemetry.io/schemas/1.23.1
🛠️ Generate some load on the application to produce traces:
$ k6 run -u 1 -d 5m k6/01-payment-only.js
🛠️ Let's explore your first traces in Grafana:
Explore
dashboard,Tempo
data source and click on Run query
to refresh the view.π Click on Service Graph
and explore the Node graph
: this view is extremely helpful for visualizing and understanding how our services communicate with each other.
π Go back to Search
and click on Run query
. You should see a table named Table - Traces
. By default, this view provides the most recent traces available in Tempo.
π οΈ Let's find an interesting trace using the query builder:
easypay-service
with a duration greater than 50 ms: POST easypay-service
trace
>
50ms
Run query
.Duration
(click on the column name) to find the slowest trace.Trace ID
.You should see the full stack of the corresponding transaction.
π Grafana should open a new view (you can enlarge it by clicking on the three vertical dots and selecting Widen pane
):
api-gateway
to easypay-service
and easypay-service
to smartbank-gateway
).easypay-service
to fraudetect-service
and merchant-backoffice
).π οΈ Grafana allows to display a graph of spans as interconnected nodes:
Additional settings
,Enable node graph
option.Node graph
to get a graphical view of all the spans participating in the trace.π οΈ Continue your exploration in the Search
pane:
Status
=
error
filter to see only traces that contain errors,NullPointerException
.When we instrument our services using the agent, every interaction, including Prometheus calls to the actuator/prometheus
endpoint, is recorded.
In the Service Graph, you should have seen links between the User node and services other than the api-gateway. That seems wrong: we only created payments through the api-gateway!
👀 If you click on the link and select View traces, you should see a lot of traces regarding actuator/prometheus.
To avoid storing unnecessary data in Tempo, we can sample the data in two ways:
- Head Sampling: the decision to sample a trace is made at its very beginning,
- Tail Sampling: the decision is made at the end of the trace, once all its spans are available.

In this workshop, we will implement Tail Sampling.
📝 Modify the Alloy configuration file (docker/alloy/config.alloy) as follows:
// ...
// RECEIVER (OTLP)
otelcol.receiver.otlp "default" {
grpc {
endpoint = "0.0.0.0:4317"
}
output {
traces = [otelcol.processor.tail_sampling.actuator.input] // (1)
}
}
// TAIL SAMPLING (2)
otelcol.processor.tail_sampling "actuator" {
// Filter on http.url attribute (3)
policy {
name = "filter_http_url"
type = "string_attribute"
string_attribute {
key = "http.url"
values = ["/actuator/health", "/actuator/prometheus"]
enabled_regex_matching = true
invert_match = true
}
}
// Filter on url.path attribute (3)
policy {
name = "filter_url_path"
type = "string_attribute"
string_attribute {
key = "url.path"
values = ["/actuator/health", "/actuator/prometheus"]
enabled_regex_matching = true
invert_match = true
}
}
output {
traces = [otelcol.processor.batch.default.input] // (4)
}
}
1. Update the output of otelcol.receiver.otlp to export traces to the otelcol.processor.tail_sampling component defined just after,
2. Declare the otelcol.processor.tail_sampling component,
3. Define policies that drop spans whose http.url or url.path attribute matches the actuator endpoints,
4. Forward the sampled traces to the otelcol.processor.batch processor we defined previously.

This configuration will filter out the spans created by /actuator API calls.
🛠️ Restart the Alloy collector:
$ docker compose restart collector
Starting from this moment, you should no longer see traces related to actuator/health
or actuator/prometheus
endpoints.
Just like metrics, it is also possible to add your own spans on arbitrary methods to provide more business value to the observability of your application.
Let's return to our code!
We want to add new spans to the traces generated in the easypay-service
application to track payment processing and store events.
To achieve this goal, we will create new spans when the process
and store
methods of the com.worldline.easypay.payment.control.PaymentService
class in the easypay-service
module are invoked.
As a reminder, this class is the central component responsible for processing payments. It provides the public method accept
, which delegates its responsibilities to two private methods:
- process: which handles all the processing of the payment, including validation and calling third parties,
- store: which saves the processing result in the database.

We need to add the io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations dependency to our module to access some useful annotations.
👀 This has already been done in advance for this workshop. The following dependencies were added to the Gradle build file (build.gradle.kts) of the easypay-service module:
dependencies {
//...
// Add opentelemetry support
implementation(platform("io.opentelemetry:opentelemetry-bom:1.38.0"))
implementation("io.opentelemetry:opentelemetry-api")
implementation("io.opentelemetry.instrumentation:opentelemetry-instrumentation-annotations:2.5.0")
// ...
}
📝 To add new spans based on methods, we can simply use the @WithSpan Java annotation. When a traced transaction invokes the annotated method, a new span will be created. Here's how to do it:
// ...
import io.opentelemetry.instrumentation.annotations.WithSpan;
@Service
public class PaymentService {
// ...
@WithSpan("RivieraDev: Payment processing method")
private void process(PaymentProcessingContext context) {
//...
}
@WithSpan("RivieraDev: Payment store method")
private void store(PaymentProcessingContext context) {
//...
}
📝 We can also provide additional information to the span, such as method parameters, using the @SpanAttribute annotation:
// ...
import io.opentelemetry.instrumentation.annotations.SpanAttribute;
@Service
public class PaymentService {
// ...
@WithSpan("RivieraDev: Payment processing method")
private void process(@SpanAttribute("context") PaymentProcessingContext context) { // <-- HERE
// ...
}
@WithSpan("RivieraDev: Payment store method")
private void store(@SpanAttribute("context") PaymentProcessingContext context) { // <-- HERE
// ...
}
This will attach the whole PaymentProcessingContext to the span.
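Beyond annotations, attributes can also be set programmatically on the current span using the OpenTelemetry API; a sketch (the attribute key is illustrative):

```java
import io.opentelemetry.api.trace.Span;

// Inside code running within a recorded span:
Span.current().setAttribute("payment.pos_id", context.posId);
```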
🛠️ As we did before:
$ docker compose build easypay-service
$ docker compose up -d easypay-service
🛠️ Generate some payments:
$ http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000
👀 Go back to Grafana and try to find your new traces using what you've learned previously. Observe the spans you added.
Grafana allows correlation between all our telemetry data.
When discussing observability, correlation is essential. It enables you to diagnose the root cause of an issue quickly by using all the telemetry data involved.
🛠️ Let's go back to the Grafana Explore dashboard:
- Select the Loki data source,
- Filter logs for the application easypay-service.

👀 Now check to see if there is a mdc JSON element that includes both the trace_id and span_id. These will help us correlate our different request logs and traces.
🛠️ In Grafana, go to Connections > Data sources:
- Select the Loki data source,
- Add a Derived fields configuration:
  - Name: TraceID
  - Type: Regex in log line
  - Regex: "trace_id":"(\w+)"
  - Query: ${__value.raw}
  - URL Label: View Trace
  - Enable Internal Link and select Tempo.
✅ To validate the configuration, you can put an example log message in this view:
- Click on Show example log message,
- Paste the following JSON log:
,{"sequenceNumber":0,"timestamp":1719910676210,"nanoseconds":210150314,"level":"INFO","threadName":"kafka-binder-health-1","loggerName":"org.apache.kafka.clients.NetworkClient","context":{"name":"default","birthdate":1719907122576,"properties":{"applicationName":"easypay-service","instance":"easypay-service:9a2ac3f0-c41e-4fcd-8688-123993f1d5db"}},"mdc": {"trace_id":"8b277041692baa8167de5c67977d6571","trace_flags":"01","span_id":"13ff9e44be450b8e"},"message":"[Consumer clientId=consumer-null-1, groupId=null] Node -1 disconnected.","throwable":null}
It should display a table with:
- Name: TraceID
- Value: the trace ID from the log message
- Url: the same trace ID
: the same trace IDπ οΈ Go back to the Grafana Explore
dashboard and try to find the same kind of log message:
Fields
and Links
sections,View Trace
👀 Grafana should open a pane with the corresponding trace from Tempo!
Now you can correlate logs and traces!
If you encounter any exceptions in your error logs, you can now see where it happens and get the bigger picture of the transaction from the customer's point of view.
First of all, logs should contain the trace_id information.
Most frameworks handle this for you. Whenever a request generates a trace or span, the value is placed in the MDC (Mapped Diagnostic Context) and can be printed in the logs.
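For instance, with Logback you could also surface these MDC values in a plain-text pattern (a sketch using Logback's standard %X syntax; the workshop itself uses JSON logs):

```
%d{ISO8601} %-5level [%X{trace_id},%X{span_id}] %logger - %msg%n
```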
On the other hand, Grafana has the ability to parse logs to extract certain values for the Loki data source. This is the purpose of Derived fields
.
When configuring the Loki data source, we provided Grafana with the regex to extract the trace ID from logs and linked it to the Tempo data source. Behind the scenes, Grafana creates the bridge between the two telemetry data sources. And that's all 🙂
Exemplars are annotations used in metrics that link specific occurrences, like logs or traces, to data points within a metric time series. They provide direct insights into system behaviors at moments captured by the metric, aiding quick diagnostics by showing related trace data where anomalies occur. This feature is valuable for debugging, offering a clearer understanding of metrics in relation to system events.
🛠️ Generate some load towards the easypay-service:
$ k6 run -u 1 -d 2m k6/01-payment-only.js
👀 Now, let's see how exemplars are exported by our service:
- Get into the easypay-service container:

$ docker compose exec -it easypay-service sh

- Query the Prometheus endpoint in the OpenMetrics format:
$ curl http://localhost:8080/actuator/prometheus -H 'Accept: application/openmetrics-text' | grep 'trace_id'
You should obtain metrics with the following format:
http_server_requests_seconds_bucket{application="easypay-service",error="none",exception="none",instance="easypay-service:39a9ae31-f73a-4a63-abe5-33049b8272ca",method="GET",namespace="local",outcome="SUCCESS",status="200",uri="/actuator/prometheus",le="0.027962026"} 1121 # {span_id="d0cf53bcde7b60be",trace_id="969873d828346bb616dca9547f0d9fc9"} 0.023276118 1719913187.631
The interesting part starts after the # character; this is the so-called exemplar:
SPAN ID TRACE ID VALUE TIMESTAMP
# {span_id="d0cf53bcde7b60be",trace_id="969873d828346bb616dca9547f0d9fc9"} 0.023276118 1719913187.631
That could be translated as:
- The easypay-service handled an HTTP request,
- The request is linked to the trace 969873d828346bb616dca9547f0d9fc9,
- It took 0.023276118 seconds,
- It happened at timestamp 1719913187.631.
👀 Exemplars can be analyzed in Grafana:
- Go to the Explore view,
- Select the Prometheus data source,
- Switch to the Code mode (button on the right),
- Enter the query: http_server_requests_seconds_count{application="easypay-service", uri="/payments"}
- Unfold the Options section and enable Exemplars,
- Click on Run query.
👀 In addition to the line graph, you should see square dots at the bottom of the graph:
- Hover over one of them: it displays the exemplar's data, including its trace_id
🛠️ In Grafana, go to the Connections > Data sources section:
- Select the Prometheus data source,
- Configure Exemplars:
  - Enable Internal link and select the Tempo data source,
  - URL Label: Go to Trace,
  - Label name: trace_id (as displayed in the exemplar values),
- Click on Save & test.
dashboard and try to find the same exemplar as before:
Go to Trace
next to the trace_id
label,π Grafana should open a new pane with the corresponding trace from Tempo!
We have added a new correlation dimension to our system between metrics and traces!
Regardless of the integration in Grafana and the setup in the Prometheus data source, we had to configure our application to make Micrometer expose exemplars with trace identifiers.
Firstly, we added the following dependencies to our application (easypay-service/build.gradle.kts), especially io.prometheus:prometheus-metrics-tracer-otel-agent, which provides the necessary classes:
dependencies {
// ...
implementation(platform("io.opentelemetry:opentelemetry-bom:1.38.0"))
implementation("io.opentelemetry:opentelemetry-api")
implementation("io.prometheus:prometheus-metrics-tracer-otel-agent:1.3.1")
}
Then, in our Spring Boot application, we added a new class, annotated with @Configuration
, that provides an OpenTelemetryAgentSpanContext
bean. This bean is able to retrieve the current trace ID from distributed tracing. Here is the implementation of com.worldline.easypay.config.PrometheusRegistryConfiguration.java
:
package com.worldline.easypay.config;
import org.springframework.boot.autoconfigure.condition.ConditionalOnClass;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import io.prometheus.metrics.tracer.otel_agent.OpenTelemetryAgentSpanContext;
@Configuration // (1)
public class PrometheusRegistryConfiguration {
@Bean
@ConditionalOnClass(name="io.opentelemetry.javaagent.shaded.io.opentelemetry.api.trace.Span") // (2)
public OpenTelemetryAgentSpanContext exemplarConfigSupplier() {
return new OpenTelemetryAgentSpanContext(); // (3)
}
}
1. Declare a configuration class providing beans to the Spring context,
2. Create the bean only if the OpenTelemetry agent's Span class is present on the classpath (i.e., the application is instrumented by the agent),
3. Expose the OpenTelemetryAgentSpanContext, which will be used by the Micrometer Prometheus registry to export exemplars based on the trace ID.

At this point, your application is ready to export exemplars with metrics once an OpenTelemetry Java agent is attached to the JVM!
We are able to link logs to traces thanks to the Loki data source's Derived fields, but what about traces to logs?
Fortunately, the Tempo data source can be configured the same way!
🛠️ In Grafana, go to Connections > Data sources:
- Select the Tempo data source,
- Configure Trace to logs as follows:
  - Data source: Loki,
  - Span start time shift: -5m,
  - Span end time shift: 5m,
  - Map the service.name tag to the application label,
  - Enable Filter by trace ID if and only if you want to show only the logs that match the trace ID,
- Click on Save & test.
ℹ️ Just to give you some context about this configuration:
- The time shifts widen the log search window around the span's timestamps,
- Grafana needs to know how to map the trace's service to the corresponding log stream: the tag mapping (service.name in Tempo and application in Loki) serves this purpose.

🛠️ Hit the easypay payment endpoint with curl or k6 to generate some traces (whichever you prefer):
http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000
k6 run -u 1 -d 2m k6/01-payment-only.js
🛠️ Open the Grafana Explore dashboard to find a trace:
- Query traces with {name="POST easypay-service"},
- Open one of the POST easypay-service traces.
👀 A new LOG icon should appear for each line of the trace (middle of the screen):
- Click on it to display the correlated logs,
- You can toggle the Filter by trace ID option in the Tempo data source configuration to see the difference.

Yeah, we have added a new dimension to the correlation of our telemetry data, further improving our observability.
The last correlation we will explore in today's workshop is between traces and metrics.
Sometimes, we are interested in knowing the state of our application when inspecting traces, such as JVM heap or system CPU usage.
In this section, we will configure the Tempo data source to link our traces to these metrics.
🛠️ In Grafana, go to Connections > Data sources:
- Select the Tempo data source,
- Configure Trace to metrics as follows:
  - Data source: Prometheus,
  - Span start time shift: -2m,
  - Span end time shift: 2m,
  - Map the service.name tag to the application label.
🛠️ Now, we will add some metric queries (click on + Add query for each query):
- Link label: Heap Usage (ratio), query: sum(jvm_memory_used_bytes{$__tags})/sum(jvm_memory_max_bytes{$__tags})
- Link label: System CPU Usage, query: system_cpu_usage{$__tags}
π οΈ Finally click on Save & test
and go back to Grafana Explore
dashboard to test our new setup.
🛠️ Hit the easypay payment endpoint with curl or k6 to generate some traces (whichever you prefer):
http POST :8080/api/easypay/payments posId=POS-01 cardNumber=5555567898780008 expiryDate=789456123 amount:=40000
k6 run -u 1 -d 2m k6/01-payment-only.js
🛠️ Open the Grafana Explore dashboard to find a trace:
- Query traces with {name="POST easypay-service"},
- Open one of the POST easypay-service traces.
👀 The previous LOG icon has been replaced by a link icon (middle of the screen):
- Click on it: you can now choose between Heap Usage (ratio), System CPU Usage and Related logs,
- Related logs behaves the same way as the previous LOG button.

This was the last correlation dimension we wanted to show you!
The OpenTelemetry project standardizes telemetry signals, particularly the logs, metrics, and traces we have seen so far.
However, just a few months ago, they announced their work on a fourth signal: profiling.
Profiling involves measuring the performance characteristics of your application, such as execution time, CPU utilization, or memory usage. It helps identify bottlenecks and optimize resource usage in your application.
If you're familiar with Java, you may already know async_profiler for HotSpot JVMs. It defines itself as a "low overhead sampling profiler for Java."
You may have also heard about eBPF, a technology embedded in the Linux kernel that allows running code in a sandbox within the kernel space. This technology is gaining traction in service meshes and in continuous profiling.
Continuous profiling is an ongoing area of interest in the observability field, aimed at finding additional performance improvements.
Pyroscope was an open-source project for continuous profiling. It consists of a server that receives profiling samples, which can then be analyzed and displayed as a flamegraph.
In the Java landscape, it offers a Java Agent based on async_profiler, compatible with other agents such as the OpenTelemetry agent. Phew!
In 2023, Grafana acquired Pyroscope and merged it with its own solution, Phlare. Welcome to Grafana Pyroscope!
If you want to know more about Continuous Profiling and what it can bring to you, you may want to check out the Grafana Pyroscope documentation.
In this section, we aim to show you how to set up continuous profiling on our application and how to explore the collected profiles.
🛠️ We will use the grafana/pyroscope container image: we already defined a pyroscope service in our compose.yml file, but it is not yet enabled. You can start it by enabling the continuous-profiling profile:
$ docker compose --profile=continuous-profiling up -d
✅ It should start a new service on port 4040.
🛠️ Go to the Grafana Pyroscope dashboard UI on port 4040 and explore it.
Let's use an agent again to profile our application.
🛠️ First, download the agent. You can use the provided script to download it as instrumentation/pyroscope.jar:
$ bash ./scripts/download-pyroscope-agent.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 9782k 100 9782k 0 0 10.2M 0 --:--:-- --:--:-- --:--:-- 13.9M
Grafana Pyroscope agent downloaded successfully in ./scripts/../instrumentation
📝 Just like for traces, we should modify the easypay-service/src/main/docker/Dockerfile file:
# ...
# Copy Java Agent into the container
COPY instrumentation/grafana-opentelemetry-java.jar /app/grafana-opentelemetry-java.jar
# Copy pyroscope Java Agent
COPY instrumentation/pyroscope.jar /app/pyroscope.jar
# Add the -javaagent flags to set up the JVM to start with both Java agents
ENTRYPOINT ["java", "-javaagent:/app/grafana-opentelemetry-java.jar", "-javaagent:/app/pyroscope.jar", "-cp","app:app/lib/*","com.worldline.easypay.EasypayServiceApplication"]
📝 Pyroscope can be configured with environment variables. Let's do it directly in the compose.yml file:
easypay-service:
# ...
environment:
# ...
PYROSCOPE_APPLICATION_NAME: easypay-service # (1)
PYROSCOPE_FORMAT: jfr # (2)
PYROSCOPE_PROFILER_EVENT: wall # (3)
PYROSCOPE_PROFILER_LOCK: 10ms # (4)
PYROSCOPE_PROFILER_ALLOC: 512k # (5)
PYROSCOPE_LOG_LEVEL: debug
PYROSCOPE_SERVER_ADDRESS: http://pyroscope:4040 # (6)
# ...
1. Define the application name (it will be used as the service_name label),
2. Set the format used to upload profiles (JFR),
3. Set the profiler event: wall allows recording the time spent in methods. Other valid values are itimer and cpu,
4. Enable lock profiling with a 10ms threshold,
5. Enable allocation profiling with a 512k threshold,
6. Set the Pyroscope server address.

🛠️ Rebuild and redeploy easypay-service:
$ docker compose build easypay-service
$ docker compose up -d easypay-service
✅ Check the logs for correct startup:
docker compose logs -f easypay-service
You should see additional logs related to Pyroscope.
👀 Go back to the Pyroscope dashboard (port 4040):
- Select the easypay-service application,
- Explore the generated flamegraph.

👀 Let's go to the Grafana Explore dashboard:
- Select the Pyroscope data source,
- Select the profile type wall > wall,
- Filter on easypay-service: {service_name="easypay-service"}.

🛠️ Generate some load with k6:
$ k6 run -u 1 -d 5m k6/02-payment-smartbank.js
👀 Try to find what is taking the most time in the ProcessPayment.processPayment method.