Licence: MIT Licence
Providing a strongly opinionated perspective derived from external recommendations and personal experience, this document describes how to implement logging effectively. It includes guidance on best practices, configuration, and conventions. This document is targeted at developers who are responsible for the logging of an application.
When we discuss logging, we refer to the process of recording information about the execution of a program. This information can be used for debugging or monitoring purposes. The logging actor should be the application itself, not the framework, or the libraries, or the user journeys. The audience should be developers and anyone who could understand and act upon the information logged.
Do not log any sensitive customer information (e.g. credentials, confidential/personal data, security tokens). If you must include any sensitive information, use trace level and never enable that level in a production environment, or in any environment that uses actual customer data. Failing to do so can result in data breaches and GDPR infringements, because confidential information leaks into systems that are not designed to protect or store such data. This applies to both frontend and backend systems.
If you are not completely sure whether the data is confidential/personal, do not log it. Some examples:
-
PII (Personally Identifiable Information)
-
Direct identifiers: Full name, social security number / national ID, driver’s license number, Passport number, personal email address, phone number, home address.
-
Indirect or quasi-identifiers (may not identify a person alone, but can if combined with others): date of birth, ZIP/Postal code, gender, employer, GPS location.
-
Special category personal attributes (per GDPR & similar laws): racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, sexual orientation, gender identity
-
-
PHI (Protected Health Information): medical history, diagnoses, insurance, lab results
-
FCI (Financial Customer Information): bank account numbers, credit card data, investment info.
Do not trust third-party libraries. If you are using a third-party library, do not assume that it will not log sensitive information. Use a safe level (e.g. warn) or disable the logging in the library. Some libraries log sensitive information even at error level (requests, responses…). Open an issue with the library maintainers and disable the library's logging until it is fixed.
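As a minimal sketch of "use a safe level or disable the logging in the library", assuming the library logs through java.util.logging (with SLF4J/Logback the same idea is usually expressed in the logging configuration file). The package name `com.example.thirdparty` and the class `ThirdPartyLogging` are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class ThirdPartyLogging {

    // Hypothetical package name of the external library.
    static final String LIBRARY_PACKAGE = "com.example.thirdparty";

    // java.util.logging only keeps weak references to loggers, so a strong
    // reference is held here to make sure the configured level is not lost.
    private static final Logger LIBRARY_LOGGER = Logger.getLogger(LIBRARY_PACKAGE);

    // Safe level: only warnings and errors coming from the library are kept.
    static void restrictToWarnings() {
        LIBRARY_LOGGER.setLevel(Level.WARNING);
    }

    // Full disconnect, e.g. while an upstream issue is still open.
    static void disable() {
        LIBRARY_LOGGER.setLevel(Level.OFF);
    }
}
```

Loggers created below that package (for example `com.example.thirdparty.http`) inherit the level unless they set their own.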
Do not log any secret information (e.g. passwords, tokens, keys). As in the previous section, if you must do so, use trace level, and never enable that level in a production environment. Failing to do so will result in leaking system secrets, which could lead to data breaches.
DTOs, entities and aggregates can have fields with personal or confidential information. Even if they are not logged explicitly, consider overriding the toString() method to prevent accidental data leakage.
We could consider three options:
-
Omit
-
Redact
-
Mask
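To make the three options concrete, here is a small sketch of what each could produce for the same object. The `Customer` record, its fields and the output formats are hypothetical, chosen only for illustration, and the email is assumed to be non-null.

```java
import java.time.LocalDate;

// Hypothetical record used only to compare the three options.
record Customer(Integer id, String email, LocalDate birthday) {

    // Omit: sensitive fields are simply not part of the output.
    String omitted() {
        return "Customer{id=" + id + "}";
    }

    // Redact: the field name is kept, the value is replaced by a fixed placeholder.
    String redacted() {
        return "Customer{id=" + id + ", email=[REDACTED]}";
    }

    // Mask: a small part of the value is kept so that entries can still be correlated.
    String masked() {
        int at = email.indexOf('@');
        String firstChar = at > 0 ? email.substring(0, 1) : "";
        String domain = at >= 0 ? email.substring(at) : "";
        return "Customer{id=" + id + ", email=" + firstChar + "***" + domain + "}";
    }
}
```

Note that masking still reveals part of the value (here the first character and the domain), which is one reason to prefer omitting.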
The recommended option is "omit". It is the simplest and most performant option, and the most resistant to refactoring.
Simple Java example:
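    import java.time.LocalDate;

    record User(Integer id, String name, String email, LocalDate birthday) {
        // Only the non-sensitive id is exposed; name, email and birthday are omitted.
        @Override
        public String toString() {
            return User.class.getSimpleName() + " {id=" + id + "}";
        }
    }

Using StringJoiner for complex scenarios: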
    import java.time.LocalDate;
    import java.util.StringJoiner;

    record User(Integer id, String name, String email, LocalDate birthday) {
        @Override
        public String toString() {
            // Useful when there are many fields
            return new StringJoiner(", ", User.class.getSimpleName() + "{", "}")
                .add("id=" + id)
                .toString();
        }
    }

Do not use logging as an audit replacement. Auditing is a separate process that should be implemented differently. Logging is for debugging and monitoring, while auditing is for compliance, security and user activity tracking.
-
They could have different retention policies.
-
They could have different storage systems.
-
They have different access policies: Auditing should be restricted to a few people because it could contain personal and confidential data, while logging should be accessible to a larger number of people.
-
The content is different: Auditing should contain information about user activity, while logging should contain information about system activity.
Do not use logging as a traceability replacement. Traceability is a separate process that should be implemented differently. Logging is for debugging and monitoring, while traceability is for tracking user journeys, system interactions, and business processes.
-
They could have different retention policies.
-
They could have different storage systems.
-
Traceability might not have what you are looking for. Usually, and especially in high traffic applications, not all journeys are recorded.
Do not use logging as a metrics replacement. Metrics are a separate process that should be implemented differently. Logging is for debugging and monitoring, while metrics are for measuring performance, usage, alerts, and other quantitative aspects of the system.
-
They could have different retention policies.
-
They could have different storage systems: Metrics are usually aggregated and stored in a time-series database, while logging is stored in a log management system.
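As an illustration of the difference, the sketch below counts an event instead of writing one log line per occurrence. `EventCounters` is a hypothetical in-process registry used only to show the idea; a real application would normally use its metrics library and export the values to a time-series database.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process counter registry, for illustration only.
final class EventCounters {

    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Called where one might be tempted to write logger.info("Order created").
    void increment(String eventName) {
        counters.computeIfAbsent(eventName, name -> new LongAdder()).increment();
    }

    // Aggregated value, cheap to store and to query over time.
    long count(String eventName) {
        LongAdder adder = counters.get(eventName);
        return adder == null ? 0L : adder.sum();
    }
}
```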
Logging can have an impact on performance, and it could lead to an increased cost (e.g. Cloud Services), especially if it is not implemented correctly. It is important to use logging judiciously and to avoid logging too much information. You could design the most performant logging system: async, non-blocking, etc., but if you log too much information, it will choke the system and degrade performance. In the end, every log will behave synchronously, or it will be dropped. Every log message has a cost, normally negligible, but use them wisely.
This section describes the purpose of the different log levels. For framework-specific documentation, see the references section.
-
fatal, critical: Mostly crashes and system failures (e.g. the service fails to start). Not available on every platform, and it usually writes to stderr. After writing at this level, the process usually ends.
-
error: Problems that have an impact on the normal behaviour of the application and could cause problems for customers. Frequently requires immediate action. It should contain as much non-redundant information as possible, but take care not to expose confidential data.
-
warn: Any warning that has no impact on the normal behaviour of the application or customers, but must be addressed by the clients or the application itself. It should include the information needed to trace the issue but remain succinct. For example, this level should capture API client validation errors, reverse engineering attempts, circuit breaker open events and retries. It often does not require immediate action.
-
info/config: For system events like initialisations and shutdowns. Use simple sentences. Do not use this level for logging every user interaction; there are other ways to monitor customer journeys, such as traceability, audit, metrics, or events. Otherwise, the log storage systems will be inundated with useless information, which could increase costs.
-
debug/fine: Any information that helps a developer to understand the flow of the application and debug issues. Do not include big chunks of data; use trace level instead.
-
trace/finer/finest: Extended information that helps to debug the application, including big chunks of data. If trace level is not available, use debug for the same purpose, but do not write personal or confidential data.
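The level names above come from different platforms. As a sketch, the generic names could be mapped onto java.util.logging levels as follows. The class `LevelNames` is hypothetical, and mapping trace to FINEST (rather than FINER) is an assumption made for this example.

```java
import java.util.Locale;
import java.util.logging.Level;

final class LevelNames {

    // Maps the generic level names used in this document to java.util.logging levels.
    static Level toJul(String name) {
        return switch (name.toLowerCase(Locale.ROOT)) {
            // java.util.logging has no dedicated fatal level, SEVERE is the highest one.
            case "fatal", "critical", "error" -> Level.SEVERE;
            case "warn" -> Level.WARNING;
            case "info" -> Level.INFO;
            case "config" -> Level.CONFIG;
            case "debug" -> Level.FINE;
            // Assumption: trace is mapped to the most detailed level.
            case "trace" -> Level.FINEST;
            default -> throw new IllegalArgumentException("Unknown level: " + name);
        };
    }
}
```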
An application that behaves normally should produce very few messages in production or even none.
Although info level is no replacement for other solutions and could have large operational costs, here are some examples of when it makes sense to use it:
-
Starting Services
-
"Starting service X on port Y"
-
"Email service started and waiting for events"
-
-
Shutting down services:
-
"Shutting down service X"
-
"Email service stopped"
-
-
Configuration Loaded or Reloaded (or any other configuration that is not sensitive):
-
"Production configuration loaded"
-
"Service started with 5 threads"
-
-
Administrative actions:
-
"Cache forced eviction by Administrator".
-
"New scheduler task created by Administrator"
-
-
Non-frequent events that are not errors:
-
"Batch job started"
-
"Batch job finished after processing 1000 records" (if there is an error, consider
warn or error level instead)
Following these practices will declutter your log, and make it easier to monitor the application and debug errors and warnings.
On controller/adapter logs:
-
Responses with HTTP status 5xx should write at least one error in the logs.
-
Responses with HTTP status 4xx should write warn, debug or trace logs. Logs of type warn should be succinct, or avoided altogether in high-traffic applications. You could use access logs instead.
-
Responses with HTTP status 2xx or 3xx should write at debug or trace level at most.
Other layers of the application could produce other logs like error, warn, debug or trace.
As you can see, there is no info level.
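The controller/adapter rules above can be sketched as a small mapping from HTTP status to log level. This uses java.util.logging level names (SEVERE for error, WARNING for warn, FINE for debug). The class `ResponseLogLevel` is hypothetical, and choosing warn for every 4xx response is an assumption; a high-traffic application might return FINE there instead.

```java
import java.util.logging.Level;

final class ResponseLogLevel {

    // Highest level a controller/adapter should use for a given HTTP status.
    static Level forStatus(int status) {
        if (status >= 500) {
            return Level.SEVERE;   // 5xx: at least one error
        }
        if (status >= 400) {
            return Level.WARNING;  // 4xx: warn (assumption; debug or trace are also valid)
        }
        return Level.FINE;         // 2xx and 3xx: debug at most, never info
    }
}
```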
If you have control over the library, it should use only debug and trace. Any exception produced inside the library should be thrown and documented, but not logged. The consumer of the library should consider if the exception should be logged or not. The usage and type of the exceptions are out of the scope of this document.
If you do not have control over the library, try to disable logging in the library or use a safe level (e.g. warn).
If the library/framework is used as a runtime (e.g. Spring Boot, Express.js, KoaJS), it could log some data at info level, like startups and shutdowns, for example from a web server or a database connection pool.
Note: Setting the level to info, for example, means that warn, error and fatal are also enabled.
-
Environments that use personal/confidential data or run performance tests: Use info (e.g. Production, Friends and Family, Partners acceptance…)
-
Any other environment: info, debug or trace. Consider costs and enable only relevant loggers (e.g. Preproduction, Development, QA, Integration, Local).
Some loggers/modules/packages will need adjustments: Instead of info, consider using warn or error for external libraries if they log too much information or sensitive data.
Structured logging is a way to log information in a structured format, such as JSON. This allows for easier parsing and analysis of log data by a machine. It is especially useful when working with large amounts of log data or when integrating with log management systems.
But structured logging may not be the best option in local environments, where the log is parsed by humans. In this case, it is better to use a human-readable format.
There are many ways to configure your app to log in different formats in different environments.
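One possible way, sketched here with java.util.logging only, is to pick the formatter per environment: a JSON-lines formatter for machines and the default human-readable one locally. `JsonLineFormatter`, the field names and the `"local"` environment name are assumptions for illustration. Stack traces and extra fields are left out to keep the sketch short, and real projects usually rely on the structured encoder shipped with their logging framework instead.

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;
import java.util.logging.SimpleFormatter;

final class JsonLineFormatter extends Formatter {

    // Writes one JSON object per line so that a log management system can parse it.
    @Override
    public String format(LogRecord entry) {
        return "{\"timestamp\":\"" + entry.getInstant() + "\""
                + ",\"level\":\"" + entry.getLevel().getName() + "\""
                + ",\"logger\":\"" + escape(entry.getLoggerName()) + "\""
                + ",\"message\":\"" + escape(formatMessage(entry)) + "\"}"
                + System.lineSeparator();
    }

    // Minimal JSON string escaping: quotes, backslashes and control characters.
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            switch (c) {
                case '"' -> out.append("\\\"");
                case '\\' -> out.append("\\\\");
                case '\n' -> out.append("\\n");
                case '\r' -> out.append("\\r");
                case '\t' -> out.append("\\t");
                default -> {
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
                }
            }
        }
        return out.toString();
    }

    // Hypothetical switch: human-readable output locally, JSON everywhere else.
    static Formatter forEnvironment(String environment) {
        return "local".equals(environment) ? new SimpleFormatter() : new JsonLineFormatter();
    }
}
```

The chosen formatter would then be set on the handler in use, for example with `handler.setFormatter(JsonLineFormatter.forEnvironment(env))`.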
There is no such thing as bug-free software. Therefore, every time an error occurs in any environment, it should be logged. Think about the following:
-
What information should be logged to help you to debug the issue? (No personal or confidential data)
-
Is there enough contextual information?
-
Is it clear who is the culprit?
-
Is the level appropriate? (e.g. error or warn)
-
Are we adding too much noise to the log?
This is not a one-time task, but a continuous process. Every time you find an issue, you should review the logging configuration and the log messages to ensure that they are appropriate for debugging and post-mortem analysis.
If, for some reason, you have to enable the debug level in production, you should consider the following:
-
If possible, only enable the debug level for the component that is failing, not for the whole application.
-
Ensure that the debug level of those components does not log personal or confidential data.
If possible, restrict the traffic that you want to log, to avoid security or performance issues.
Logging too much too often can lead to a situation where the log system becomes overwhelmed, degrading performance, filling up storage, and making it difficult to find relevant information.
This situation could happen when there is an error in a third-party library/dependency, and every request ends up logging the same error over and over again.
One way to avoid this situation and make the application resilient is to use a circuit breaker pattern. This pattern allows you to detect when a service is failing and stop sending requests to it. This addresses not only the logging issue, but also the performance and availability problems that could arise from a failing service. Before writing the log, the code should check whether the circuit breaker is open or not (e.g. normally the exception is different when the circuit is open).
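A minimal sketch of that check, assuming the circuit breaker signals an open circuit with its own exception type. `CircuitOpenException`, `DependencyFailureLogger` and the logger name are hypothetical; real circuit breaker libraries provide their own exception classes.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical exception thrown by a circuit breaker while the circuit is open.
final class CircuitOpenException extends RuntimeException {
    CircuitOpenException(String message) {
        super(message);
    }
}

final class DependencyFailureLogger {

    static final Logger LOGGER = Logger.getLogger("org.example.application.payments");

    // An open circuit is an already known condition, so it is kept out of the
    // error log; only real failures of the dependency are reported as errors.
    static Level levelFor(RuntimeException failure) {
        return failure instanceof CircuitOpenException ? Level.FINE : Level.SEVERE;
    }

    static void report(RuntimeException failure) {
        LOGGER.log(levelFor(failure), "Payment provider call failed", failure);
    }
}
```

With this in place, the first real failures are logged as errors, while the rejected calls that follow during the open period only appear when debug is enabled.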
MDC is a mechanism that allows you to add contextual information to log messages as key-value pairs.
Some examples:
-
Actor Information (e.g. user ID, actor type)
-
Current Aggregate ID (e.g. policy ID, account ID)
-
Trace Information (e.g. trace ID, span ID)
As always, do not log personal or confidential data. And remember to clean the MDC after the request is processed; otherwise you could leak information to other requests.
SLF4J and Logback support MDC natively, and you can use it in your application to add contextual information to your log messages:
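    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    class RequestProcessor {
        private static final Logger logger = LoggerFactory.getLogger(RequestProcessor.class);

        public void processRequest(String userId, String policyId) {
            // Add contextual information to the MDC
            MDC.put("userId", userId);
            MDC.put("policyId", policyId);
            try {
                // Process the request; log entries can include userId and policyId
                // if the configured layout prints MDC values
                logger.debug("Processing request");
            } finally {
                // Clean up the MDC so the values do not leak into other requests
                MDC.remove("userId");
                MDC.remove("policyId");
            }
        }
    }

-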
Avoid superfluous string concatenation when possible. Examples:
-
SLF4J: Instead of logger.trace("Number of bananas: " + bananas.length);, do logger.trace("Number of bananas: {}", bananas.length);
-
-
If an item requires extensive logic or creating resources, use conditional logging or suppliers. Examples:
-
SLF4J: Use logger.isDebugEnabled() or logger.atDebug()
-
Log4j: Use logger.debug(() -> "Number of bananas: " + someExpensiveCalculations());
-
-
Add component information: Almost every log library can create child loggers or add component information to every entry. This is helpful not only to trace back the component, but also to enable or disable levels per component. Examples:
-
Logback (SLF4J) level for package: <logger name="chapters.configuration" level="INFO"/>
-
Spring Boot (SLF4J) level for package: logging.level.org.example.application.domain=INFO
-
-
Do not log errors more than once: If you throw an exception, do not log it as well. Use catch or interception mechanisms (e.g. an advisor) to log it only once, and rely on the stack trace to find the root cause.
-
Banners and logos: We all love a good banner, but they are not useful in production. They are just noise and take up space. Disable them by default, especially in production or when structured logging is enabled.
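For the "do not log errors more than once" practice above, a minimal sketch with java.util.logging: the inner layer wraps and rethrows without logging, and the boundary logs the failure a single time with the whole cause chain. `OrderService` and `OrderController` are hypothetical names, and logging a rejected request at warn is an assumption for this example.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class OrderService {

    // Inner layer: adds context and rethrows, but does not log.
    static int parseQuantity(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            // The original exception is kept as the cause so the root cause stays visible.
            throw new IllegalArgumentException("Invalid quantity: " + raw, e);
        }
    }
}

final class OrderController {

    static final Logger LOGGER = Logger.getLogger(OrderController.class.getName());

    // Boundary: the only place where the failure is logged.
    static String handle(String rawQuantity) {
        try {
            return "quantity=" + OrderService.parseQuantity(rawQuantity);
        } catch (IllegalArgumentException e) {
            // Passing the exception logs its stack trace and its cause in one entry.
            LOGGER.log(Level.WARNING, "Rejected order request", e);
            return "rejected";
        }
    }
}
```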
-
Java Libraries
-
SLF4J: Simple Logging Facade for Java
-
Logback: SLF4J implementation
-
-
Node.js
-
Best practices
-
Logging Levels
@jabrena Thank you for your comments, I will try to address all your points.
Optional and Result approach, but you are going to lose some valuable information, for example, the stack trace. Filling in the stack trace is a costly operation, but the cost will be negligible in most of our applications, which do plenty of parsing and IO. The actual problem is deciding inside the library what is exceptional for the consumer. An HTTP client library could decide to throw an exception when it gets a 500 status, but maybe the application that uses the library monitors the status of several distributed services, so a 500 is expected and not exceptional. But that is a subject out of the scope of this document.
Stay tuned for the next revision!