Application Logging (in IBM BPM using Splunk)
My new favorite thing is good application logging (and what it allows you to achieve). I know that the likes of Google, Facebook etc. have been doing it for decades but in the old-school Enterprise it is still relatively uncommon.
1. Discipline
You can only reap the rewards if you take a disciplined approach to application logging. This means a couple of things:
1.1. Consistency
It's important to route all logging through a single interface. This is software development 101, but it needs to be put in place as early as possible (retrofitting it later can become very expensive).
In IBM BPM we use a set of server-side JavaScript functions with corresponding Services. The benefit of having a server-side JavaScript function (as opposed to just using Services) is that you can leverage the logging in more places (e.g. in the catch block of a Server Script). Please note, though, that the Service should always be the preferred approach.
Having a single point of contact makes it easy to implement (and maintain) consistent log syntax and semantics. An example function (and its corresponding Service) is shown below. The value of the "name=value" structure will become apparent when we look at the rewards.
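As a rough illustration only (this is not the project's actual code), a single entry point that emits "name=value" pairs might look something like the sketch below. The helper formatLogEntry, the logSink variable and the sample values are all hypothetical. The field names (logType, action, app, userId, processId) are borrowed from the searches later in this post, and console.log stands in for whatever logger the platform provides.

```javascript
// Hypothetical sink: swap in the platform's own logger here.
var logSink = function (line) { console.log(line); };

// Hypothetical helper: turn a plain object into name="value" pairs.
function formatLogEntry(fields) {
  var parts = [];
  for (var name in fields) {
    if (!Object.prototype.hasOwnProperty.call(fields, name)) continue;
    var value = fields[name];
    // Skip values that do not exist so the entry stays compact.
    if (value === undefined || value === null || value === "") continue;
    // Swap embedded double quotes for single quotes so each value stays one quoted token.
    parts.push(name + '="' + String(value).replace(/"/g, "'") + '"');
  }
  return parts.join(" ");
}

// Single point of contact for ACTION entries.
function logAction(action, context) {
  var fields = { logType: "ACTION", action: action };
  for (var key in context) {
    if (Object.prototype.hasOwnProperty.call(context, key)) fields[key] = context[key];
  }
  var line = formatLogEntry(fields);
  logSink(line);
  return line;
}

// Example call with made-up values:
logAction("File Upload", { app: "BTFP", userId: "jsmith", processId: 624970 });
// writes: logType="ACTION" action="File Upload" app="BTFP" userId="jsmith" processId="624970"
```

Because every entry goes through the same formatter, a change to the log syntax only has to be made in one place.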
1.2. Grouping
You will need to strike a balance between one-size-fits-all logging and having log entries that are appropriate for the information being recorded. I follow a less-is-more approach, and our interface is relatively simple:
function logPerformance(component, duration, context) {}
function logAction(action, context) {}
function logError(component, error, context) {}
function logInfo(info, context) {}
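To give a feel for how such a small interface might be used together, here is a hedged sketch. The four functions are stand-ins that simply record entries in an array, the timed wrapper is hypothetical, and the "ERROR" and "INFO" logType values are assumptions (only "ACTION" and "PERFORMANCE" appear in the searches later in this post).

```javascript
// Stand-in implementations that record entries in memory (illustration only).
var entries = [];

function logPerformance(component, duration, context) {
  entries.push({ logType: "PERFORMANCE", component: component, duration: duration, context: context });
}
function logAction(action, context) {
  entries.push({ logType: "ACTION", action: action, context: context });
}
function logError(component, error, context) {
  var message = (error && error.message) ? error.message : String(error);
  entries.push({ logType: "ERROR", component: component, error: message, context: context });
}
function logInfo(info, context) {
  entries.push({ logType: "INFO", info: info, context: context });
}

// Hypothetical wrapper: time a unit of work, log any error from the catch
// block, and always log the duration (in milliseconds).
function timed(component, context, work) {
  var start = Date.now();
  try {
    return work();
  } catch (e) {
    logError(component, e, context);
    throw e; // rethrow so the caller still sees the failure
  } finally {
    logPerformance(component, Date.now() - start, context);
  }
}
```

A call such as timed("Load Case", { processId: "624970" }, loadCase), where loadCase is whatever function does the work, would then produce one PERFORMANCE entry per call, plus an ERROR entry if the work throws.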
1.3. Coverage
You will need to call the application logging whenever appropriate (but only when appropriate). Gaps in coverage mean that you can't paint clear pictures (or draw firm conclusions) from your log entries, while excessive entries mean that you can't see the wood for the trees.
This is an essential part of maximizing the value of application logging and should be under constant review. It's also very important that different team members are equally judicious in their use of application logging.
1.4. Detail
Common details (e.g. User, Task, Process) should be included in log entries (whenever these values exist). Similarly, entry-specific detail (e.g. relevant business data) should also be included if it's likely to be helpful in the future.
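One possible way to apply this rule is to merge the common details with the entry-specific ones just before logging, leaving out anything that has no value. The sketch below is an assumption rather than the actual implementation: buildContext, documentType and the sample values are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: combine common details (user, task, process) with
// entry-specific details, dropping any value that does not exist.
function buildContext(common, specific) {
  var context = {};
  var sources = [common || {}, specific || {}];
  for (var i = 0; i < sources.length; i++) {
    var source = sources[i];
    for (var key in source) {
      if (!Object.prototype.hasOwnProperty.call(source, key)) continue;
      var value = source[key];
      if (value === undefined || value === null) continue;
      // Entry-specific values are processed last, so they win on a name clash.
      context[key] = value;
    }
  }
  return context;
}

// Example with made-up values: there is no current task, so taskId is omitted.
var context = buildContext(
  { userId: "jsmith", taskId: null, processId: "624970" },
  { documentType: "Invoice" }
);
// context is { userId: "jsmith", processId: "624970", documentType: "Invoice" }
```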
2. Rewards
2.1. Cost of Curiosity
It's common to have to answer ad-hoc questions about applications. A combination of disciplined code and appropriate tools (e.g. Splunk) makes it very easy to ask these questions. Even better is when the answers to these questions prompt new questions (ones that wouldn't even have been thought of before).
2.1.1. Give me a complete Audit History for Case X. Which Users did what and when?
index="btbpmprd" app="BTFP" processId="624970"
2.1.2. Which pages took more than five seconds to load today?
index="btbpmprd" app="BTFP" duration>5000
2.1.3. What's the busiest day of the week or time of the day?
index="btbpmprd" app="BTFP" logType="ACTION" | timechart span=1h count
2.1.4. Which Users upload the most Documents?
index="btbpmprd" app="BTFP" logType="ACTION" action="File Upload" | timechart span=1d count by userId limit=100
2.1.5. How much faster do pages load for Users with IE10 than Users with IE8?
index="btbpmprd" app="BTFP" logType="PERFORMANCE" | timechart avg(totalTime) by browser
2.2. Live Documentation
We have an NFR Dashboard that we use to drive our Performance Testing. It presents the volume (and, where appropriate, response times) of all the important activity. It's always up to date and it can be queried over any desired interval.
We can leverage the same Dashboard in our Production and Performance Test environments to ensure that our Performance Test runs provide an accurate model of Production.
2.3. System Status
We have a Production Status Dashboard that lets us monitor (in pretty much real time) what's going on with the application. It captures key user actions, errors, the number of active users and the flow of activity into and out of the system.