Tuesday, June 14, 2016

Troubleshooting Real World Examples

1. Troubleshooting an AWS Java stack trace. Yuzu. Concepts required to solve this problem: serialization, accessibility of ports in AWS.
2. Fixing a mysterious bug caused by a violated invariant. Streetline.
3. Fixing caching issue. Yuzu. When to invalidate the cache?
4. IoT project. Health check API. Cisco. How to isolate the problem?
5. The system coming to a halt for a mysterious reason. Asia Online.

The cause of the last one turned out to be a script that populated the database and inserted records that violated the database constraints. This choked the system and brought it down.

Any dependency or integration point should have a test. Why? Because it helps you isolate the problem quickly. For example, write a test for the integration with Mailchimp or SendGrid. Any assumption about the infrastructure should also have a test, for example that the automated build does in fact complete successfully. In one case, using droid to build Docker images and deploy the web app failed silently. Automated tests can separate an environment issue from a code issue.
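One way to picture such tests is a small set of smoke checks, one per dependency, that report which one is failing. The sketch below is only an illustration: `run_smoke_checks` and the check names are hypothetical, and the lambdas are stand-ins for real probes such as a test call to a mail provider.

```ruby
# Hypothetical helper: runs one check per dependency and records the outcome.
# Each check is a callable that returns true when the dependency behaves.
def run_smoke_checks(checks)
  checks.each_with_object({}) do |(name, check), results|
    results[name] = begin
      check.call ? :ok : :failed
    rescue StandardError => e
      # Keep going so one broken dependency does not hide the others.
      "error: #{e.message}"
    end
  end
end

# Stand-ins for real probes; both names are assumptions for illustration.
checks = {
  mail_provider: -> { true },
  build_artifact: -> { raise "image not found" }
}

results = run_smoke_checks(checks)
results.each { |name, status| puts "#{name}: #{status}" }
```

A failing entry here points at the environment before anyone starts reading application code.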

Troubleshooting Questions

  • Did this ever work?
  • When did this stop working? Find the delta to narrow down the search space.
  • Does this happen in all environments? If not, what is different in that environment?
Unlike the learning process, where there is no stupid question, in troubleshooting there is such a thing as a stupid question. You can develop the ability to ask intelligent questions and fix problems quickly by having a good understanding of the concepts. Did you know that networks are not 100% reliable? How does your software behave when it encounters a network error? You need to have an open mind. Avoid bias when investigating the problem. Do not rule out any possibility early in the troubleshooting process.

Check Version

Check the versions of dependent software, the build version of the deployed software, the OS version, and so on.
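As a minimal sketch of collecting this information from inside a running Ruby process, the hypothetical `version_report` method below gathers the interpreter version, the platform, and the versions of the gems that are currently loaded. Comparing this output between a working and a failing environment is one quick way to spot a version difference.

```ruby
# Hypothetical helper: collects version details of the running process.
def version_report
  {
    ruby: RUBY_VERSION,        # interpreter version
    platform: RUBY_PLATFORM,   # OS and architecture the interpreter was built for
    # Gem.loaded_specs maps each loaded gem name to its specification.
    gems: Gem.loaded_specs.transform_values { |spec| spec.version.to_s }
  }
end

report = version_report
puts "ruby #{report[:ruby]} on #{report[:platform]}"
report[:gems].each { |name, version| puts "#{name} #{version}" }
```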

Problem Categories

All problems can be categorized into three main categories:

1. Environment Issue (software versions, flaky network, etc.)
2. Code Issue (bugs)
3. Data Issue (violating a data constraint)

By asking intelligent questions, you will be able to rule out certain likely causes and categorize the problem into the right category. You can then test your hypothesis to see if that will fix the problem.

Meaning of Nil

nil means something. It could mean the absence of something, or that an external system is down. Always express the meaning of nil explicitly in code, and throw an exception that describes why a variable became nil. This helps us quickly find the cause of the problem and deal with it in a timely fashion.
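A minimal sketch of this idea follows. The `ExchangeRateUnavailable` class, the `exchange_rate` method, and the `rates` hash are all hypothetical names chosen for illustration; `rates` stands in for the response of some external system.

```ruby
# Hypothetical error class: names the reason a value is missing.
class ExchangeRateUnavailable < StandardError; end

# `rates` stands in for data from an external system. Here nil is assumed to
# mean the system did not answer, and a missing key an unknown currency.
def exchange_rate(rates, currency)
  raise ExchangeRateUnavailable, "rate service returned no data (is it down?)" if rates.nil?

  # Hash#fetch runs the block only when the key is absent.
  rates.fetch(currency) do
    raise ExchangeRateUnavailable, "no rate for currency #{currency.inspect}"
  end
end

puts exchange_rate({ "EUR" => 0.9 }, "EUR")
```

The two failure cases raise different messages, so whoever reads the log does not have to guess which meaning of nil applied.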

Fail Loudly

Why?
How?

Isolating the Problem

  • Always start with a working version.
  • If the current version is broken, revert to the last working version.
  • Keep removing code until the problem disappears. Then gradually add code back until the problem reappears. The last piece you added points to the cause.
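When the working and broken versions are many changes apart, the search can be halved at each step instead of moving one change at a time. This is the idea that `git bisect` automates. The sketch below is an illustration under stated assumptions: `first_bad_revision` is a hypothetical helper, the revisions are ordered oldest to newest, the first one is known to work, the last one is known to be broken, and the block stands in for whatever test reproduces the problem.

```ruby
# Hypothetical helper: returns the first revision for which the block is true.
# Assumes revisions[0] works and revisions[-1] is broken.
def first_bad_revision(revisions, &broken)
  good = 0                     # index of a revision known to work
  bad = revisions.length - 1   # index of a revision known to fail
  while bad - good > 1
    mid = (good + bad) / 2
    if broken.call(revisions[mid])
      bad = mid                # problem already present: look earlier
    else
      good = mid               # still working: look later
    end
  end
  revisions[bad]
end

revisions = %w[r1 r2 r3 r4 r5 r6 r7 r8]
# Stand-in test: pretend the problem was introduced in r6.
culprit = first_bad_revision(revisions) { |rev| rev >= "r6" }
puts culprit
```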

Example of fixing Lightbox 2 with Rails 5?
