Python application crashes without throwing any exception


PROBLEM:

I recently wrote a Flask application and ran into a problem: the application died at random. It provides a simple REST API for querying some personal information and serves both HTTP and HTTPS. I ran the app directly with the python command, like this:
python my_app.py

(Yeah, I know that this way of running it is for development only and should not be used in production.)
Everything went well for a couple of tests until the rainy time; literally, there was heavy rain outside. I sent an HTTPS request using Postman and the command above just stopped. I tried several times, and the issue was intermittent: sometimes the app crashed, sometimes it worked fine.
Nothing was printed to the screen. The command simply ended and dropped me back at the terminal prompt.

HOW I DEAL WITH IT:

I was just a novice in Python and was expecting an exception that would tell me exactly why my application had crashed. Well, there was nothing. I googled the problem and got a hint that the process might have been killed by the OOM killer. I checked syslog: no low-memory situation, no kill message.
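The syslog check can be automated with a few lines of Python. This is only a sketch under assumptions: the log path /var/log/syslog is what Debian and Ubuntu use (other distributions differ), and the helper names find_oom_lines and scan_log are made up for illustration. It looks for the substrings the Linux kernel writes when the OOM killer acts.

```python
# Substrings the Linux kernel logs when the OOM killer kills a process.
OOM_MARKERS = ("out of memory", "oom-kill")


def find_oom_lines(lines):
    """Return the log lines that mention the OOM killer (case-insensitive)."""
    return [
        line.rstrip("\n")
        for line in lines
        if any(marker in line.lower() for marker in OOM_MARKERS)
    ]


def scan_log(path="/var/log/syslog"):
    # Assumption: Debian/Ubuntu-style log location; adjust for your system.
    with open(path, errors="replace") as log:
        return find_oom_lines(log)


if __name__ == "__main__":
    try:
        hits = scan_log()
    except OSError as exc:
        # The file may not exist or may need root to read.
        print("could not read the log:", exc)
    else:
        print("\n".join(hits) if hits else "no OOM killer messages found")
```

An empty result matches what I saw: the OOM killer was not involved.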
I even wondered whether something was wrong with the python program itself and whether crashing meant being killed by some other process. I used strace to trace every signal sent to the python process. My bash routine is below.
# look for the process ID (pid) of my python application
ps -ef | grep python

# invoke strace, replace <PID> with the PID found in previous command
sudo strace -p <PID>

# reproduce the scenario that makes python crash
# watch the strace output

And BINGO! I found the reason why the python app crashed. There was a SIGSEGV, which means a memory issue: the process touched memory it was not allowed to access. Why didn't I notice this sooner?!
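If strace is not at hand, the exit status already tells which signal ended the process. Below is a minimal sketch, assuming a POSIX system: the child process is a stand-in for the crashing app (it sends SIGSEGV to itself), and describe_exit is a hypothetical helper name. On POSIX, subprocess reports "killed by signal N" as a return code of -N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable explanation."""
    if returncode < 0:
        # A negative value means the child was terminated by that signal.
        return "killed by " + signal.Signals(-returncode).name
    return "exited normally with status %d" % returncode


# Stand-in for the real app: a child that sends SIGSEGV to itself.
CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

result = subprocess.run([sys.executable, "-c", CHILD])
print(describe_exit(result.returncode))
```

To use it on a real app, replace the child command with something like [sys.executable, "my_app.py"].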
There are several ways to deal with a memory issue. One of them is getting a core dump and debugging it with gdb. I found an interesting and useful article about this. It's worth going through: How to get a core dump for a segfault on Linux

In summary, there are 3 steps to get and analyze a core file

1. Set the max size of a core dump and the place where it is written
ulimit -c unlimited
sudo sysctl -w kernel.core_pattern=/tmp/core-%e.%p.%h.%t

# the sysctl command prints the new value, e.g.
# kernel.core_pattern = /tmp/core-%e.%p.%h.%t

2. Run the program under test (here, my python command)
Run your program from the same shell where you ran ulimit (the limit applies only to that shell and its children), then reproduce the scenario that makes it die (SIGSEGV).
A core file will be created when the crash happens.
E.g. /tmp/core-python.12924.localhost.1560412461

3. Debug using gdb
# pass the crashed executable too, so gdb can resolve symbols
sudo gdb "$(which python)" -c /tmp/core-python.12924.localhost.1560412461

Some helpful commands in gdb terminal:

- load symbols for all shared libraries
sharedlibrary

- display stack trace
backtrace

- list all threads
info threads
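Step 1 depends on the shell's ulimit, which does not help when the app is started by something other than that shell. The core size limit can also be raised from inside Python with the standard resource module. This is a minimal sketch, assuming Linux or another Unix; it only covers the size limit, so kernel.core_pattern still has to be set with sysctl.

```python
import resource

# Read the current soft and hard limits for the core file size.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

# Raise the soft limit as far as the hard limit allows. An unprivileged
# process cannot go beyond the hard limit.
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

new_soft, _ = resource.getrlimit(resource.RLIMIT_CORE)
if new_soft == resource.RLIM_INFINITY:
    print("core size limit: unlimited")
else:
    print("core size limit:", new_soft, "bytes")
```

Put this at the very top of the app, before the code that may crash. If the hard limit is 0, no core file will be written and the limit has to be raised by root first.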

With the steps above I was able to see which code caused the memory issue. It was a third-party shared library written in C.

CONCLUSION:

Python cannot detect or prevent low-level failures such as being killed by the OS or a segmentation fault in native code, so don't expect too much from the try/except mechanism. After all, it has to follow the rules of the OS: some signals can never be caught (SIGKILL, for example), and a segfault in a C library takes down the whole process before any exception can be raised, so the program has no choice but to shut down. Next time your program dies unexpectedly and without any information, I hope you'll find that strace and gdb are your friends.
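The point about try/except is easy to check with a small experiment. The sketch below assumes Linux and Python 3.7 or later. The child process provokes a segfault on purpose by reading from address 0 through ctypes, inside a try block that catches everything. It also enables faulthandler, a standard library module that I did not use in the debugging story above; it is added here only to show that Python can at least print where it was when the signal arrived.

```python
import signal
import subprocess
import sys

CHILD = r'''
import ctypes
import faulthandler

faulthandler.enable()        # print the Python traceback on fatal signals
try:
    ctypes.string_at(0)      # read from address 0: a deliberate segfault
except BaseException:
    print("handler ran")     # never reached, because no exception is raised
'''

result = subprocess.run(
    [sys.executable, "-c", CHILD],
    capture_output=True,
    text=True,
)

print("except block ran:", "handler ran" in result.stdout)
print("killed by SIGSEGV:", result.returncode == -signal.SIGSEGV)
print("faulthandler report:")
print(result.stderr)
```

The except block never runs, yet the faulthandler report names the Python line that called into native code. That is a cheap first clue before reaching for a core dump and gdb.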
