New – Trigger a Kernel Panic to Diagnose Unresponsive EC2 Instances

When I was working on systems deployed in on-premises data centers, it sometimes happened I had to debug an unresponsive server. It usually involved asking someone to physically press a non-maskable interrupt (NMI) button on the frozen server or to send a signal to a command controller over a serial interface (yes, serial, such as in RS-232).This command triggered the system to dump the state of the frozen kernel to a file for further analysis. Such a file is usually called a core dump or a crash dump. The crash dump includes an image of the memory of the crashed process, the system registers, program counter, and other information useful in determining the root cause of the freeze.

Today, we are announcing a new Amazon Elastic Compute Cloud (EC2) API allowing you to remotely trigger the generation of a kernel panic on EC2 instances. The EC2:SendDiagnosticInterrupt API sends a diagnostic interrupt, similar to pressing a NMI button on a physical machine, to a running EC2 instance. It causes the instance’s hypervisor to send a non-maskable interrupt (NMI) to the operating system. The behaviour of your operating system when a NMI interrupt is received depends on its configuration. Typically, it involves entering into kernel panic. The kernel panic behaviour also depends on the operating system configuration, it might trigger the generation of the crash dump data file, obtain a backtrace, load a replacement kernel or restart the system.

You can control who in your organisation is authorized to use that API through IAM Policies, I will give an example below.

Cloud and System Engineers, or specialists in kernel diagnosis and debugging, find in the crash dump invaluable information to analyse the causes of a kernel freeze. Tools like WinDbg (on Windows) and crash (on Linux) can be used to inspect the dump.

Using Diagnostic Interrupt
Using this API is a three step process. First you need to configure the behavior of your OS when it receives the interrupt.

By default, our Windows Server AMIs have memory dump already turned on. Automatic restart after the memory dump has been saved is also selected. The default location for the memory dump file is %SystemRoot% which is equivalent to C:Windows.

You can access these options by going to :
Start > Control Panel > System > Advanced System Settings > Startup and Recovery

On Amazon Linux 2, you need to install and configurekdump & kexec. This is a one-time setup.

$ sudo yum install kexec-tools

Then edit the file /etc/default/grub to allocate the amount of memory to be reserved for the crash kernel. In this example, we reserve 160M by adding crashkernel=160M. The amount of memory to allocate depends on your instance’s memory size. The general recommendation is to test kdump to see if the allocated memory is sufficient. The kernel doc has the full syntax of the crashkernel kernel parameter.

GRUB_CMDLINE_LINUX_DEFAULT="crashkernel=160M console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 rd.emergency=poweroff rd.shell=0"

And rebuild the grub configuration:

$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg

Finally edit /etc/sysctl.conf and add a line : kernel.unknown_nmi_panic=1. This tells the kernel to trigger a kernel panic upon receiving the interrupt.

You are now ready to reboot your instance. Be sure to include these commands in your user data script or in your AMI to automatically configure this on all your instances. Once the instance is rebooted, verify that kdump is correctly started.

$ systemctl status kdump.service
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Fri 2019-07-05 15:09:04 UTC; 3h 13min ago
  Process: 2494 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 2494 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/kdump.service

Jul 05 15:09:02 ip-172-31-15-244.ec2.internal systemd[1]: Starting Crash recovery kernel arming...
Jul 05 15:09:04 ip-172-31-15-244.ec2.internal kdumpctl[2494]: kexec: loaded kdump kernel
Jul 05 15:09:04 ip-172-31-15-244.ec2.internal kdumpctl[2494]: Starting kdump: [OK]
Jul 05 15:09:04 ip-172-31-15-244.ec2.internal systemd[1]: Started Crash recovery kernel arming.

Our documentation contains the instructions for other operating systems.

Once this one-time configuration is done, you’re ready for the second step, to trigger the API. You can do this from any machine where the AWS CLI or SDK is configured. For example :

$ aws ec2 send-diagnostic-interrupt --region us-east-1 --instance-id 

There is no return value from the CLI, this is expected. If you have a terminal session open on that instance, it disconnects. Your instance reboots. You reconnect to your instance, you find the crash dump in /var/crash.

The third and last step is to analyse the content of the crash dump. On Linux systems, you need to install the crash utility and the debugging symbols for your version of the kernel. Note that the kernel version should be the same that was captured by kdump. To find out which kernel you are currently running, use the uname -r command.

$ sudo yum install crash
$ sudo debuginfo-install kernel
$ sudo crash /usr/lib/debug/lib/modules/4.14.128-112.105.amzn2.x86_64/vmlinux /var/crash/127.0.0.1-2019-07-05-15:08:43/vmcore

crash 7.2.6-1.amzn2.0.1
... output suppressed for brevity ...

      KERNEL: /usr/lib/debug/lib/modules/4.14.128-112.105.amzn2.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2019-07-05-15:08:43/vmcore  [PARTIAL DUMP]
        CPUS: 2
        DATE: Fri Jul  5 15:08:38 2019
      UPTIME: 00:07:23
LOAD AVERAGE: 0.00, 0.00, 0.00
       TASKS: 104
    NODENAME: ip-172-31-15-244.ec2.internal
     RELEASE: 4.14.128-112.105.amzn2.x86_64
     VERSION: #1 SMP Wed Jun 19 16:53:40 UTC 2019
     MACHINE: x86_64  (2500 Mhz)
      MEMORY: 7.9 GB
       PANIC: "Kernel panic - not syncing: NMI: Not continuing"
         PID: 0
     COMMAND: "swapper/0"
        TASK: ffffffff82013480  (1 of 2)  [THREAD_INFO: ffffffff82013480]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

Collecting kernel crash dumps is often the only way to collect kernel debugging information, be sure to test this procedure frequently, in particular after updating your operating system or when you will create new AMIs.

Control Who Is Authorized to Send Diagnostic Interrupt
You can control who in your organisation is authorized to send the Diagnostic Interrupt, and to which instances, through IAM policies with resource-level permissions, like in the example below.

{
   "Version": "2012-10-17",
   "Statement": [
      {
      "Effect": "Allow",
      "Action": "ec2:SendDiagnosticInterrupt",
      "Resource": "arn:aws:ec2:region:account-id:instance/instance-id"
      }
   ]
}

Pricing
There are no additional charges for using this feature. However, as your instance continues to be in a ‘running’ state after it receives the diagnostic interrupt, instance billing will continue as usual.

Availability
You can send Diagnostic Interrupts to all EC2 instances powered by the AWS Nitro System, except A1 (Arm-based). This is C5, C5d, C5n, i3.metal, I3en, M5, M5a, M5ad, M5d, p3dn.24xlarge, R5, R5a, R5ad, R5d, T3, T3a, and Z1d as I write this.

The Diagnostic Interrupt API is now available in all public AWS Regions and GovCloud (US), you can start to use it today.

— seb

Originally posted on AWS News Blog
Author: Sébastien Stormacq

Deja una respuesta

Tu dirección de correo electrónico no será publicada. Los campos obligatorios están marcados con *