Simplified Time-Series Analysis with Amazon CloudWatch Contributor Insights

Inspecting multiple log groups and log streams can make it more difficult and time consuming to analyze and diagnose the impact of an issue in real time. What customers are affected? How badly? Are some affected more than others, or are outliers? Perhaps you performed deployment of an update using a staged rollout strategy and now want to know if any customers have hit issues or if everything is behaving as expected for the target customers before continuing further. All of the data points to help answer these questions is potentially buried in a mass of logs which engineers query to get ad-hoc measurements, or build and maintain custom dashboards to help track.

Amazon CloudWatch Contributor Insights, generally available today, is a new feature to help simplify analysis of Top-N contributors to time-series data in CloudWatch Logs that can help you more quickly understand who or what is impacting system and application performance, in real-time, at scale. This saves you time during an operational problem by helping you understand what is contributing to the operational issue and who or what is most affected. Amazon CloudWatch Contributor Insights can also help with ongoing analysis for system and business optimization by easily surfacing outliers, performance bottlenecks, top customers, or top utilized resources, all at a glance. In addition to logs, Amazon CloudWatch Contributor Insights can also be used with other products in the CloudWatch portfolio, including Metrics and Alarms.

Amazon CloudWatch Contributor Insights can analyze structured logs in either JSON or Common Log Format (CLF). Log data can be sourced from Amazon Elastic Compute Cloud (EC2) instances, AWS CloudTrail, Amazon Route 53, Apache Access and Error Logs, Amazon Virtual Private Cloud (VPC) Flow Logs, AWS Lambda Logs, and Amazon API Gateway Logs. You also have the choice of using structured logs published directly to CloudWatch, or using the CloudWatch Agent. Amazon CloudWatch Contributor Insights will evaluate these log events in real-time and display reports that show the top contributors and number of unique contributors in a dataset. A contributor is an aggregate metric based on dimensions contained as log fields in CloudWatch Logs, for example account-id, or interface-id in Amazon Virtual Private Cloud Flow Logs, or any other custom set of dimensions. You can sort and filter contributor data based on your own custom criteria. Report data from Amazon CloudWatch Contributor Insights can be displayed on CloudWatch dashboards, graphed alongside CloudWatch metrics, and added to CloudWatch alarms. For example customers can graph values from two Amazon CloudWatch Contributor Insights reports into a single metric describing the percentage of customers impacted by faults, and configure alarms to alert when this percentage breaches pre-defined thresholds.

Getting Started with Amazon CloudWatch Contributor Insights
To use Amazon CloudWatch Contributor Insights I simply need to define one or more rules. A rule is simply a snippet of data that defines what contextual data to extract for metrics reported from CloudWatch Logs. To configure a rule to identify the top contributors for a specific metric I supply three items of data – the log group (or groups), the dimensions for which the top contributors are evaluated, and filters to narrow down those top contributors. To do this, I head to the Amazon CloudWatch console dashboard and select Contributor Insights from the left-hand navigation links. This takes me to the Amazon CloudWatch Contributor Insights home where I can click Create a rule to get started.

To get started quickly, I can select from a library of sample rules for various services that send logs to CloudWatch Logs. You can see above that there are currently a variety of sample rules for Amazon API Gateway, Amazon Route 53 Query Logs, Amazon Virtual Private Cloud Flow Logs, and logs for container services. Alternatively, I can define my own rules, as I’ll do in the rest of this post.

Let’s say I have a deployed application that is publishing structured log data in JSON format directly to CloudWatch Logs. This application has two API versions, one that has been used for some time and is considered stable, and a second that I have just started to roll out to my customers. I want to know as early as possible if anyone who has received the new version, targeting the new api, is receiving any faults and how many faults are being triggered. My stable api version is sending log data to one log group and my new version is using a different group, so I need to monitor multiple log groups (since I also want to know if anyone is experiencing any error, regardless of version).

The JSON to define my rule, to report on 500 errors coming from any of my APIs, and to use account ID, HTTP method, and resource path as dimensions, is shown below.

{
  "Schema": {
    "Name": "CloudWatchLogRule",
    "Version": 1
  },
  "AggregateOn": "Count",
  "Contribution": {
    "Filters": [
      {
        "Match": "$.status",
        "EqualTo": 500
      }
    ],
    "Keys": [
      "$.accountId",
      "$.httpMethod",
      "$.resourcePath"
    ]
  },
  "LogFormat": "JSON",
  "LogGroupNames": [
    "MyApplicationLogsV*"
  ]
}

I can set up my rule using either the Wizard tab, or I can paste the JSON above into the Rule body field on the Syntax tab. Even though I have the JSON above, I’ll show using the Wizard tab in this post and you can see the completed fields below. When selecting log groups I can either select them from the drop down, if they already exist, or I can use wildcard syntax in the Select by prefix match option (MyApplicationLogsV* for example).

Clicking Create saves the new rule and makes it immediately start processing and analyzing data (unless I elect to create it in disabled state of course). Note that Amazon CloudWatch Contributor Insights processes new log data created once the rule is active, it does not perform historical inspection, so I need to build rules for operational scenarios that I anticipate happening in future.

With the rule in place I need to start generating some log data! To do that I’m going to use a script, written using the AWS Tools for PowerShell, to simulate my deployed application being invoked by a set of customers. Of those customers, a select few (let’s call them the unfortunate ones) will be directed to the new API version which will randomly fail on HTTP POST requests. Customers using the old API version will always succeed. The script, which runs for 5000 iterations, is shown below. The cmdlets being used to work with CloudWatch Logs are the ones with CWL in the name, for example Write-CWLLogEvent.

# Set up some random customer ids, and select a third of them to be our unfortunates
# who will experience random errors due to a bad api update being shipped!
$allCustomerIds = @( 1..15 | % { Get-Random })
$faultingIds = $allCustomerIds | Get-Random -Count 5

# Setup some log groups
$group1 = 'MyApplicationLogsV1'
$group2 = 'MyApplicationLogsV2'
$stream = "MyApplicationLogStream"

# When writing to a log stream we need to specify a sequencing token
$group1Sequence = $null
$group2Sequence = $null

$group1, $group2 | % {
    if (!(Get-CWLLogGroup -LogGroupName $_)) {
        New-CWLLogGroup -LogGroupName $_
        New-CWLLogStream -LogGroupName $_ -LogStreamName $stream
    } else {
        # When the log group and stream exist, we need to seed the sequence token to
        # the next expected value
        $logstream = Get-CWLLogStream -LogGroupName $_ -LogStreamName $stream
        $token = $logstream.UploadSequenceToken
        if ($_ -eq $group1) {
            $group1Sequence = $token
        } else {
            $group2Sequence = $token
        }
    }
}

# generate some log data with random failures for the subset of customers
1..5000 | % {

    Write-Host "Log event iteration $_" # just so we know where we are progress-wise

    $customerId = Get-Random $allCustomerIds

    # first select whether the user called the v1 or the v2 api
    $useV2Api = ((Get-Random) % 2 -eq 1)
    if ($useV2Api) {
        $resourcePath = '/api/v2/some/resource/path/'
        $targetLogGroup = $group2
        $nextToken = $group2Sequence
    } else {
        $resourcePath = '/api/v1/some/resource/path/'
        $targetLogGroup = $group1
        $nextToken = $group1Sequence
    }

    # now select whether they failed or not. GET requests for all customers on
    # all api paths succeed. POST requests to the v2 api fail for a subset of
    # customers.
    $status = 200
    $errorMessage = ''
    if ((Get-Random) % 2 -eq 0) {
        $httpMethod = "GET"
    } else {
        $httpMethod = "POST"
        if ($useV2Api -And $faultingIds.Contains($customerId)) {
            $status = 500
            $errorMessage = 'Uh-oh, something went wrong...'
        }
    }

    # Write an event and gather the sequence token for the next event
    $nextToken = Write-CWLLogEvent -LogGroupName $targetLogGroup -LogStreamName $stream -SequenceToken $nextToken -LogEvent @{
        TimeStamp = [DateTime]::UtcNow
        Message = (ConvertTo-Json -Compress -InputObject @{
            requestId = [Guid]::NewGuid().ToString("D")
            httpMethod = $httpMethod
            resourcePath = $resourcePath
            status = $status
            protocol = "HTTP/1.1"
            accountId = $customerId
            errorMessage = $errorMessage
        })
    }

    if ($targetLogGroup -eq $group1) {
        $group1Sequence = $nextToken
    } else {
        $group2Sequence = $nextToken
    }

    Start-Sleep -Seconds 0.25
}

I start the script running, and with my rule enabled, I start to see failures show up in my graph. Below is a snapshot after several minutes of running the script. I can clearly see a subset of my simulated customers are having issues with HTTP POST requests to the new v2 API.

From the Actions pull down in the Rules panel, I could now go on to create a single metric from this report, describing the percentage of customers impacted by faults, and then configure an alarm on this metric to alert when this percentage breaches pre-defined thresholds.

For the sample scenario outlined here I would use the alarm to halt the rollout of the new API if it fired, preventing the impact spreading to additional customers, while investigation of what is behind the increased faults is performed. Details on how to set up metrics and alarms can be found in the user guide.

Amazon CloudWatch Contributor Insights is available now to users in all commercial AWS Regions, including China and GovCloud.

— Steve

Originally posted on AWS News Blog
Author: Steve Roberts

Deja una respuesta

Tu dirección de correo electrónico no será publicada. Los campos obligatorios están marcados con *