Amazon CloudWatch

Introduction

Amazon CloudWatch is a monitoring and observability service that provides data and actionable insights for AWS resources and applications. It collects monitoring data in the form of logs, metrics, and events.

CloudWatch Overview

Key Features

  • Metrics - Collect and track AWS and custom metrics
  • Logs - Centralize logs from all systems
  • Alarms - React to metric thresholds
  • Dashboards - Visualize metrics and logs
  • Events/EventBridge - Respond to system events
  • Insights - Query and analyze logs
  • Anomaly Detection - ML-powered metric anomalies

When to Use

Ideal Use Cases

  • Infrastructure monitoring - EC2, RDS, Lambda metrics
  • Application monitoring - Custom metrics, logs
  • Operational visibility - Dashboards, alerts
  • Log aggregation - Centralize all logs
  • Troubleshooting - Query logs, trace issues
  • Auto scaling triggers - Scale based on metrics

Core Components

Metrics

  • Time-ordered data points
  • Organized by namespaces (AWS/EC2, Custom)
  • Dimensions for filtering (InstanceId, etc.)
  • AWS service metrics arrive at basic (5-min) or detailed (1-min) monitoring intervals, depending on the service and settings

Common AWS Metrics

Service | Key Metrics
EC2     | CPUUtilization, NetworkIn, NetworkOut, DiskReadOps
RDS     | DatabaseConnections, FreeStorageSpace, ReadLatency
Lambda  | Invocations, Duration, Errors, Throttles
ELB     | RequestCount, TargetResponseTime, HTTPCode_* counts (e.g. HTTPCode_Target_5XX_Count)
S3      | BucketSizeBytes, NumberOfObjects

Logs

  • Log groups - containers for log streams
  • Log streams - sequence of events from same source
  • Log events - individual log entries
  • Retention configurable (1 day to never expire)
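
Retention is set per log group, and the API accepts only specific day counts rather than arbitrary numbers. The sketch below rounds a desired retention up to the next accepted value before calling `put_retention_policy`. The list of accepted values reflects the API reference at the time of writing and should be checked against it; `snap_retention` and `set_retention` are hypothetical helper names.

```python
# Day counts accepted by the CloudWatch Logs PutRetentionPolicy API
# (assumption: taken from the API reference at the time of writing).
VALID_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545,
    731, 1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def snap_retention(desired_days):
    """Return the smallest accepted retention period >= desired_days."""
    for days in VALID_RETENTION_DAYS:
        if days >= desired_days:
            return days
    # Longer than the largest finite value: cap at the maximum.
    return VALID_RETENTION_DAYS[-1]


def set_retention(log_group_name, desired_days):
    """Apply the snapped retention to a log group (needs AWS credentials)."""
    import boto3  # imported lazily so snap_retention works without boto3

    logs = boto3.client("logs")
    logs.put_retention_policy(
        logGroupName=log_group_name,
        retentionInDays=snap_retention(desired_days),
    )
```

For example, `set_retention("app-logs", 45)` would request 60 days, since 45 is not an accepted value.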

Alarms

  • Watch a single metric or a metric math expression
  • Three states: OK, ALARM, INSUFFICIENT_DATA
  • Actions: SNS, Auto Scaling, EC2 actions
  • Can be composite (multiple conditions)
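
As a rough illustration of how these pieces map onto the API, the sketch below builds the arguments for boto3's `put_metric_alarm`. The helper names, the default periods, and the choice of `Average` with `GreaterThanThreshold` are assumptions made for the example, not recommendations from this guide.

```python
def build_alarm_kwargs(name, namespace, metric_name, dimensions, threshold,
                       topic_arn, period=300, evaluation_periods=3,
                       datapoints_to_alarm=2):
    """Build keyword arguments for cloudwatch.put_metric_alarm."""
    return {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric_name,
        # boto3 expects dimensions as a list of Name/Value pairs.
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Average",
        "Period": period,                          # seconds per data point
        "EvaluationPeriods": evaluation_periods,   # N
        "DatapointsToAlarm": datapoints_to_alarm,  # M (must be <= N)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],               # e.g. an SNS topic ARN
    }


def create_alarm(**kwargs):
    """Create the alarm (needs AWS credentials and an existing SNS topic)."""
    import boto3  # imported lazily so the builder can be used without boto3

    boto3.client("cloudwatch").put_metric_alarm(**build_alarm_kwargs(**kwargs))
```

Keeping the argument builder separate from the API call makes the alarm definition easy to inspect or unit-test before anything is created.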

What to Be Careful About

Cost Management

  • Custom metrics - $0.30/metric/month (first 10,000)
  • Log ingestion - $0.50/GB
  • Log storage - $0.03/GB/month
  • Dashboard cost - $3/dashboard/month
  • Alarms - $0.10/alarm/month (standard)
  • API calls - GetMetricData, GetMetricStatistics charged
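
To get a feel for how these line items add up, here is a small estimator that uses only the list prices quoted above. It is a simplification: it ignores the free tier, regional differences, volume tiers beyond the first 10,000 metrics, and API call charges. `estimate_monthly_cost` is a hypothetical helper.

```python
# Prices copied from the list above (USD).
PRICES = {
    "custom_metric": 0.30,  # per metric per month
    "ingested_gb": 0.50,    # per GB of logs ingested
    "stored_gb": 0.03,      # per GB of logs stored per month
    "dashboard": 3.00,      # per dashboard per month
    "alarm": 0.10,          # per standard alarm per month
}


def estimate_monthly_cost(custom_metrics=0, ingested_gb=0.0, stored_gb=0.0,
                          dashboards=0, alarms=0):
    """Return a rough monthly cost in USD, rounded to cents."""
    total = (
        custom_metrics * PRICES["custom_metric"]
        + ingested_gb * PRICES["ingested_gb"]
        + stored_gb * PRICES["stored_gb"]
        + dashboards * PRICES["dashboard"]
        + alarms * PRICES["alarm"]
    )
    return round(total, 2)
```

With 50 custom metrics, 100 GB ingested, 200 GB stored, 2 dashboards, and 20 alarms, the parts are 15 + 50 + 6 + 6 + 2, which shows log ingestion dominating the bill.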

Log Management

  • Retention settings - Set appropriate retention (don't keep forever)
  • Log volume - High volume = high cost
  • Log format - Structured logs easier to query
  • Metric filters - Create metrics from log patterns
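
One way to produce structured logs from Python is a small JSON formatter for the standard `logging` module. This is a minimal sketch; the `level` and `message` field names and the `myapp` logger name are assumptions chosen for the example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one compact JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Compact separators avoid spaces after ':' and ',' so the output
        # has a predictable shape for pattern-based parsing.
        return json.dumps(payload, separators=(",", ":"))


def make_logger(name="myapp"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Calling `make_logger().error("payment %s failed", 42)` emits a single line containing the level and the interpolated message, which is easier to filter on than free-form text.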

Metrics

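  • Resolution - Custom metrics are standard (1-min) by default or high-resolution (down to 1-sec); basic monitoring for AWS services such as EC2 reports every 5 min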
  • Aggregation - Understand statistics (Average, Sum, Max, etc.)
  • Dimensions - Each unique combination of dimensions is a separate metric; you can't add dimensions to data already published
  • Missing data - Configure alarm behavior for missing data

Alarms

  • Evaluation periods - Too short may cause flapping
  • Data points to alarm - M out of N evaluation
  • Actions - Ensure SNS topics, IAM roles configured
  • Composite alarms - Reduce alarm noise
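
The "M out of N" rule can be illustrated with a simplified model: the alarm fires when at least M of the last N evaluated periods breach the threshold. The function below is only a sketch of that idea. It does not reproduce CloudWatch's actual handling of missing data, and `alarm_state` is a hypothetical name.

```python
def alarm_state(breaching, evaluation_periods, datapoints_to_alarm):
    """Simplified M-out-of-N evaluation.

    breaching: list of booleans, oldest first, one per period
               (True means the data point breached the threshold).
    evaluation_periods: N, the number of most recent periods considered.
    datapoints_to_alarm: M, how many of those must breach to alarm.
    """
    window = breaching[-evaluation_periods:]
    if len(window) < evaluation_periods:
        # Not enough history yet to evaluate the full window.
        return "INSUFFICIENT_DATA"
    return "ALARM" if sum(window) >= datapoints_to_alarm else "OK"
```

With N = 1 and M = 1, a single noisy data point flips the state, which is the flapping mentioned above. Requiring, say, 2 of 3 periods smooths that out at the cost of reacting a little later.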

CloudWatch Logs Insights

Query language for log analysis:

fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

Common Queries

# Count errors per hour
fields @timestamp
| filter @message like /ERROR/
| stats count(*) by bin(1h)

# Find slow Lambda executions
fields @requestId, @duration
| filter @duration > 1000
| sort @duration desc

# Parse JSON logs
fields @timestamp
| parse @message '{"level":"*","message":"*"}' as level, msg
| filter level = "ERROR"
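
Queries like these can also be run programmatically. The sketch below uses the boto3 `start_query` and `get_query_results` calls and polls until the query finishes. `run_insights_query` is a hypothetical helper, and the polling interval and limit are arbitrary choices; the client is passed in so that a stub can stand in for `boto3.client("logs")` during testing.

```python
import time


def run_insights_query(logs_client, log_group, query, start, end,
                       poll_seconds=1.0, max_polls=60):
    """Start a Logs Insights query and poll until it completes.

    logs_client is expected to behave like boto3.client("logs").
    start and end are epoch timestamps in seconds.
    Returns one dict per result row, mapping field name to value.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=int(start),
        endTime=int(end),
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs.
            return [{cell["field"]: cell["value"] for cell in row}
                    for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")
```

A typical call passes `boto3.client("logs")`, a log group name, one of the queries above as a string, and a time window such as the last hour.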

Custom Metrics

Publishing Custom Metrics

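import boto3

# Uses the default credential chain and region from the environment.
cloudwatch = boto3.client('cloudwatch')

# Publish one data point; the namespace and dimension values are examples.
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{
        'MetricName': 'PageViews',
        'Value': 1,
        'Unit': 'Count',
        # Each distinct dimension combination is tracked as its own metric.
        'Dimensions': [
            {'Name': 'Page', 'Value': '/home'}
        ]
    }]
)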

High-Resolution Metrics

  • 1-second resolution
  • Higher cost
  • Use for time-sensitive monitoring
  • Short retention at full resolution (sub-minute data points are kept for about 3 hours, then only coarser aggregates remain)
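
A metric becomes high-resolution when its data points are published with `StorageResolution` set to 1. The sketch below builds such a data point for `put_metric_data`; the helper names and the `QueueDepth` example are hypothetical.

```python
def high_resolution_datum(name, value, unit="Count", dimensions=None):
    """Build one MetricData entry stored at 1-second resolution."""
    datum = {
        "MetricName": name,
        "Value": value,
        "Unit": unit,
        # 1 = high resolution; 60 = standard resolution (the default).
        "StorageResolution": 1,
    }
    if dimensions:
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in dimensions.items()
        ]
    return datum


def publish(namespace, data):
    """Send the data points to CloudWatch (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=data
    )
```

For example, `publish("MyApp", [high_resolution_datum("QueueDepth", 7, dimensions={"Queue": "orders"})])` would record one high-resolution sample.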

Common Interview Questions

  1. What's the difference between CloudWatch Metrics and Logs?

       • Metrics: Numeric time-series data, aggregated
       • Logs: Text-based event records, searchable
       • Metrics for monitoring, Logs for debugging

  2. How do you create an alarm for a custom metric?

       • Publish custom metric using PutMetricData API
       • Create alarm on the metric
       • Configure threshold, period, evaluation
       • Set alarm actions (SNS, Auto Scaling, etc.)

  3. What is a CloudWatch Agent?

       • Software agent for EC2/on-premises
       • Collects system metrics (memory, disk)
       • Collects logs
       • Sends to CloudWatch

  4. How do you reduce CloudWatch costs?

       • Set appropriate log retention
       • Use metric filters instead of custom metrics where possible
       • Reduce high-resolution metrics
       • Use Logs Insights instead of continuous queries
       • Archive old logs to S3

  5. What is Container Insights?

       • Collect metrics from ECS, EKS, Kubernetes
       • Pre-built dashboards
       • Performance monitoring for containers
       • Additional cost per container

CloudWatch Agent

Capabilities

  • Collect additional EC2 metrics (memory, disk)
  • Collect custom application logs
  • Works on-premises too
  • StatsD and collectd support

Configuration

{
  "metrics": {
    "metrics_collected": {
      "mem": {"measurement": ["mem_used_percent"]},
      "disk": {"measurement": ["used_percent"]}
    }
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [{
          "file_path": "/var/log/app.log",
          "log_group_name": "app-logs"
        }]
      }
    }
  }
}
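
When several log files need the same treatment, the configuration can be generated rather than written by hand. This is a sketch only: it reuses exactly the keys shown in the configuration above, and `build_agent_config` and `write_agent_config` are hypothetical helpers. Where the agent expects the file to live depends on the installation.

```python
import json


def build_agent_config(log_files):
    """Build an agent config dict.

    log_files maps a file path to the log group it should be sent to.
    """
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"]},
            }
        },
        "logs": {
            "logs_collected": {
                "files": {
                    # Sorted so the generated file is stable between runs.
                    "collect_list": [
                        {"file_path": path, "log_group_name": group}
                        for path, group in sorted(log_files.items())
                    ]
                }
            }
        },
    }


def write_agent_config(log_files, destination):
    """Write the config as JSON to the given path."""
    with open(destination, "w") as handle:
        json.dump(build_agent_config(log_files), handle, indent=2)
```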

Dashboards

Features

  • Custom visualizations
  • Multiple metric sources
  • Auto-refresh
  • Cross-account, cross-region
  • Shareable links

Widget Types

  • Line graphs
  • Stacked area
  • Number (single value)
  • Gauge
  • Text (markdown)
  • Query results (Logs Insights)
  • Alarm status

Alternatives

AWS Alternatives

Service            | When to Use Instead
X-Ray              | Distributed tracing
CloudTrail         | API audit logging
EventBridge        | Event routing (evolved from CW Events)
Managed Grafana    | Advanced visualization
Managed Prometheus | Prometheus-compatible metrics

External Alternatives

Provider   | Service
Datadog    | Full-stack monitoring
New Relic  | APM and monitoring
Splunk     | Log analytics
Grafana    | Visualization
Prometheus | Metrics collection

Best Practices

  1. Enable detailed monitoring - 1-minute metrics for critical resources
  2. Set log retention - Don't store logs forever
  3. Use structured logging - JSON format for easier querying
  4. Create dashboards - Operational visibility
  5. Use composite alarms - Reduce alert fatigue
  6. Configure alarm actions - Automate responses
  7. Use metric math - Combine metrics for insights
  8. Enable anomaly detection - Catch unusual patterns
  9. Use Contributor Insights - Top-N analysis
  10. Archive to S3 - Long-term log retention
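
As an example of metric math, the sketch below builds a `GetMetricData` request that combines two Lambda metrics into an error-rate percentage. The function name, the query ids, and the choice of `Sum` over 5-minute periods are assumptions for illustration.

```python
def error_rate_queries(function_name, period=300):
    """Build MetricDataQueries for a Lambda error rate (percent)."""

    def stat(query_id, metric_name):
        # An input metric; hidden from the results, used only by the expression.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": function_name}
                    ],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return [
        stat("errors", "Errors"),
        stat("invocations", "Invocations"),
        {
            "Id": "error_rate",
            # The expression refers to the other queries by their Id.
            "Expression": "100 * errors / invocations",
            "Label": "Error rate (%)",
            "ReturnData": True,
        },
    ]
```

The list can be passed as `MetricDataQueries` to `boto3.client("cloudwatch").get_metric_data(...)` together with `StartTime` and `EndTime`.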

Pricing Summary

Component          | Cost
Basic metrics      | Free (5-min, default AWS metrics)
Custom metrics     | $0.30/metric/month (first 10K)
High-resolution    | $0.30/metric/month
Dashboards         | $3/dashboard/month
Alarms             | $0.10/alarm/month (standard)
Log ingestion      | $0.50/GB
Log storage        | $0.03/GB/month
Logs Insights      | $0.005/GB scanned
Container Insights | Per container pricing