# Monitoring AgentOS on AWS

> View logs and monitor your AWS deployment

CloudWatch Logs captures output from your ECS containers.

## View Logs

Tail logs in real time:

```bash theme={null}
aws logs tail {infra_name}-prd-api --follow
```

Search recent logs:

```bash theme={null}
# --start-time is in milliseconds; this arithmetic works with both GNU and BSD/macOS date
aws logs filter-log-events \
  --log-group-name {infra_name}-prd-api \
  --filter-pattern "ERROR" \
  --start-time $(( ($(date +%s) - 3600) * 1000 ))
```

<Note>
  Replace `{infra_name}` with your `infra_name` from `settings.py` (e.g.,
  `agentos-aws-template`).
</Note>
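
The `--start-time` flag takes a Unix timestamp in milliseconds. To search a window other than one hour, a small helper can compute that value. This is a minimal sketch that relies only on `date +%s` and shell arithmetic, so it should behave the same with GNU and BSD/macOS `date`; the function name `ms_ago` is hypothetical:

```bash
# Hypothetical helper: print the Unix time N minutes ago, in milliseconds.
ms_ago() {
  local minutes="$1"
  echo $(( ( $(date +%s) - minutes * 60 ) * 1000 ))
}
```

For example, passing `--start-time "$(ms_ago 15)"` to `aws logs filter-log-events` limits the search to the last 15 minutes.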

## ECS Service Status

View service status and recent events:

```bash theme={null}
aws ecs describe-services \
  --cluster {infra_name}-prd \
  --services {infra_name}-prd-api-service \
  --query 'services[0].{status:status,running:runningCount,desired:desiredCount,events:events[:5]}'
```

List running tasks:

```bash theme={null}
aws ecs list-tasks --cluster {infra_name}-prd
```
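
Once a deployment settles, the `running` and `desired` counts reported above should match. Rather than polling by hand, the AWS CLI waiter `aws ecs wait services-stable` can block until the service is stable. A sketch, assuming the cluster and service names follow the pattern used in the commands above; the wrapper name `wait_for_api` is hypothetical:

```bash
# Hypothetical wrapper: block until the API service reaches a steady state.
# The waiter polls the service and exits non-zero if it does not
# stabilize within its built-in polling limit.
wait_for_api() {
  local infra_name="$1"
  aws ecs wait services-stable \
    --cluster "${infra_name}-prd" \
    --services "${infra_name}-prd-api-service"
}
```

Calling `wait_for_api agentos-aws-template && echo "stable"` after a deploy prints `stable` only when the waiter succeeds.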

## What Success Looks Like

After a successful deployment, logs show:

```
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
```

Health check passing:

```
INFO:     192.168.x.x - "GET /health HTTP/1.1" 200 OK
```
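
To check for the startup message without reading the log by hand, one option is to pipe recent log output through `grep`. A minimal sketch; the function name `startup_ok` is hypothetical, and the match string is taken from the Uvicorn output shown above:

```bash
# Hypothetical check: read log lines on stdin and succeed only if the
# Uvicorn startup message is present. -F treats the pattern as a fixed
# string so the trailing dot is matched literally.
startup_ok() {
  grep -qF "Application startup complete."
}
```

For example, `aws logs tail {infra_name}-prd-api --since 10m | startup_ok && echo "started"` prints `started` only if the message appeared in the last 10 minutes.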

## Warning Signs

| Log Pattern                 | Meaning                  | Action                    |
| --------------------------- | ------------------------ | ------------------------- |
| `database is locked`        | DuckDB concurrency issue | Reduce workers to 1       |
| `connection refused`        | Can't reach RDS          | Check security group      |
| `OOMKilled`                 | Out of memory            | Increase task memory      |
| `CannotPullContainerError`  | ECR auth expired         | Re-run `auth_ecr.sh`      |
| `SIGTERM` then restart loop | Health check failing     | Check app logs for errors |
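
The single-line patterns in this table can be encoded as a small triage filter. The following is a sketch, not part of AgentOS: the function name `triage` is hypothetical, and the patterns and actions are copied from the table. The `SIGTERM` restart loop is left out because it depends on a sequence of events rather than a single line.

```bash
# Hypothetical triage filter: read log lines on stdin and print the
# suggested action for each line that contains a known warning pattern.
# Matching is case-sensitive and uses the exact strings from the table.
triage() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *"database is locked"*)       echo "Reduce workers to 1" ;;
      *"connection refused"*)       echo "Check security group" ;;
      *"OOMKilled"*)                echo "Increase task memory" ;;
      *"CannotPullContainerError"*) echo "Re-run auth_ecr.sh" ;;
    esac
  done
}
```

Piping an hour of logs through it, for example `aws logs tail {infra_name}-prd-api --since 1h | triage | sort | uniq -c`, gives a count of each suggested action.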

## Health Checks

The load balancer checks `/health` every 30 seconds.

| Target Status | Meaning                    |
| ------------- | -------------------------- |
| healthy       | Task passing health checks |
| unhealthy     | Health check failing       |
| draining      | Task being replaced        |

If a target is unhealthy, check:

1. Container logs for startup errors
2. That the security group allows port 8000 from the ALB
3. Database connectivity (`DB_HOST`, `DB_PASS`)
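
Target status can also be read from the command line with `aws elbv2 describe-target-health`. A sketch, assuming you already know the target group ARN (it is not defined anywhere on this page); the function name `target_states` is hypothetical:

```bash
# Hypothetical helper: print the ID and health state of each registered target.
# The target group ARN is an assumption; it can be looked up with
# `aws elbv2 describe-target-groups`.
target_states() {
  local target_group_arn="$1"
  aws elbv2 describe-target-health \
    --target-group-arn "$target_group_arn" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State]' \
    --output text
}
```

Each output line pairs a target ID with a state such as `healthy`, `unhealthy`, or `draining`.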

## Log Retention

CloudWatch retains logs indefinitely by default. Set a retention policy to control costs:

```bash theme={null}
aws logs put-retention-policy \
  --log-group-name {infra_name}-prd-api \
  --retention-in-days 30
```

| Retention | Monthly Cost (10GB/day) |
| --------- | ----------------------- |
| 7 days    | \~\$3                   |
| 30 days   | \~\$15                  |
| 90 days   | \~\$45                  |
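
`put-retention-policy` prints nothing on success, so it can be useful to confirm the setting afterwards. A sketch using `aws logs describe-log-groups`, assuming the log group name follows the pattern used above; the function name `show_retention` is hypothetical:

```bash
# Hypothetical helper: print each matching log group with its retention in days.
show_retention() {
  local infra_name="$1"
  aws logs describe-log-groups \
    --log-group-name-prefix "${infra_name}-prd-api" \
    --query 'logGroups[].[logGroupName,retentionInDays]' \
    --output text
}
```

A log group with no retention policy reports `None` in place of a number, which means its logs never expire.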

## Alerts (Optional)

Create a CloudWatch alarm for task failures:

```bash theme={null}
aws cloudwatch put-metric-alarm \
  --alarm-name "{infra_name}-task-failures" \
  --metric-name "FailedTasks" \
  --namespace "AWS/ECS" \
  --statistic Sum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=ClusterName,Value={infra_name}-prd \
  --evaluation-periods 1 \
  --alarm-actions [YOUR_SNS_TOPIC_ARN]
```

See [AWS SNS documentation](https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html) to create a notification topic.
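
A topic and an email subscription can also be created from the CLI. The following is a sketch: the function name `create_alert_topic`, the topic name, and the email address are all hypothetical placeholders.

```bash
# Hypothetical helper: create an SNS topic, subscribe an email address,
# and print the topic ARN for use with --alarm-actions.
create_alert_topic() {
  local topic_name="$1" email="$2" topic_arn
  topic_arn=$(aws sns create-topic --name "$topic_name" \
    --query 'TopicArn' --output text) || return 1
  aws sns subscribe \
    --topic-arn "$topic_arn" \
    --protocol email \
    --notification-endpoint "$email" >/dev/null || return 1
  echo "$topic_arn"
}
```

For example, `create_alert_topic agentos-alerts ops@example.com` prints an ARN to pass to `--alarm-actions`. Email subscriptions must be confirmed from the link SNS sends before any notifications are delivered.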
