Monitoring

Google Cloud Operations suite

previously stackdriver
Integrated monitoring, logging, diagnostics
Manages across platforms
- Google Cloud and AWS
- Dynamic discovery of Google Cloud with smart defaults
- Open-source agents and integrations
Access to powerful data and analytics tools
Collaboration with third-party software

Cloud Monitoring

Dynamic config and intelligent defaults
Platform, system, and application metrics
- ingests data: Metrics, events, metadata
- Generates insights through the dashboard, charts, alerts
Uptime/health checks
Dashboards
Alerts

Custom metrics

Use the Cloud Monitoring API
Use OpenCensus
- Open-source library to help capture, manipulate, and export traces and metrics
- Low overhead and broadly supported
- Measure
  - represent a metric being recorded
- Measurement
  - data point recorded as a Measure
- View
  - represents the coupling of an Aggregation applied to a Measure and optionally Tags

Logs-based Metrics

Don't reinvent the wheel (Google has a curated list of over 1,000 predefined metrics, check there first)
Metric types
- Counter - counts the number of log entries
- Distribution - extracted log values in histogram buckets
  - Linear: buckets of fixed width
  - Exponential:
    - N+2 buckets
    - Upper:scale * (growthFactor ^ i)
    - Lower:scale * (growthFactor ^ (i-1))
  - Explicit: Array of bucket boundaries

OS Monitoring Agent

Support Compute Engine and AWS EC2
Based on the open-source collected
Gathers additional system resources and application metrics
Optional, but recommended
The additional support offered through BindPlane from Blue Medora

OS Logging Agent

Support third-party applications, such as
- Apache/Tomcat/Nginx
- Chef/Jenkins/Puppet
- Cassandra/MongoDB/MySQL
Based on fluentd log data collector (can add own fluentd configuration files)

Monitoring pyramid

Workspace

Determine your monitoring needs upfront
Consider using separate Workspaces for data and control isolation

Cloud Logging

Platform, systems, and application logs
- API to write to logs
- 30-day retention
Log search/view/filter
Log-based metrics
Monitoring alerts can be set on log events
Data can be exported to Cloud Storage, BigQuery, and Pub/Sub
Export to Cloud storage, BigQuery, and Pub/Sub
Data access logs are and Admin logs for 400 days
Logs
- Cloud Audit Logs
  - Admin activity
    - API calls modifying configuration or metadata
    - retained for 400 days
    - no charge
    - Always Enabled
  - Data Access
    - API calls that read metadata, and configuration or that create, modify, or read user-provided data
    - retained for 1-3650 days (30 default)
    - Data access by All Users or All Authenticated Users
    - Needs to be enabled
  - System Event
    - Record Google Cloud non-human admin actions that modify configurations
    - Part of Google's transparency
    - no charge
    - Retention is 400 days
    - Always Enabled
  - Access Transparency
    - How and Why customer data is accessed once it has been Stored in Google Cloud
- Agent Logs
- Network Logs
  - VPC flow
  - Firewall rules
  - NAT gateway
  - Load Balancer
Firewall Rule Logging

Log Router (Sink)

Logs from various sources reach Log Router
Checks against configured rules (What to ingest? What to discard? Where to route?)
Logs Storage (Logs buckets)
- _Required
  - Hold Admin activity, System Events & Access Transparency Logs (retained for 400 days)
  - ZERO charge
  - Can't change the retention period
- _Default
  - All other logs (retained for 30 days)
  - Can't DELETE but can disable log sink to disable ingestion
  - Can edit retention settings from 1 to 3650 days

Data Studio

Connect to BigQuery
Transform raw data into metrics and dimensions

Error Reporting

Error dashboard/notification
Counts analyze and aggregate the crashes in your running cloud services
Management interface displays the results with sorting and filtering capability
Centralized error management interface

Cloud Trace

Displays data/performance insights in near real-time
Latency reporting
Per-URL latency sampling
Automatically analyzes to generate in-depth latency reports
Support
- App Engine
- Google HTTP(S) load balancers
- Applications instrumented with the Cloud Trace SDKs

Cloud Debugger

Inspect an application without stopping it or slowing it down significantly
Capture call stack and local variables of a running application
Inject logging into a service without stopping it
Increased collaboration by sharing debug session
Debug snapshots, logpoints, conditional debugging
Integrations with popular IDEs

Cloud Profiler

Uses statistical techniques and extremely low-impact instrumentation that runs across all production application instances to provide a complete CPU and heap picture of an application
Support for Java, Go, Python, and NodeJS
Presents the call hierarchy and resource consumption of the relevant function in an interactive flame graph
Improve performance and reduce costs
Understand application call patterns

Service Monitoring

Understand and troubleshoot intra-service dependencies
Current support for App Engine, Anthos Service Mesh, and Istio
Know when you're meeting or breaking SLOs
Know when you have an error budget to spend
Select latency or availability metrics to act as SLIs
Use SLIs to easily create SLOs
Alerting easily integrated

Pricing

Free Allotment