Post

SRE Foundations to Production: Securing Postgres on K8s (Part 7)

Secure Postgres with Secrets, resolve DB hiccups, and confirm USE and RED metrics with Prometheus in a Flask app on Kubernetes.


SRE Foundations to Production: Securing Postgres on K8s (Part 7)

Thanks for joining me for the final part of this project. We began in Part 1 with a simple Flask app, setting up basic monitoring. Since then, we’ve built something robust: adding Grafana in Part 2, alerting and EC2 in Part 3, SQLite and Loki in Part 4, scaling in Part 5, and Postgres with load balancing in Part 6. Now we will shift from Docker to Minikube, secure Postgres and finalize metrics.

Step 1: Secure Postgres with Secrets

I started by replacing hardcoded credentials with Secrets. Using a .env file, I defined environment variables like POSTGRES_USER and POSTGRES_PASSWORD with secure values, then used these to set up postgres-secret and flask-app-db-secret. These Secrets provide the credentials to both Postgres and Flask’s DB_URL, keeping sensitive data out of the codebase—a core SRE security practice.

  • Takeaway: Secrets are always better than plaintext for production readiness.
  • Action: Move your DB credentials to Secrets using .env or a secrets manager.
1
2
3
$ kubectl get secret postgres-secret
NAME             TYPE     DATA   AGE
postgres-secret  Opaque   2      1h

Step 2: Reset Postgres Data

Old data in /data/pgdata on Minikube’s hostPath caused some trouble, like password mismatches and a missing tasks database. I cleared it out using minikube ssh and sudo rm -rf /data/pgdata, then redeployed Postgres. This gave me a fresh start, allowing the tasks database to be created with the correct credentials.

  • Takeaway: Persistent data can lead to issues, so clear it completely when configurations change.
  • Action: Check your storage paths and reset if you run into unexpected state.

Use kubectl logs on your DB pod to catch initialization skips or auth failures early.

1
2
3
4
5
6
7
8
$ kubectl logs postgres-7cb5b5c5f4-c7vsh
...
2025-04-03 02:59:35.369 UTC [1] LOG:  starting PostgreSQL 15.12 (Debian 15.12-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2025-04-03 02:59:35.369 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2025-04-03 02:59:35.369 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2025-04-03 02:59:35.372 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2025-04-03 02:59:35.377 UTC [66] LOG:  database system was shut down at 2025-04-03 02:59:35 UTC
2025-04-03 02:59:35.382 UTC [1] LOG:  database system is ready to accept connections

Step 3: Confirm Flask DB Connection

Flask ran into issues at first, throwing password authentication failed and Name or service not known errors. After resetting Postgres, the connection worked smoothly, and the tasks table was created. It took some patience and digging through logs to get there.

  • Takeaway: DB startup delays can disrupt apps, so retry logic or delays can make a difference.
  • Action: Add connection retries in your app and keep an eye on logs during startup.
1
2
3
4
$ kubectl logs flask-app-dbc674757-gmc2m
...
{"asctime": "2025-04-03 02:59:37,772", "levelname": "INFO", "message": "Connected to DB"}
{"asctime": "2025-04-03 02:59:37,773", "levelname": "INFO", "message": "Table created"}

Step 4: Verify USE and RED Metrics

With the database issues resolved, I checked <flask-ip>:8000/metrics and saw the metrics I needed: request_count, request_processing_seconds, error_count, cpu_usage_percent, and memory_usage_bytes. This covers USE (Utilization: CPU and memory; Saturation: request time; Errors: service-level) and RED (Rate, Errors, Duration) for Prometheus to scrape.

  • Takeaway: Metrics are a key tool for SRE, so make sure to expose them early.
  • Action: Add prometheus_client to your app and test the metrics endpoint with curl.

Test your metrics endpoint with curl before assuming Prometheus can see them, as port mismatches can happen. Expect to see USE/RED metrics like in the example output below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# HELP request_count_total Total number of requests
# TYPE request_count_total counter
request_count_total{endpoint="/tasks",method="GET"} 1.0

# HELP request_processing_seconds Time spent processing requests
# TYPE request_processing_seconds summary
request_processing_seconds_count 1.0
request_processing_seconds_sum 0.011934616999951686

# HELP error_count_total Total number of errors
# TYPE error_count_total counter
error_count_total 0.0

# HELP cpu_usage_percent CPU usage percentage
# TYPE cpu_usage_percent gauge
cpu_usage_percent 1.3

# HELP memory_usage_bytes Memory usage in bytes
# TYPE memory_usage_bytes gauge
memory_usage_bytes 1.619718144e+09

With the alerts locked in, I wanted to make sure they’re not just noise but something you can actually act on. So, I added runbooks to the repo’s wiki for both HighCPUUsage and HighErrorRate. I also linked them directly in prometheus-rules.yaml with runbook_url fields. These guides walk you through diagnosing and fixing the issues step-by-step, turning a blaring alert into a clear plan. It’s a small tweak that makes a big difference when things heat up in production.

Why This Matters for SRE

This last step brings our setup to a secure and observable state. Secrets keep sensitive data safe, a clean database ensures consistency, and full USE and RED metrics give us clear insights into performance and reliability. Looking back, we’ve combined this with monitoring through Prometheus and Grafana, alerting with Alertmanager, log aggregation using Loki, load testing with Locust, and chaos engineering on Kubernetes. Together, these pieces form a solid foundation that reflects core SRE principles.

Try It Yourself

If you’d like to try these SRE practices, here’s the playbook:

  1. Secure your database with Secrets from .env.
  2. Confirm your app connects to the database using logs.
  3. Expose USE and RED metrics and verify them with Prometheus.

Best Practices Recap

  • Secrets First: Never store credentials in code; use Kubernetes Secrets instead.
  • Metric Coverage: Track USE and RED for complete observability.

Looking Back and Moving Forward

This series has guided us from a basic Flask app to a fully observable, scalable, and secure system running on Kubernetes. We’ve explored the essentials of SRE, including monitoring, alerting, logging, scaling, chaos engineering, and security. As this journey concludes, I hope the principles we’ve covered will help you in your own projects. Keep experimenting, stay curious, and let SRE principles guide you toward building reliable systems.

This post is licensed under CC BY 4.0 by the author.