Journey to RDS Postgres Encryption-at-Rest with near-zero downtime.

Usman Shahid
5 min read · Nov 4, 2022

Hello Folks 👋🏻,

We’re here again to discuss another interesting topic: how to turn on encryption-at-rest for your new or existing AWS RDS Postgres instance.

You might be thinking: hey, isn’t this just an easy configuration change from the AWS console, or a simple Terraform variable flip from false to true? Believe me, it’s not that easy for an existing RDS instance.

So before we get into a deeper conversation, let’s start with some background.

What is Encryption-at-Rest?

Encryption at rest refers to protecting your data from unauthorized access by encrypting it while it is stored.

So in simple words, with encryption-at-rest the data in the underlying storage gets encrypted with either a customer-managed KMS key or an AWS-managed KMS key.

This doesn’t affect or change any behavior on the engine side, which means user access to the data remains the same and no changes are required on the client side.

Whenever a particular block of data is read from or written to the underlying storage, it gets decrypted or encrypted using the KMS key associated with the RDS instance. This means the encryption actually happens at the storage layer, which in AWS is handled through Elastic Block Store (EBS).

How can we enable Encryption-at-Rest for a new instance?

If you’re onboarding a new instance, you just need to enable the encryption property under the Encryption section in the console and choose your desired KMS key.

I prefer using a customer-managed KMS key because it gives you the flexibility to share the key with other accounts and to move database snapshots across accounts easily.
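If you prefer infrastructure-as-code over the console, the equivalent with boto3 looks roughly like this; a minimal sketch with placeholder names and a hypothetical key alias:

```python
import boto3

rds = boto3.client("rds")

# Create a new Postgres instance with encryption-at-rest enabled.
# All identifiers below are placeholders.
rds.create_db_instance(
    DBInstanceIdentifier="my-encrypted-postgres",
    Engine="postgres",
    DBInstanceClass="db.r5.large",
    AllocatedStorage=100,
    MasterUsername="postgres",
    MasterUserPassword="change-me",            # use Secrets Manager in real setups
    StorageEncrypted=True,                     # the switch this whole post is about
    KmsKeyId="alias/my-customer-managed-key",  # omit to fall back to the aws/rds key
)
```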

How can we enable Encryption-at-Rest for an existing instance?

So far everything was super easy and straightforward: just a few clicks and you’re done. For an existing instance, though, AWS doesn’t let you change this configuration on the fly.

They have some suggested steps:

  1. Take a downtime on the RDS instance. [~2–5 mins] 😳
  2. Take a snapshot of the RDS instance. [~10 mins to 2 hours] 😩
  3. Encrypt the snapshot with the KMS key. [~10 mins to 2 hours] 😩
  4. Restore the instance from the encrypted snapshot. [~2–5 mins] 😩
  5. Resume your traffic on the restored instance. [~2–5 mins] 😳
  6. Done. 😩

These emojis are the actual reactions I got from management when I explained what we got from AWS. 😁
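In boto3 terms, steps 2–4 above look roughly like this (a sketch with placeholder identifiers; the snapshot and copy waits are where the hours go). Note that you can’t encrypt a snapshot in place: you encrypt it by copying it with a KMS key.

```python
import boto3

rds = boto3.client("rds")

# Step 2: snapshot the (unencrypted) source instance.
rds.create_db_snapshot(
    DBInstanceIdentifier="my-postgres",
    DBSnapshotIdentifier="my-postgres-plain",
)
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier="my-postgres-plain")

# Step 3: copying the snapshot with a KMS key produces an encrypted copy.
rds.copy_db_snapshot(
    SourceDBSnapshotIdentifier="my-postgres-plain",
    TargetDBSnapshotIdentifier="my-postgres-encrypted",
    KmsKeyId="alias/my-customer-managed-key",
)
rds.get_waiter("db_snapshot_available").wait(DBSnapshotIdentifier="my-postgres-encrypted")

# Step 4: restore a new instance from the encrypted snapshot.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="my-postgres-new",
    DBSnapshotIdentifier="my-postgres-encrypted",
)
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="my-postgres-new")
```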

In a customer-driven company like Careem, this amount of downtime is close to a nightmare. We can barely take downtime in production, and even then for no more than 5–10 minutes.

So we had to come up with a strategy to reduce this downtime. We looked at multiple solutions, starting from publication/subscription (logical) replication, DMS, and so on. The findings could fill a separate blog post, but in short, we decided to move ahead with DMS (AWS Database Migration Service).

Strategy 1 💥

With DMS, we decided to keep the approach super simple at the beginning; the steps are as follows (a rough sketch of the DMS calls follows the list):

  1. Set the following parameter value in the RDS parameter group. [~1–2 mins]
    1a: rds.logical_replication = 1
  2. Take the latest snapshot of the instance.
  3. Encrypt the snapshot with the KMS key.
  4. Restore the snapshot to a new instance.
  5. Create a DMS task for all databases under the instance with change data capture (CDC) configuration.
  6. Resume the replication by starting the task.
  7. Switchover traffic to the new instance. [~2–5 mins]
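Steps 5 and 6 boil down to a couple of DMS API calls. A minimal boto3 sketch, with placeholder ARNs and endpoints assumed to already exist:

```python
import boto3
import json

dms = boto3.client("dms")

# Step 5: a CDC-only task that streams ongoing changes from the old
# instance to the restored, encrypted one. ARNs are placeholders.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="encrypt-migration-cdc",
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",
    TargetEndpointArn="arn:aws:dms:...:endpoint:target",
    ReplicationInstanceArn="arn:aws:dms:...:rep:instance",
    MigrationType="cdc",  # change data capture only, no full load
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

# Step 6: resume replication by starting the task.
dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```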

By following these steps, we reduced the downtime on the instance from 1–2 hours to just a few minutes.

It had a major drawback: once we resume replication at step 6, it only picks up newer records, which means you lose all the records written during the snapshot-encryption-restoration window. That is, again, a super scary situation for engineers to be in. This strategy is effective and easy if you have a read-heavy database and can pause writes during the encryption activity while the database remains available for reads.

Strategy 2 💥

Since we couldn’t afford downtime on writes either, we had to come up with another strategy, and we got there using the following steps (again, a sketch of the DMS task follows the list):

  1. Set the following parameter values in the RDS parameter group. [~1–2 mins]
    1a: rds.logical_replication = 1
    1b: session_replication_role = 'replica'
  2. Take the latest snapshot of the instance.
  3. Encrypt the snapshot with the KMS key.
  4. Restore the snapshot to a new instance.
  5. Create a DMS task for all databases under the instance with Full Load + CDC, after first preparing the target with either option 5a or 5b:
    5a: Truncate your destination database on the target instance.
    5b: Take a schema dump of the database, and restore it on the target instance.
  6. Verify the restored database on target and start the DMS task.
  7. The DMS will perform the Full Load and then move automatically to CDC.
  8. Switchover traffic to the new instance. [~2–5 mins]
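The DMS task itself differs from Strategy 1 mainly in its migration type and full-load settings. Another hedged sketch with placeholder ARNs; the TargetTablePrepMode setting can even handle the truncation of option 5a for you:

```python
import boto3
import json

dms = boto3.client("dms")

# Steps 5-7: one task that full-loads the data and then switches to CDC.
dms.create_replication_task(
    ReplicationTaskIdentifier="encrypt-migration-full-load-cdc",
    SourceEndpointArn="arn:aws:dms:...:endpoint:source",
    TargetEndpointArn="arn:aws:dms:...:endpoint:target",
    ReplicationInstanceArn="arn:aws:dms:...:rep:instance",
    MigrationType="full-load-and-cdc",  # full load first, then ongoing CDC
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
    ReplicationTaskSettings=json.dumps({
        # TRUNCATE_BEFORE_LOAD matches option 5a above; use DO_NOTHING
        # if you restored a clean schema dump instead (option 5b).
        "FullLoadSettings": {"TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD"},
    }),
)
```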

The steps in this approach are a little complicated but worth it. You might have noticed that in step 1 we set an additional parameter, session_replication_role = 'replica', which helps with the full load: it makes Postgres skip foreign key checks during the execution of the DMS task. The DMS task loads data table by table, which fails under default Postgres settings because of foreign key constraints.
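Both parameters from step 1 are ordinary parameter-group changes, so a minimal boto3 sketch could look like the one below, assuming the instance already uses a custom (non-default) parameter group with a hypothetical name. Note that rds.logical_replication is a static parameter (it needs an instance reboot), while session_replication_role is dynamic; in practice the former matters on the source (DMS reads its WAL for CDC) and the latter on the target (to skip the foreign key checks during full load):

```python
import boto3

rds = boto3.client("rds")

# The default parameter group can't be modified, so this assumes a
# custom one; "my-postgres-params" is a placeholder name.
rds.modify_db_parameter_group(
    DBParameterGroupName="my-postgres-params",
    Parameters=[
        {
            # Static parameter: takes effect after an instance reboot.
            "ParameterName": "rds.logical_replication",
            "ParameterValue": "1",
            "ApplyMethod": "pending-reboot",
        },
        {
            # Dynamic parameter: can be applied immediately.
            "ParameterName": "session_replication_role",
            "ParameterValue": "replica",
            "ApplyMethod": "immediate",
        },
    ],
)
```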

What else?

There is another strategy that we didn’t discuss in detail: native logical replication, with a publication created on the source, a subscription created on the encrypted target instance, and REPLICA IDENTITY FULL enabled on all the tables. Maybe that’s a topic for another day!
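Just to give a flavor of that approach, the moving parts would look something like this untested sketch, with placeholder hosts and a single example table:

```python
import psycopg2

# On the (unencrypted) source: make UPDATE/DELETE rows fully identifiable
# and publish all tables. Connection strings are placeholders.
src = psycopg2.connect("dbname=app host=source-host user=postgres")
src.autocommit = True
with src.cursor() as cur:
    cur.execute("ALTER TABLE orders REPLICA IDENTITY FULL;")  # repeat per table
    cur.execute("CREATE PUBLICATION encrypt_pub FOR ALL TABLES;")

# On the encrypted target: subscribe to the source publication.
# CREATE SUBSCRIPTION must run outside a transaction, hence autocommit.
tgt = psycopg2.connect("dbname=app host=target-host user=postgres")
tgt.autocommit = True
with tgt.cursor() as cur:
    cur.execute(
        "CREATE SUBSCRIPTION encrypt_sub "
        "CONNECTION 'host=source-host dbname=app user=postgres' "
        "PUBLICATION encrypt_pub;"
    )
```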

There is no easy way to do such activities when you’re working with production databases that have 24/7 traffic, a huge customer base, and very limited downtime slots.

In the future, I will share our journey for MySQL databases as well, which is quite different and really interesting too. Till then,

That’s all folks! ;)
