Terraform Merge: An Unpleasant Journey

Managing AWS resources with Terraform has been a significant part of my job recently. Our team used to organize Terraform templates by microservice, but we finally hit a limitation: hidden mutual dependencies made the templates too hard to work on without following a long operation manual, which cancels out many of the benefits Terraform brought us. Therefore, we decided to merge multiple templates into one, and I was in charge of this task.

I'll explain the background at the end, but first I'd like to focus on how I did the merge, because if you found this article, chances are you already have ten reasons backing your decision to merge multiple Terraform templates, and you want to avoid recreating your production resources. YMMV, but I hope you find something useful here.

All I can say is that it was an unpleasant journey.

At first, I thought it would be as simple as:

1. Merge the template code into a single workspace.
2. List all the resources from the old state files.
3. Run terraform import for each resource in the new workspace.

Code modification was easy. Listing all resources was not hard either (although I created a script to extract them). However, terraform import was not as easy as it seems.

A typical Terraform import workflow is usually:

1. Declare the resource block in your template.
2. Find out the resource's ID (e.g. from the AWS console or the old state file).
3. Run terraform import <address> <id>.
4. Run terraform plan and confirm there are no unexpected changes.

Sounds easy? But it won't work for all types of resources.

Import is a feature that must be implemented by the provider itself, so all of these situations exist:

- Resources that import cleanly with their id attribute.
- Resources that require a special, composite import ID.
- Resources whose import is incomplete and leaves false drifts behind.
- Resources that cannot be imported at all.

The ID Guessing Game

My approach was trial and error: assume every resource can be imported with the id attribute found in the Terraform state file, and if the import command complains, check the documentation to find out what should be used as the ID instead. Most resources work this way. There are just some exceptions. Some.
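That optimistic first pass can be sketched as follows (a hypothetical helper, assuming every instance's id attribute is a valid import ID, which, as noted, is not always true):

```javascript
// Generate one "terraform import" command per managed resource instance,
// optimistically using the instance's "id" attribute as the import ID.
function importCommands(state) {
    const cmds = [];
    for (const res of state.resources || []) {
        if (res.mode !== 'managed') continue; // data sources are re-read, not imported
        for (const inst of res.instances || []) {
            cmds.push(`terraform import '${res.type}.${res.name}' '${inst.attributes.id}'`);
        }
    }
    return cmds;
}
```

Whenever terraform import rejects one of these IDs, that resource type goes onto the exception list and gets special handling.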

For most resources I have encountered, the import ID is a simple combination of the resource's attributes. The case of aws_security_group_rule is especially complicated; I assume there are some legacy compatibility reasons behind it, so I am not here to judge. For your convenience, here is the JavaScript code I was using (not guaranteed to work for all combinations):

// attrs: the attributes property of that resource's instance from the Terraform state
function aws_security_group_rule(attrs) {
    let source = "";
    if (attrs.cidr_blocks && attrs.cidr_blocks.length !== 0) {
        source = attrs.cidr_blocks.join('_');
    } else if (attrs.source_security_group_id) {
        source = attrs.source_security_group_id;
    } else {
        console.error("CANNOT CONSTRUCT source OF aws_security_group_rule", attrs);
    }

    if (attrs.self === true) {
        source = `self_${source}`;
    }
    return `${attrs.security_group_id}_${attrs.type}_${attrs.protocol}_${attrs.from_port}_${attrs.to_port}_${source}`;
}

Secrets

Furthermore, I encountered some resources that cannot be imported by simply running terraform import:

- random_password and random_string from the random provider
- tls_private_key from the tls provider

As documented (links above), importing the result of the randomly generated strings may still trigger re-creation on the next plan. The reason is that Terraform keeps track of the random string's recipe (upper case, numbers, length, etc.) in the state file. import does not put those parameters into the state file when importing the value (because how do you know the recipe when you only have the dish, I mean, the result?), therefore Terraform may think the value does not match the recipe and try to generate a new password, which invalidates the old one. To avoid this, refer to the documentation above, or just edit the state file and copy-paste the recipe (everything under attributes) from the old state file.

Other than the tainted-value problem, you may also run into political problems when importing random_password. As described above, to import a random value, you have to specify the generated result, which is the actual secret password, on the command line, where it may stay in your shell history. This may sound like a security issue, but managing secrets in Terraform templates is already a security risk: the generated password is stored in the state file anyway, so anyone who can read the state file (i.e. anyone who can run plan) can retrieve the password already. So, in reality, this is not more dangerous per se. But for your own job security, check your company's security policy before doing this, whether or not it makes sense to you.

In addition to random values, if you have provisioned a tls_private_key in Terraform, you'll find that it cannot be imported either. It is arguable whether creating a private key in the template is a good idea in the first place, but if importing state is all you can do, the only solution here is, again, to copy-paste from the old state file.

False-Positive Drifts

Besides, some resources may trigger a replacement because their import support is incomplete:

The master_password of an RDS instance cannot be imported. If this password is assigned by a random_password, make sure the random password won't be re-created (see above).

The secret value of aws_secretsmanager_secret_version might be marked as "needs to change". In this case, if you are certain that the secret value itself won't change, Terraform is likely just trying to mark the secret_string attribute as sensitive. The change is internal only; it will not refresh the secret itself. To get rid of it, try manually editing the state file and adding the attribute to the sensitive_attributes array.
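In my state files (state format version 4) the mark looked roughly like the diff below; the exact shape may differ between Terraform versions, so the safest reference is the sensitive_attributes entry of a freshly created secret version in another state file:

```
             "sensitive_attributes": [
+              [
+                {
+                  "type": "get_attr",
+                  "value": "secret_string"
+                }
+              ]
             ],
```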

EC2 instances (aws_instance) may still trigger a replacement because Terraform believes network_interface should be recreated. I am not sure what condition triggers this behavior, but I worked around the issue by manually copying network_interface from the old state file over to the new one:

             "network_interface": [
+               {
+                 "delete_on_termination": false,
+                 "device_index": 0,
+                 "network_card_index": 0,
+                 "network_interface_id": "eni-XXXXXXXXXXXXXXXXX"
+               }
            ],

The Harmless Re-Creations

Some resources simply don't support terraform import, so terraform plan will schedule them as new resources. But the only case I have encountered, aws_acm_certificate_validation, is harmless to recreate, so just let it run.


Other than the above resource ID issues, here are some tricks that helped me:


How Did We End Up Here

Prior to this merger, our templates were organized by microservice. There is a 'base' template for the resources that all the microservices depend on (e.g. networking); then, for each microservice, we have a template that reads the terraform_remote_state produced by the base template and builds its own resources on top of it.

At first, we thought it would be a dependency tree with no circular dependencies. It later became clear that there are hidden dependencies managed outside Terraform's control, but in the beginning we thought the dependencies could easily be found by reading the code, and that we would be able to manage them. In practice, a microservice A wants to use resources from microservice B, while B requires resources created by A. Before we could refactor the templates and extract a dependency C that both A and B depend on, we had to roll out to production and move on to another project. Cross-workspace references via terraform_remote_state can easily end up in mutual dependencies that cannot be managed easily, and at some point we had to write a long operation guide describing the dependencies of each microservice, plus a step-by-step deployment guide to provision everything.

In short, starting with multiple Terraform templates was a premature optimization.

If you are about to adopt Terraform for your infrastructure management, and you are not sure whether to create one template or many, read the Terraform Recommended Practices, and keep everything together as much as you can, until you hit the borders of your organization's structure (e.g. a DNS zone that is not managed by your team). Someday down the road you'll find it necessary to split the mono-repo template into multiple ones, and by that time you'll have a much better idea of what should be separated. You'll probably have an SRE team supporting you by then, too. Avoid premature optimization at all costs.


Finally, this is not to say that Terraform should be avoided just because it isn't perfect. No tool is perfect. The benefits Terraform brings to our team have exceeded the overhead that comes with it. Personally, I prefer Terraform (and similar declarative, type-checked Infrastructure-as-Code tools) over, e.g., AWS CDK. For the foreseeable future, I believe it will remain an important part of my work.