Terraform Merge: An Unpleasant Journey

Managing AWS resources with Terraform has been a significant part of my job recently. Our team used to organize Terraform templates by microservice, but we finally hit a limitation: hidden mutual dependencies made the templates too hard to work on without following a long operation manual, which cancels out many of the benefits Terraform brought us. Therefore, we decided to merge multiple templates into one, and I was in charge of this task.

I'll explain the background at the end, but first I'd like to focus on how I did the merge, because if you found this article, chances are you already have ten reasons backing your decision to merge multiple Terraform templates, and you want to avoid recreating your production resources. YMMV, but I hope you find something useful here.

All I can say is that it was an unpleasant journey.

At first, I thought it would be as simple as:

1. Merge the template code into a single workspace.
2. List all the resources from the old state files.
3. Run terraform import for each resource in the new workspace.

Code modification was easy. Listing all resources was not hard either (although I created a script to extract them). However, terraform import was not as easy as it seems.

A typical Terraform import workflow is usually:

1. Declare the resource block in your template.
2. Find out the resource's ID (e.g. from the AWS console or the old state file).
3. Run terraform import <address> <id>.
4. Run terraform plan and confirm there are no unexpected changes.

Sounds easy? But it won't work for all types of resources.

Import is a feature that must be implemented by the provider itself, so all of these situations exist:

- Resources that import cleanly with their id attribute.
- Resources that require a special, composite import ID.
- Resources whose import is incomplete and leaves false drifts behind.
- Resources that cannot be imported at all.

The ID Guessing Game

My approach was trial and error: assume every resource can be imported with the id attribute found in the Terraform state file, and if the import command complains, check the documentation to find out what should be used as the ID instead. Most resources work this way. There are just some exceptions. Some.
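That optimistic first pass can be sketched as follows (a hypothetical helper, assuming every instance's id attribute is a valid import ID, which, as noted, is not always true):

```javascript
// Generate one "terraform import" command per managed resource instance,
// optimistically using the instance's "id" attribute as the import ID.
function importCommands(state) {
    const cmds = [];
    for (const res of state.resources || []) {
        if (res.mode !== 'managed') continue; // data sources are re-read, not imported
        for (const inst of res.instances || []) {
            cmds.push(`terraform import '${res.type}.${res.name}' '${inst.attributes.id}'`);
        }
    }
    return cmds;
}
```

Whenever terraform import rejects one of these IDs, that resource type goes onto the exception list and gets special handling.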

For most resources I have encountered, the import ID is a simple combination of the resource's attributes. The case of aws_security_group_rule is especially complicated; I assume there are some legacy compatibility reasons behind it, so I am not here to judge. For your convenience, here is the JavaScript code I was using (not guaranteed to work for all combinations):

// attrs: the attributes property of that resource's instance from the Terraform state
function aws_security_group_rule(attrs) {
    let source = "";
    if (attrs.cidr_blocks && attrs.cidr_blocks.length !== 0) {
        source = attrs.cidr_blocks.join('_');
    } else if (attrs.source_security_group_id) {
        source = attrs.source_security_group_id;
    } else {
        console.error("CANNOT CONSTRUCT source OF aws_security_group_rule", attrs);
    }

    if (attrs.self === true) {
        source = `self_${source}`;
    }
    return `${attrs.security_group_id}_${attrs.type}_${attrs.protocol}_${attrs.from_port}_${attrs.to_port}_${source}`;
}

Secrets

Furthermore, I encountered some resources that cannot be imported by simply running terraform import:

- random_password and random_string from the random provider
- tls_private_key from the tls provider

As documented (links above), importing the result of the randomly generated strings may still trigger re-creation on the next plan. The reason is that Terraform keeps track of the random string's recipe (upper case, numbers, length, etc.) in the state file. import does not put those parameters into the state file when importing the value (because how do you know the recipe when you only have the dish, I mean, the result?), therefore Terraform may think the value does not match the recipe and try to generate a new password, which invalidates the old one. To avoid this, refer to the documentation above, or just edit the state file and copy-paste the recipe (everything under attributes) from the old state file.

Other than the tainted-value problem, you may also run into political problems when importing random_password. As described above, to import a random value, you have to specify the generated result, which is the actual secret password, on the command line, where it may stay in your shell history. This may sound like a security issue, but managing secrets in Terraform templates is already a security risk: the generated password is stored in the state file anyway, so anyone who can read the state file (i.e. anyone who can run plan) can retrieve the password already. So, in reality, this is not more dangerous per se. But for your own job security, check your company's security policy before doing this, whether or not it makes sense to you.

In addition to random values, if you have provisioned a tls_private_key in Terraform, you'll find that it cannot be imported either. It is arguable whether creating a private key in the template is a good idea in the first place, but if importing state is all you can do, the only solution here is, again, to copy-paste from the old state file.

False-Positive Drifts

Besides, some resources may trigger a replacement because their import support is incomplete:

The master_password of an RDS instance cannot be imported. If this password is assigned by a random_password, make sure the random password won't be re-created (see above).

The secret value of aws_secretsmanager_secret_version might be marked as "needs to change". In this case, if you are certain that the secret value itself won't change, Terraform is likely just trying to mark the secret_string attribute as sensitive. The change is internal only; it will not refresh the secret itself. To get rid of it, try manually editing the state file and adding the attribute to the sensitive_attributes array.
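In my state files (state format version 4) the mark looked roughly like the diff below; the exact shape may differ between Terraform versions, so the safest reference is the sensitive_attributes entry of a freshly created secret version in another state file:

```
             "sensitive_attributes": [
+              [
+                {
+                  "type": "get_attr",
+                  "value": "secret_string"
+                }
+              ]
             ],
```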

EC2 instances (aws_instance) may still trigger a replacement because Terraform believes network_interface should be recreated. I am not sure what condition triggers this behavior, but I worked around the issue by manually copying network_interface from the old state file over to the new one:

             "network_interface": [
+               {
+                 "delete_on_termination": false,
+                 "device_index": 0,
+                 "network_card_index": 0,
+                 "network_interface_id": "eni-XXXXXXXXXXXXXXXXX"
+               }
            ],

The Harmless Re-Creations

Some resources simply don't support terraform import, so terraform plan will schedule them as new resources. But the only case I have encountered, aws_acm_certificate_validation, is harmless to recreate, so just let it run.


Other than the above resource ID issues, here are some tricks that helped me:


How Did We End Up Here

Prior to this merger, our templates were organized by microservice. There is a 'base' template for the resources that all the microservices depend on (e.g. networking); then, for each microservice, we have a template that reads the terraform_remote_state produced by the base template and builds its own resources on top of it.

At first, we thought it would be a dependency tree with no circular dependencies. It later became clear that there are hidden dependencies managed outside Terraform's control, but in the beginning we thought the dependencies could easily be found by reading the code, and that we would be able to manage them. In practice, a microservice A wants to use resources from microservice B, while B requires resources created by A. Before we could refactor the templates and extract a dependency C that both A and B depend on, we had to roll out to production and move on to another project. Cross-workspace references via terraform_remote_state can easily end up in mutual dependencies that cannot be managed easily, and at some point we had to write a long operation guide describing the dependencies of each microservice, plus a step-by-step deployment guide to provision everything.

In short, starting with multiple Terraform templates was a premature optimization.

If you are about to adopt Terraform for your infrastructure management, and you are not sure whether to create one template or many, read the Terraform Recommended Practices, and keep everything together as much as you can, until you hit the borders of your organization's structure (e.g. a DNS zone that is not managed by your team). Someday down the road you'll find it necessary to split the mono-repo template into multiple ones, and by that time you'll have a much better idea of what should be separated. You'll probably have an SRE team supporting you by then, too. Avoid premature optimization at all costs.


Finally, this is not to say that Terraform should be avoided just because it isn't perfect. No tool is perfect. The benefits Terraform brings to our team have exceeded the overhead that comes with it. Personally, I prefer Terraform (and similar declarative, type-checked Infrastructure-as-Code tools) over, e.g., AWS CDK. For the foreseeable future, I believe it will remain an important part of my work.