
Allow multiple Terraform instances to write to plugin_cache_dir concurrently #31964

Open
amkartashov opened this issue Oct 7, 2022 · 16 comments · May be fixed by #33479 or #36085
Labels
cli enhancement v1.3 Issues (primarily bugs) reported against v1.3 releases

Comments

@amkartashov

amkartashov commented Oct 7, 2022

Terraform Version

1.3.2

Terraform Configuration Files

plugin_cache_dir   = "$HOME/.terraform.d/plugin-cache"

Debug Output

no debug

Expected Behavior

Multiple terraform processes should work fine with the same plugin_cache_dir

Actual Behavior

Initializing provider plugins...

  • Reusing previous version of hashicorp/aws from the dependency lock file
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Reusing previous version of hashicorp/aws from the dependency lock file
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Installing datadog/datadog v3.16.0...

    │ Error: Failed to install provider from shared cache

    │ Error while importing hashicorp/aws v4.34.0 from the shared cache
    │ directory: the provider cache at .terraform/providers has a copy of
    │ registry.terraform.io/hashicorp/aws 4.34.0 that doesn't match any of the
    │ checksums recorded in the dependency lock file.

Steps to Reproduce

I'm using Terragrunt, which initializes multiple Terraform stacks at once.

Additional Context

A race condition between two terraform init processes happens when they try to install the same version of the same provider. The first terraform calls installFromHTTPURL, which downloads the archive to a temporary file with a random name, but it then calls installFromLocalArchive, which unpacks directly into the global plugin cache directory; this is where the race condition occurs.

  1. The target dir is set to globalCacheDir: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L470
  2. The Dir method InstallPackage is called here: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L482
  3. InstallPackage calls installFromHTTPURL here: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/dir_modify.go#L34
  4. installFromHTTPURL downloads the archive to a temporary file, so there is no possibility of a race condition here, good: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L56
  5. installFromHTTPURL then calls installFromLocalArchive: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L97
  6. installFromLocalArchive starts decompressing directly into the final path: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L128

So if another terraform init happens to see the half-unpacked plugin in the middle of step 6, it will use files that are not ready yet.
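To make the race window concrete, here is a heavily simplified Go sketch of the flow described in the steps above. It is not Terraform's actual code; the function name installToSharedCache and the inlined download/unzip logic are illustrative only.

```go
package sketch

import (
	"archive/zip"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// installToSharedCache is a simplified illustration (not Terraform's code) of
// the flow above: the download goes to a temporary file, but extraction
// writes straight into the shared cache directory.
func installToSharedCache(url, targetDir string) error {
	// Step 4: download to a uniquely named temporary file; no race here.
	tmp, err := os.CreateTemp("", "provider-*.zip")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(tmp, resp.Body); err != nil {
		return err
	}
	tmp.Close()

	// Step 6: unpack directly into the final cache path. Another concurrent
	// `terraform init` scanning targetDir during this loop can observe a
	// partially extracted package; this loop is the race window.
	zr, err := zip.OpenReader(tmp.Name())
	if err != nil {
		return err
	}
	defer zr.Close()
	for _, f := range zr.File {
		dst := filepath.Join(targetDir, f.Name)
		if f.FileInfo().IsDir() {
			if err := os.MkdirAll(dst, 0o755); err != nil {
				return err
			}
			continue
		}
		if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
			return err
		}
		src, err := f.Open()
		if err != nil {
			return err
		}
		out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, f.Mode())
		if err != nil {
			src.Close()
			return err
		}
		_, copyErr := io.Copy(out, src)
		out.Close()
		src.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}
```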

References

gruntwork-io/terragrunt#1875

@amkartashov amkartashov added bug new new issue not yet triaged labels Oct 7, 2022
@alisdair
Contributor

alisdair commented Oct 7, 2022

Thanks for the report. This behaviour is expected, and described in the plugin_cache_dir documentation:

> Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

I can't find an enhancement request describing the desired outcome of being able to use multiple instances of terraform init with a global plugin cache, so I'm retitling and using this one. Another maintainer may be able to link to an existing issue and close this.

@alisdair alisdair added enhancement cli v1.3 Issues (primarily bugs) reported against v1.3 releases and removed bug new new issue not yet triaged labels Oct 7, 2022
@alisdair alisdair changed the title Race condition between multiple terrafrom init while using global plugin_cache_dir. Allow multiple Terraform instances to write to plugin_cache_dir concurrently Oct 7, 2022
@amkartashov
Author

@alisdair thanks!

@apparentlymart
Contributor

If we do want to change something to make this concurrency-safe then I think a key requirement for us to navigate is ensuring we don't break anyone who currently has their cache directory on a network filesystem where e.g. filesystem-level locking may not be available or may not be reliable.

We didn't explicitly document that the plugin cache directory must be on a filesystem that is only visible to the current kernel, and I've seen questions online in the past which imply that people are already placing it on network filesystems. I therefore think that existing behavior is de facto covered by the v1.x Compatibility Promises: when our documentation is ambiguous about something, we typically favor keeping existing usage patterns working unless there's a strong reason to break them.

(I don't intend this to mean that we absolutely cannot add an additional restriction here, but if we wish to do that then I think we'll need to justify that the benefit outweighs the cost, and probably also provide a reasonable migration path or backward-compatibility mechanism for those who are relying on our current lack of global locking for the cache directory.)


Hopefully we can instead devise a solution which relies on the atomicity of certain filesystem operations rather than on explicit locking.

For example: perhaps Terraform could initially populate a directory named so that other Terraform processes won't find it, and then move it into its final location using an atomic move/rename system call. If the move/rename fails due to a conflicting directory of the same name, the process which saw the error can scan the directory that already exists, check whether it matches what it was trying to create, and treat that as a success if so.
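A minimal Go sketch of that move/rename idea, assuming the staging directory is created inside the cache directory so it lives on the same filesystem as the final location; addToCache, populate, and verify are illustrative names, not Terraform's actual API:

```go
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// addToCache stages a package into a hidden sibling directory and then
// publishes it with a single rename. populate extracts the package into the
// staging dir; verify checks an existing directory against expected checksums.
func addToCache(cacheDir, packageKey string, populate, verify func(dir string) error) error {
	finalDir := filepath.Join(cacheDir, packageKey)

	// Stage inside cacheDir so the rename below stays on one filesystem.
	// The ".staging-" prefix keeps other processes from treating this as a
	// real cache entry while it is being populated.
	staging, err := os.MkdirTemp(cacheDir, ".staging-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(staging) // no-op once the rename has succeeded

	if err := populate(staging); err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(finalDir), 0o755); err != nil {
		return err
	}

	// Publish atomically. If another process created finalDir first, the
	// rename fails; in that case verify the winner's copy and accept it.
	if err := os.Rename(staging, finalDir); err != nil {
		if _, statErr := os.Stat(finalDir); statErr == nil {
			return verify(finalDir)
		}
		return fmt.Errorf("publishing %s into cache: %w", packageKey, err)
	}
	return nil
}
```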

I recall that we originally didn't attempt this because it wasn't clear that we'd be able to provide the same guarantee across all platforms Terraform targets. In particular, I recall learning that Windows has different guarantees about the atomicity of rename operations than Unix-derived kernels typically do. For that reason I expect that a big part of the design effort for this issue will be to determine whether we can rely on some sort of atomic cache add on at least the primary supported OSes: Linux, macOS, and Windows.

The strategy does not necessarily need to be the same on all three platforms because the package directories in the cache are platform-specific anyway, so a process running on one platform should not observe a cache write from a process running on another platform. However, hybrid platforms like Microsoft's WSL might present unique problems if they end up imposing the filesystem guarantees of one platform on code that believes it's running on another.

@amkartashov
Author

@apparentlymart maybe it'll work without locking? Since installFromHTTPURL downloads to a temporary file, why can't installFromLocalArchive do the same? Unpack to a temporary location and then move it to the final path. I believe this won't break backwards compatibility, and other processes won't see a half-unpacked file.

@apparentlymart
Contributor

As far as I'm aware, none of the installation methods are truly atomic today: even if we first extract into a temporary directory and then copy into the final location, there is no way to atomically copy a whole directory tree and so there will still be a window of time where another concurrent process can observe a directory that exists but doesn't yet have all of the expected files inside it, or has partial content for one file. In both of those situations the observing process will calculate a checksum from the partial data and so reject the incomplete directory.

I think the goal/requirement here is that at any instant there is either a fully complete and correct package in the cache or no package at all. If we can make that true then we can achieve safe concurrent use without any need for locks.

@joe-a-t

joe-a-t commented Oct 18, 2022

Linking #25849, which has more history on this request.

@sftim

sftim commented Nov 21, 2022

On Linux (only), renameat2() supports a RENAME_NOREPLACE flag, which can help with atomicity surprises. Windows has MoveFileEx, where not overwriting existing files seems to be the default.

As the plugin cache is only a cache, failure to write any entry should not block continuing with the original operation.
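For illustration, here is a Linux-only Go sketch of publishing a fully staged package directory with RENAME_NOREPLACE via golang.org/x/sys/unix (renameat2 needs Linux 3.15+ and filesystem support); an equivalent Windows path could use MoveFileEx from golang.org/x/sys/windows, which refuses to replace an existing target unless MOVEFILE_REPLACE_EXISTING is passed. Function names here are illustrative.

```go
//go:build linux

package sketch

import (
	"golang.org/x/sys/unix"
)

// publishNoReplace atomically moves a fully staged package directory into its
// final location, refusing to replace anything already there. Both paths must
// be on the same filesystem. If the call fails with unix.EEXIST, another
// process has already published the package and the caller can simply verify
// and reuse that copy.
func publishNoReplace(stagingDir, finalDir string) error {
	return unix.Renameat2(unix.AT_FDCWD, stagingDir, unix.AT_FDCWD, finalDir, unix.RENAME_NOREPLACE)
}
```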

@grimm26

grimm26 commented Apr 27, 2023

@jbardin over in #32915 I encountered the "text file busy" issue and you pointed here for an upcoming fix.
It happened to me again, and this time I was able to investigate:

....
- Installing hashicorp/aws v4.64.0...
╷
│ Error: Failed to install provider
│ 
│ Error while installing hashicorp/aws v4.64.0: open
│ /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5:
│ text file busy
╵....
❯ lsof /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
COMMAND      PID     USER  FD   TYPE DEVICE  SIZE/OFF    NODE NAME
terraform 405002 mkeisler txt    REG  259,5 354156544 6486037 /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
....
❯ ps -fp 405002
UID          PID    PPID  C STIME TTY          TIME CMD
mkeisler  405002    2809  0 Apr26 ?        00:00:34 .terraform/providers/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5

So the short of it is that the provider executable itself is just sitting there spinning. I am using Terragrunt, FYI. The parent process is /lib/systemd/systemd --user (I'm on Ubuntu 22).

Edit: verified that this happens using straight Terraform without Terragrunt, on Terraform v1.4.4.

@ns-vpanfilov

We can work around the locking issue: copy files from a temporary location into the target directory and create a special marker file (e.g. active.true) once the copy is done. If the marker file is present, then we can use this directory as a cache source.

It is possible that more than one concurrent job will create duplicate provider cache directories; if we have duplicates, we can just pick the first occurrence and delete the remaining ones.

Also, we will need a process to periodically scan for incomplete cache directories and remove them.
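A rough Go sketch of that marker-file idea (active.true is the hypothetical sentinel from the comment above; how such a marker would interact with Terraform's directory checksums is deliberately ignored here):

```go
package sketch

import (
	"os"
	"path/filepath"
)

// markerName is the hypothetical "copy finished" sentinel file.
const markerName = "active.true"

// publishWithMarker copies or extracts a package into dir via populate and
// creates the marker file only after every other file is fully written, so
// readers never trust a half-written cache entry.
func publishWithMarker(dir string, populate func(string) error) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	if err := populate(dir); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, markerName), nil, 0o644)
}

// cacheEntryReady reports whether a cache directory may be used as a source,
// i.e. whether its marker file exists.
func cacheEntryReady(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, markerName))
	return err == nil
}
```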

@MaxymVlasov

MaxymVlasov commented Feb 13, 2024

Locks (as files or dirs) do not work.

  1. Just remember the state of the .terraform/ dir: which files exist and which do not.
  2. Init everything asynchronously.
  3. If something fails, remove the new stuff from .terraform/ and try once again.
  4. Repeat step 3 a few times before giving up.

Based on the pain of lock implementation in antonbabenko/pre-commit-terraform#620

Also, it just works more than 5 times quicker when you init 50+ dirs at once, compared to an implementation with a lock mechanism. A rough sketch of this retry strategy is shown below.
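For illustration, a Go sketch of that retry strategy as a wrapper around terraform init (the referenced pre-commit-terraform implementation is not written in Go; the function name and the top-level-entry snapshot are illustrative only):

```go
package sketch

import (
	"os"
	"os/exec"
	"path/filepath"
)

// initWithRetries sketches the wrapper strategy described above: remember
// what already exists under .terraform/, run `terraform init`, and on
// failure delete only the newly created entries before trying again.
func initWithRetries(workDir string, attempts int) error {
	tfDir := filepath.Join(workDir, ".terraform")

	// Step 1: snapshot which top-level entries existed before we started.
	existing := map[string]bool{}
	if entries, err := os.ReadDir(tfDir); err == nil {
		for _, e := range entries {
			existing[e.Name()] = true
		}
	}

	var err error
	for i := 0; i < attempts; i++ {
		// Step 2: run the init (callers can launch many of these concurrently).
		cmd := exec.Command("terraform", "init")
		cmd.Dir = workDir
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err = cmd.Run(); err == nil {
			return nil
		}

		// Step 3: remove anything init created this round, then retry.
		if entries, readErr := os.ReadDir(tfDir); readErr == nil {
			for _, e := range entries {
				if !existing[e.Name()] {
					os.RemoveAll(filepath.Join(tfDir, e.Name()))
				}
			}
		}
	}
	// Step 4: give up after the configured number of attempts.
	return err
}
```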

@isometry

isometry commented Mar 19, 2024

A possibly naive question, but why does terraform init with plugin_cache_dir try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a -force-update-provider flag (or similar)?

(We face the issue where an active terraform (plan|apply) will fail a concurrent terraform init when the two modules share a plugin_cache_dir and provider)
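A sketch of the check this comment asks about, assuming a cached package directory can be hashed with Go's dirhash package in the same way Terraform computes its "h1:" lock-file hashes (an assumption worth verifying against internal/getproviders before relying on it):

```go
package sketch

import (
	"golang.org/x/mod/sumdb/dirhash"
)

// cachedPackageMatchesLock reports whether an already cached provider
// directory matches one of the "h1:" hashes recorded in the dependency lock
// file, in which case reinstallation could be skipped instead of overwriting.
func cachedPackageMatchesLock(cacheDir string, lockedHashes []string) (bool, error) {
	// Assumes the "h1:" scheme corresponds to dirhash.Hash1 over the package dir.
	got, err := dirhash.HashDir(cacheDir, "", dirhash.Hash1)
	if err != nil {
		return false, err
	}
	for _, want := range lockedHashes {
		if want == got {
			return true, nil
		}
	}
	return false, nil
}
```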

@ranesagar

> A possibly naive question, but why does terraform init with plugin_cache_dir try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a -force-update-provider flag (or similar)?
>
> (We face the issue where an active terraform (plan|apply) will fail a concurrent terraform init when the two modules share a plugin_cache_dir and provider)

I have the same question. Seems like an odd behavior to overwrite a binary that was already "cached". Kind of defeats the purpose of the cache in the first place.

@joaocc
Contributor

joaocc commented Mar 27, 2024

Not to mention that re-downloading existing providers (vs. checking their checksums against the desired ones) is probably costing Terraform (or GitHub) a non-insignificant amount of traffic!
For reference, terraform providers lock from Terragrunt (in a non-trivial scenario) takes 15m to process 21 providers (for 5 platforms).
It downloads some of them twice, probably due to parallel processing.
If we wanted to do this for the whole tree, I would estimate some 2h just to refresh the cache and/or update the providers lock; it seems to take the same time regardless of whether the providers are in the cache or not.
Would really love to see this fixed in some manner :)

@MaxymVlasov

> If we wanted to do this for the whole tree, I would estimate some 2h just to refresh the cache and/or update the providers lock; it seems to take the same time regardless of whether the providers are in the cache or not.
> Would really love to see this fixed in some manner :)

That's already fixed by #34632; it will be GA when 1.8.0 is released. For now it exists only in pre-releases, where it has been for 2 months.

@wosiu

wosiu commented Nov 19, 2024

From what I can see, #33479 has been approved for months now, and from what I understand it is waiting for some comments from core maintainers to be addressed.

@crw
Collaborator

crw commented Nov 19, 2024

@wosiu regarding core Terraform (not including the code for the various backends), only official maintainer approval matters. Nothing has otherwise changed with the status of that PR or the reason why it is in stasis. I am not sure why GitHub allows drive-by PR approvals, although yesterday there was a big permissions overhaul, so it may no longer be possible. Sorry for the confusion. If the person is not in the HashiCorp organization (as indicated by GitHub), the approval is not meaningful.

@pietrodn pietrodn linked a pull request Nov 23, 2024 that will close this issue