
Allow multiple Terraform instances to write to plugin_cache_dir concurrently #31964

Open
amkartashov opened this issue Oct 7, 2022 · 16 comments · May be fixed by #33479 or #36085
Labels
cli enhancement v1.3 Issues (primarily bugs) reported against v1.3 releases

Comments

@amkartashov

amkartashov commented Oct 7, 2022

Terraform Version

1.3.2

Terraform Configuration Files

plugin_cache_dir   = "$HOME/.terraform.d/plugin-cache"

Debug Output

no debug

Expected Behavior

Multiple terraform processes should work fine with the same plugin_cache_dir

Actual Behavior

Initializing provider plugins...

  • Reusing previous version of hashicorp/aws from the dependency lock file
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Reusing previous version of hashicorp/aws from the dependency lock file
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Using hashicorp/aws v4.34.0 from the shared cache directory
  • Installing datadog/datadog v3.16.0...

    │ Error: Failed to install provider from shared cache

    │ Error while importing hashicorp/aws v4.34.0 from the shared cache
    │ directory: the provider cache at .terraform/providers has a copy of
    │ registry.terraform.io/hashicorp/aws 4.34.0 that doesn't match any of the
    │ checksums recorded in the dependency lock file.

Steps to Reproduce

I'm using Terragrunt, which initializes multiple Terraform stacks at once.

Additional Context

A race condition between two terraform init processes happens when they try to install the same version of the same provider. The first terraform calls installFromHTTPURL, which downloads the archive to a temporary file with a random name, but it then calls installFromLocalArchive, which unpacks directly into the global plugin cache directory; this is where the race condition occurs.

  1. The target dir is set to globalCacheDir: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L470
  2. The Dir method InstallPackage is called here: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L482
  3. InstallPackage calls installFromHTTPURL here: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/dir_modify.go#L34
  4. installFromHTTPURL downloads the archive to a temporary file, so there is no possibility of a race condition here, good: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L56
  5. installFromHTTPURL then calls installFromLocalArchive: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L97
  6. installFromLocalArchive starts decompressing directly into the final path: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L128

So if another terraform init happens to see the half-unpacked plugin in the middle of step 6, it will use files that are not ready yet.
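To make the race window concrete, here is a heavily simplified Go sketch of the flow described in the steps above. It is not Terraform's actual code; the function name installToSharedCache and the inlined download/unzip logic are illustrative only.

```go
package sketch

import (
	"archive/zip"
	"io"
	"net/http"
	"os"
	"path/filepath"
)

// installToSharedCache is a simplified illustration (not Terraform's code) of
// the flow above: the download goes to a temporary file, but extraction
// writes straight into the shared cache directory.
func installToSharedCache(url, targetDir string) error {
	// Step 4: download to a uniquely named temporary file; no race here.
	tmp, err := os.CreateTemp("", "provider-*.zip")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name())
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if _, err := io.Copy(tmp, resp.Body); err != nil {
		return err
	}
	tmp.Close()

	// Step 6: unpack directly into the final cache path. Another concurrent
	// `terraform init` scanning targetDir during this loop can observe a
	// partially extracted package; this loop is the race window.
	zr, err := zip.OpenReader(tmp.Name())
	if err != nil {
		return err
	}
	defer zr.Close()
	for _, f := range zr.File {
		dst := filepath.Join(targetDir, f.Name)
		if f.FileInfo().IsDir() {
			if err := os.MkdirAll(dst, 0o755); err != nil {
				return err
			}
			continue
		}
		if err := os.MkdirAll(filepath.Dir(dst), 0o755); err != nil {
			return err
		}
		src, err := f.Open()
		if err != nil {
			return err
		}
		out, err := os.OpenFile(dst, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, f.Mode())
		if err != nil {
			src.Close()
			return err
		}
		_, copyErr := io.Copy(out, src)
		out.Close()
		src.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}
```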

References

gruntwork-io/terragrunt#1875

@amkartashov amkartashov added bug new new issue not yet triaged labels Oct 7, 2022
@alisdair
Contributor

alisdair commented Oct 7, 2022

Thanks for the report. This behaviour is expected, and described in the plugin_cache_dir documentation:

> Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

I can't find an enhancement request describing the desired outcome of being able to use multiple instances of terraform init with a global plugin cache, so I'm retitling and using this one. Another maintainer may be able to link to an existing issue and close this.

@alisdair alisdair added enhancement cli v1.3 Issues (primarily bugs) reported against v1.3 releases and removed bug new new issue not yet triaged labels Oct 7, 2022
@alisdair alisdair changed the title Race condition between multiple terrafrom init while using global plugin_cache_dir. Allow multiple Terraform instances to write to plugin_cache_dir concurrently Oct 7, 2022
@amkartashov
Author

@alisdair thanks!

@apparentlymart
Contributor

If we do want to change something to make this concurrency-safe then I think a key requirement for us to navigate is ensuring we don't break anyone who currently has their cache directory on a network filesystem where e.g. filesystem-level locking may not be available or may not be reliable.

We didn't explicitly document that the plugin cache directory must be on a filesystem that is only visible to the current kernel, and I've seen questions online in the past which imply that people are already placing it on network filesystems. I therefore think that existing behavior is de facto covered by the v1.x Compatibility Promises: when our documentation is ambiguous about something, we typically favor keeping existing usage patterns working unless there's a strong reason to break them.

(I don't intend this to mean that we absolutely cannot add an additional restriction here, but if we wish to do that then I think we'll need to justify that the benefit outweighs the cost, and probably also provide a reasonable migration path or backward-compatibility mechanism for those who are relying on our current lack of global locking for the cache directory.)


Hopefully we can instead devise a solution which relies on the atomicity of certain filesystem operations rather than on explicit locking.

For example: perhaps Terraform could initially populate a directory named so that other Terraform processes won't find it, and then move it into its final location using an atomic move/rename system call. If the move/rename fails due to a conflicting directory of the same name, the process which saw the error can scan the directory that already exists, check whether it matches what it was trying to create, and treat that as a success if so.
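A minimal Go sketch of that move/rename idea, assuming the staging directory is created inside the cache directory so it lives on the same filesystem as the final location; addToCache, populate, and verify are illustrative names, not Terraform's actual API:

```go
package sketch

import (
	"fmt"
	"os"
	"path/filepath"
)

// addToCache stages a package into a hidden sibling directory and then
// publishes it with a single rename. populate extracts the package into the
// staging dir; verify checks an existing directory against expected checksums.
func addToCache(cacheDir, packageKey string, populate, verify func(dir string) error) error {
	finalDir := filepath.Join(cacheDir, packageKey)

	// Stage inside cacheDir so the rename below stays on one filesystem.
	// The ".staging-" prefix keeps other processes from treating this as a
	// real cache entry while it is being populated.
	staging, err := os.MkdirTemp(cacheDir, ".staging-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(staging) // no-op once the rename has succeeded

	if err := populate(staging); err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(finalDir), 0o755); err != nil {
		return err
	}

	// Publish atomically. If another process created finalDir first, the
	// rename fails; in that case verify the winner's copy and accept it.
	if err := os.Rename(staging, finalDir); err != nil {
		if _, statErr := os.Stat(finalDir); statErr == nil {
			return verify(finalDir)
		}
		return fmt.Errorf("publishing %s into cache: %w", packageKey, err)
	}
	return nil
}
```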

I recall that we originally didn't attempt this because it wasn't clear that we'd be able to provide the same guarantee across all platforms Terraform targets. In particular, I recall learning that Windows has different guarantees about the atomicity of rename operations than Unix-derived kernels typically do. For that reason I expect that a big part of the design effort for this issue will be to determine whether we can rely on some sort of atomic cache add on at least the primary supported OSes: Linux, macOS, and Windows.

The strategy does not necessarily need to be the same on all three platforms because the package directories in the cache are platform-specific anyway, so a process running on one platform should not observe a cache write from a process running on another platform. However, hybrid platforms like Microsoft's WSL might present unique problems if they end up imposing the filesystem guarantees of one platform on code that believes it's running on another.

@amkartashov
Author

@apparentlymart maybe it'll work without locking? Since installFromHTTPURL downloads to a temporary file, why can't installFromLocalArchive do the same? Unpack to a temporary location and then move it to the final path. I believe this won't break backwards compatibility, and other processes won't see a half-unpacked file.

@apparentlymart
Contributor

As far as I'm aware, none of the installation methods are truly atomic today: even if we first extract into a temporary directory and then copy into the final location, there is no way to atomically copy a whole directory tree and so there will still be a window of time where another concurrent process can observe a directory that exists but doesn't yet have all of the expected files inside it, or has partial content for one file. In both of those situations the observing process will calculate a checksum from the partial data and so reject the incomplete directory.

I think the goal/requirement here is that at any instant there is either a fully complete and correct package in the cache or no package at all. If we can make that true then we can achieve safe concurrent use without any need for locks.

@joe-a-t

joe-a-t commented Oct 18, 2022

Linking #25849, which has more history on this request.

@sftim

sftim commented Nov 21, 2022

On Linux (only), renameat2() supports a RENAME_NOREPLACE flag, which can help with atomicity surprises. Windows has MoveFileEx, where not overwriting existing files seems to be the default.

As the plugin cache is only a cache, failure to write any entry should not block continuing with the original operation.
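For illustration, here is a Linux-only Go sketch of publishing a fully staged package directory with RENAME_NOREPLACE via golang.org/x/sys/unix (renameat2 needs Linux 3.15+ and filesystem support); an equivalent Windows path could use MoveFileEx from golang.org/x/sys/windows, which refuses to replace an existing target unless MOVEFILE_REPLACE_EXISTING is passed. Function names here are illustrative.

```go
//go:build linux

package sketch

import (
	"golang.org/x/sys/unix"
)

// publishNoReplace atomically moves a fully staged package directory into its
// final location, refusing to replace anything already there. Both paths must
// be on the same filesystem. If the call fails with unix.EEXIST, another
// process has already published the package and the caller can simply verify
// and reuse that copy.
func publishNoReplace(stagingDir, finalDir string) error {
	return unix.Renameat2(unix.AT_FDCWD, stagingDir, unix.AT_FDCWD, finalDir, unix.RENAME_NOREPLACE)
}
```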

@grimm26

grimm26 commented Apr 27, 2023

@jbardin over in #32915 I encountered the "text file busy" issue and you pointed here for an upcoming fix.
It happened to me again, and this time I was able to investigate:

....
- Installing hashicorp/aws v4.64.0...
╷
│ Error: Failed to install provider
│ 
│ Error while installing hashicorp/aws v4.64.0: open
│ /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5:
│ text file busy
╵....
❯ lsof /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
COMMAND      PID     USER  FD   TYPE DEVICE  SIZE/OFF    NODE NAME
terraform 405002 mkeisler txt    REG  259,5 354156544 6486037 /home/mkeisler/.terraform.d/plugin-cache/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5
....
❯ ps -fp 405002
UID          PID    PPID  C STIME TTY          TIME CMD
mkeisler  405002    2809  0 Apr26 ?        00:00:34 .terraform/providers/registry.terraform.io/hashicorp/aws/4.64.0/linux_amd64/terraform-provider-aws_v4.64.0_x5

So the short of it is that the provider executable itself is just sitting there spinning. I am using Terragrunt, FYI. The parent process is /lib/systemd/systemd --user (I'm on Ubuntu 22).

Edit: verified that this happens using straight Terraform without Terragrunt, on Terraform v1.4.4.

@ns-vpanfilov

We can work around the locking issue: copy files from a temporary location into the target directory and create a special marker file (e.g. active.true) once the copy is done. If the marker file is present, then we can use this directory as a cache source.

It is possible that more than one concurrent job will create duplicate provider cache directories; if we have duplicates, we can just pick the first occurrence and delete the remaining ones.

Also, we will need a process to periodically scan for incomplete cache directories and remove them.
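A rough Go sketch of that marker-file idea (active.true is the hypothetical sentinel from the comment above; how such a marker would interact with Terraform's directory checksums is deliberately ignored here):

```go
package sketch

import (
	"os"
	"path/filepath"
)

// markerName is the hypothetical "copy finished" sentinel file.
const markerName = "active.true"

// publishWithMarker copies or extracts a package into dir via populate and
// creates the marker file only after every other file is fully written, so
// readers never trust a half-written cache entry.
func publishWithMarker(dir string, populate func(string) error) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	if err := populate(dir); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, markerName), nil, 0o644)
}

// cacheEntryReady reports whether a cache directory may be used as a source,
// i.e. whether its marker file exists.
func cacheEntryReady(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, markerName))
	return err == nil
}
```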

@MaxymVlasov

MaxymVlasov commented Feb 13, 2024

Locks (as files or dirs) do not work.

  1. Just remember the state of the .terraform/ dir: which files exist and which do not.
  2. Init everything asynchronously.
  3. If something fails, remove the new stuff from .terraform/ and try once again.
  4. Repeat step 3 a few times before giving up.

Based on the pain of lock implementation in antonbabenko/pre-commit-terraform#620

Also, it just works more than 5 times quicker when you init 50+ dirs at once, compared to an implementation with a lock mechanism. A rough sketch of this retry strategy is shown below.
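For illustration, a Go sketch of that retry strategy as a wrapper around terraform init (the referenced pre-commit-terraform implementation is not written in Go; the function name and the top-level-entry snapshot are illustrative only):

```go
package sketch

import (
	"os"
	"os/exec"
	"path/filepath"
)

// initWithRetries sketches the wrapper strategy described above: remember
// what already exists under .terraform/, run `terraform init`, and on
// failure delete only the newly created entries before trying again.
func initWithRetries(workDir string, attempts int) error {
	tfDir := filepath.Join(workDir, ".terraform")

	// Step 1: snapshot which top-level entries existed before we started.
	existing := map[string]bool{}
	if entries, err := os.ReadDir(tfDir); err == nil {
		for _, e := range entries {
			existing[e.Name()] = true
		}
	}

	var err error
	for i := 0; i < attempts; i++ {
		// Step 2: run the init (callers can launch many of these concurrently).
		cmd := exec.Command("terraform", "init")
		cmd.Dir = workDir
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err = cmd.Run(); err == nil {
			return nil
		}

		// Step 3: remove anything init created this round, then retry.
		if entries, readErr := os.ReadDir(tfDir); readErr == nil {
			for _, e := range entries {
				if !existing[e.Name()] {
					os.RemoveAll(filepath.Join(tfDir, e.Name()))
				}
			}
		}
	}
	// Step 4: give up after the configured number of attempts.
	return err
}
```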

@isometry

isometry commented Mar 19, 2024

A possibly naive question, but why does terraform init with plugin_cache_dir try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a -force-update-provider flag (or similar)?

(We face the issue where an active terraform (plan|apply) will fail a concurrent terraform init when the two modules share a plugin_cache_dir and provider)
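A sketch of the check this comment asks about, assuming a cached package directory can be hashed with Go's dirhash package in the same way Terraform computes its "h1:" lock-file hashes (an assumption worth verifying against internal/getproviders before relying on it):

```go
package sketch

import (
	"golang.org/x/mod/sumdb/dirhash"
)

// cachedPackageMatchesLock reports whether an already cached provider
// directory matches one of the "h1:" hashes recorded in the dependency lock
// file, in which case reinstallation could be skipped instead of overwriting.
func cachedPackageMatchesLock(cacheDir string, lockedHashes []string) (bool, error) {
	// Assumes the "h1:" scheme corresponds to dirhash.Hash1 over the package dir.
	got, err := dirhash.HashDir(cacheDir, "", dirhash.Hash1)
	if err != nil {
		return false, err
	}
	for _, want := range lockedHashes {
		if want == got {
			return true, nil
		}
	}
	return false, nil
}
```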

@ranesagar

> A possibly naive question, but why does terraform init with plugin_cache_dir try to overwrite an existing provider installation rather than identifying that the provider is already present (i.e. stat, checksum/validate signature of existing binary)? If version/platform weren't already in the path, sure, but shouldn't the existing forced-overwrite behaviour be gated behind a -force-update-provider flag (or similar)?
>
> (We face the issue where an active terraform (plan|apply) will fail a concurrent terraform init when the two modules share a plugin_cache_dir and provider)

I have the same question. Seems like an odd behavior to overwrite a binary that was already "cached". Kind of defeats the purpose of the cache in the first place.

@joaocc
Contributor

joaocc commented Mar 27, 2024

Not to mention that re-downloading existing providers (vs. checking their checksums against the desired ones) is probably costing Terraform (or GitHub) a non-insignificant amount of traffic!
For reference, terraform providers lock from Terragrunt (in a non-trivial scenario) takes 15m to process 21 providers (for 5 platforms).
It downloads some of them twice, probably due to parallel processing.
If we wanted to do this for the whole tree, I would estimate some 2h just to refresh the cache and/or update the providers lock; it seems to take the same time regardless of whether the providers are in the cache or not.
Would really love to see this fixed in some manner :)

@MaxymVlasov

> If we wanted to do this for the whole tree, I would estimate some 2h just to refresh the cache and/or update the providers lock; it seems to take the same time regardless of whether the providers are in the cache or not.
> Would really love to see this fixed in some manner :)

That's already fixed by #34632; it will be GA when 1.8.0 is released. For now it exists only in pre-releases, where it has been for 2 months.

@wosiu

wosiu commented Nov 19, 2024

From what I can see, #33479 has been approved for months now, and from what I understand it is waiting for some comments from core maintainers to be addressed.

@crw
Collaborator

crw commented Nov 19, 2024

@wosiu regarding core Terraform (not including the code for the various backends), only official maintainer approval matters. Nothing has otherwise changed with the status of that PR or the reason why it is in stasis. I am not sure why GitHub allows drive-by PR approvals, although yesterday there was a big permissions overhaul, so it may no longer be possible. Sorry for the confusion. If the person is not in the HashiCorp organization (as indicated by GitHub), the approval is not meaningful.

@pietrodn pietrodn linked a pull request Nov 23, 2024 that will close this issue