Allow multiple Terraform instances to write to plugin_cache_dir concurrently #31964
Thanks for the report. This behaviour is expected, and described in the plugin_cache_dir documentation. I can't find an enhancement request describing the desired outcome of being able to run multiple instances of terraform init concurrently while using a global plugin_cache_dir.
@alisdair thanks!
If we do want to change something to make this concurrency-safe then I think a key requirement for us to navigate is ensuring we don't break anyone who currently has their cache directory on a network filesystem where e.g. filesystem-level locking may not be available or may not be reliable. We didn't explicitly document that the plugin cache directory must be on a filesystem that is only visible to the current kernel, and I've seen questions online in the past which imply that people are already doing that, so I think that existing behavior is de-facto covered by the v1.x Compatibility Promises, because when our documentation was ambiguous about something we typically favor keeping existing usage patterns working unless there's a strong reason to break them. (I don't intend this to mean that we absolutely cannot add an additional restriction here, but if we wish to do that then I think we'll need to justify that the benefit outweighs the cost, and probably also provide a reasonable migration path or backward-compatibility mechanism for those who are relying on our current lack of global locking for the cache directory.)

Hopefully we can instead devise a solution which relies on the atomicity of certain filesystem operations rather than on explicit locking. For example: perhaps Terraform could initially populate a directory named so that other Terraform processes won't find it, and then move it into its final location using an atomic move/rename system call. If the move/rename fails due to a conflicting directory of the same name, then the process which saw the error can scan the directory that already exists, check whether it matches what it was trying to create, and treat that as a success if so.

I recall that we originally didn't attempt this because it wasn't clear that we'd be able to provide the same guarantee across all platforms Terraform targets. In particular, I recall learning that Windows has different guarantees about the atomicity of rename operations than Unix-derived kernels typically do. For that reason I expect that a big part of the design effort for this issue will be to determine whether we can rely on some sort of atomic cache add on at least the primary supported OSes: Linux, macOS, and Windows.

The strategy does not necessarily need to be the same on all three platforms, because the package directories in the cache are platform-specific anyway, so a process running on one platform should not observe a cache write from a process running on another platform. However, hybrid platforms like Microsoft's WSL might present unique problems if they end up imposing the filesystem guarantees of one platform on code that believes it's running on another platform.
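A minimal sketch of that staging-then-rename idea, assuming a hypothetical stageAndCommit helper and a stubbed-out verifyExistingPackage check (neither is real Terraform code). The staging directory is created inside the cache directory so that the final os.Rename stays on one filesystem, where it is atomic on POSIX systems; the Windows caveat above still applies:

```go
// Sketch only: hypothetical helpers, not Terraform's actual installer code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// stageAndCommit unpacks a package into a hidden staging directory that
// concurrent readers won't treat as a cache entry, then publishes it with a
// single rename.
func stageAndCommit(cacheDir, packageName string, unpack func(dst string) error) error {
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return err
	}
	finalPath := filepath.Join(cacheDir, packageName)

	// The leading "." hides the staging directory from cache scans.
	staging, err := os.MkdirTemp(cacheDir, "."+packageName+"-")
	if err != nil {
		return err
	}
	defer os.RemoveAll(staging) // no-op once the rename has succeeded

	if err := unpack(staging); err != nil {
		return err
	}

	// Atomic publish: staging lives inside cacheDir, so both paths are on
	// the same filesystem and the rename cannot be observed half-done.
	if err := os.Rename(staging, finalPath); err != nil {
		if _, statErr := os.Stat(finalPath); statErr == nil {
			// Another process won the race; accept its copy if it matches.
			return verifyExistingPackage(finalPath)
		}
		return err
	}
	return nil
}

// verifyExistingPackage stands in for re-checking the existing directory
// against the dependency lock file checksums.
func verifyExistingPackage(dir string) error {
	fmt.Println("package already present, verifying:", dir)
	return nil
}

func main() {
	err := stageAndCommit("/tmp/plugin-cache", "registry.terraform.io_hashicorp_aws_4.34.0_linux_amd64",
		func(dst string) error {
			return os.WriteFile(filepath.Join(dst, "terraform-provider-aws_v4.34.0"), []byte("..."), 0o755)
		})
	fmt.Println("install result:", err)
}
```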
@apparentlymart maybe it'll work without locking?
As far as I'm aware, none of the installation methods are truly atomic today: even if we first extract into a temporary directory and then copy into the final location, there is no way to atomically copy a whole directory tree, and so there will still be a window of time where another concurrent process can observe a directory that exists but doesn't yet have all of the expected files inside it, or has partial content for one file. In both of those situations the observing process will calculate a checksum from the partial data and so reject the incomplete directory.

I think the goal/requirement here is that at any instant there is either a fully complete and correct package in the cache or no package at all. If we can make that true then we can achieve safe concurrent use without any need for locks.
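As an illustration of why a reader can't accidentally accept partial data: Terraform's h1: package hashes follow the same dirhash scheme as Go modules, which hashes every file in the directory tree, so a missing or truncated file produces a different result than the value recorded in the dependency lock file. A sketch using golang.org/x/mod/sumdb/dirhash, with a made-up lock-file value:

```go
// Illustrative only: shows why a half-written directory fails verification.
package main

import (
	"fmt"
	"log"

	"golang.org/x/mod/sumdb/dirhash"
)

func main() {
	// Example local cache path for one provider package.
	dir := ".terraform/providers/registry.terraform.io/hashicorp/aws/4.34.0/linux_amd64"

	// HashDir reads every file under dir, so a missing file or partial
	// content yields a different hash than the completed package would.
	got, err := dirhash.HashDir(dir, "", dirhash.Hash1)
	if err != nil {
		log.Fatal(err)
	}

	// Made-up stand-in for the h1: value in the dependency lock file.
	want := "h1:AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA="
	if got != want {
		fmt.Println("rejecting cache entry: checksum mismatch")
		return
	}
	fmt.Println("cache entry verified")
}
```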
Linking #25849 which has more history on this request |
On Linux (only), as the plugin cache is only a cache, failure to write any entry should not block continuing with the original operation.
@jbardin over in #32915 I encountered the "text file busy" issue and you pointed to here for an upcoming fix.
So the short of it is that the provider executable itself is just sitting there spinning. I am using terragrunt, FYI. The parent ID is my …

Edit: verified that this happens using straight terraform without using terragrunt. Verified with terraform v1.4.4.
We can work around the locking issue: copy files from the temporary location to the target directory, and create a special file (e.g. a marker written only once the copy is complete). It is possible that more than one concurrent job will create duplicate provider cache directories; in case we have duplicates, we can just pick the first occurrence and delete the remaining ones. Also, we will need a process to periodically scan for incomplete cache directories and remove them.
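A rough sketch of the marker-file part of that scheme, using a hypothetical .complete sentinel name (the actual suggested filename was cut off in the comment above): writers create the marker only after everything else is in place, and readers, or the periodic cleanup scan, treat directories without it as incomplete:

```go
// Sketch only: a hypothetical completion-marker convention for cache entries.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical sentinel name; Terraform defines no such file today.
const markerName = ".complete"

// publish fills the directory first and drops the marker last, so any
// directory that has the marker is guaranteed to be fully populated.
func publish(dir string, populate func(string) error) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	if err := populate(dir); err != nil {
		return err
	}
	f, err := os.Create(filepath.Join(dir, markerName))
	if err != nil {
		return err
	}
	return f.Close()
}

// isComplete is the check a concurrent reader, or the periodic cleanup scan
// suggested above, would apply before trusting (or deleting) an entry.
func isComplete(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, markerName))
	return err == nil
}

func main() {
	dir := "cache/registry.terraform.io/hashicorp/aws/4.34.0/linux_amd64"
	err := publish(dir, func(d string) error {
		return os.WriteFile(filepath.Join(d, "terraform-provider-aws_v4.34.0"), []byte("..."), 0o755)
	})
	fmt.Println("published:", err == nil, "complete:", isComplete(dir))
}
```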
Locks (as files or dirs) do not work, based on the pain of the lock implementation in antonbabenko/pre-commit-terraform#620. Also, the lock-free approach just works more than 5 times quicker when you init 50+ dirs at once, compared to an implementation with a lock mechanism.
A possibly naive question, but why does terraform re-download a provider that already exists in the cache at all? (We face the issue where an active …)
I have the same question. It seems like odd behavior to overwrite a binary that was already "cached". Kind of defeats the purpose of the cache in the first place.
Not to mention that re-downloading existing providers (versus checking their checksums against the desired ones) is probably costing terraform (or github) a non-insignificant amount of traffic!
That's already fixed by #34632; it will be GA when 1.8.0 is released. For now it exists only in pre-releases, where it has been for two months.
From what I can see this: |
@wosiu regarding core Terraform (not including the code for the various backends), only official maintainer approval matters. Nothing has otherwise changed with the status of that PR or the reason why it is in stasis.

I am not sure why GitHub allows drive-by PR approvals, although yesterday there was a big permissions overhaul, so it may no longer be possible. Sorry for the confusion. If the person is not in the HashiCorp organization (as indicated by GitHub), the approval is not meaningful.
Terraform Version
Terraform Configuration Files
Debug Output
no debug
Expected Behavior
Multiple terraform processes should work fine with the same plugin_cache_dir.
Actual Behavior
Initializing provider plugins...
╷
│ Error: Failed to install provider from shared cache
│
│ Error while importing hashicorp/aws v4.34.0 from the shared cache
│ directory: the provider cache at .terraform/providers has a copy of
│ registry.terraform.io/hashicorp/aws 4.34.0 that doesn't match any of the
│ checksums recorded in the dependency lock file.
Steps to Reproduce
I'm using terragrunt, which does initialization of multiple tf stacks at once.
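A reproduction without terragrunt is possible too; the sketch below assumes a terraform binary on PATH and two hypothetical directories, stack-a and stack-b, whose configurations require the same provider. Running it repeatedly on Terraform v1.3.x can intermittently produce the checksum error shown above:

```go
// Sketch only: races two terraform init runs against one shared plugin cache.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sync"
)

func main() {
	// Fresh shared cache so both inits must install the provider.
	cache, err := os.MkdirTemp("", "tf-plugin-cache-")
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	for _, dir := range []string{"./stack-a", "./stack-b"} {
		wg.Add(1)
		go func(dir string) {
			defer wg.Done()
			cmd := exec.Command("terraform", "init")
			cmd.Dir = dir
			cmd.Env = append(os.Environ(), "TF_PLUGIN_CACHE_DIR="+cache)
			if out, err := cmd.CombinedOutput(); err != nil {
				// On affected versions this sometimes prints "Failed to
				// install provider from shared cache".
				fmt.Printf("%s failed: %v\n%s\n", dir, err, out)
			}
		}(dir)
	}
	wg.Wait()
}
```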
Additional Context
Race condition between two terraform init processes happens when they are trying to install the same provider, same version. The first terraform calls installFromHTTPURL, and it downloads to a temporary file with a random name, but then it calls installFromLocalArchive, and this unpacks directly into the global plugins cache directory - this is where the race condition occurs.

1. globalCacheDir: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L470
2. InstallPackage is called here: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/installer.go#L482
3. InstallPackage calls installFromHTTPURL here: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/dir_modify.go#L34
4. installFromHTTPURL downloads the archive to a temporary file, so no possibility for a race condition here, good: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L56
5. installFromHTTPURL calls installFromLocalArchive: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L97
6. installFromLocalArchive starts decompressing directly to the final path: https://github.com/hashicorp/terraform/blob/v1.3.2/internal/providercache/package_install.go#L128

So if another terraform init happens to see the half-unpacked plugin in the middle of step 6, it will use a file that is not ready.

References
gruntwork-io/terragrunt#1875