
Better error handling and recovery when using custom batch pin functionality #1255

Open
awrichar opened this issue Apr 3, 2023 · 3 comments
Labels
enhancement New feature or request

Comments


awrichar commented Apr 3, 2023

Part of FIR-17
Follow-on to #1213

One major issue with the functionality added in the PR above is that it's possible to get into a bad state that prevents FireFly from dispatching new messages within particular constraints. Specifically, if both of these criteria are met:

  • You submit an /invoke request that includes a message, using the newly added functionality
  • The invoke request fails in a non-recoverable way (such as when the parameters to the smart contract are malformed or unacceptable)

then the message batch will be built, sealed, and sent to the smart contract, but the blockchain transaction will fail. This will be retried forever, failing every time, and blocking the batch processor from processing more work (see below for details of the blockage).


FireFly creates "batch processors" to assemble and dispatch message batches. One is created for each unique combination of these dimensions:

  • message type
  • transaction type
  • author DID
  • blockchain signing key
  • group hash (for private messages)
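As a rough sketch of this dispatch model (FireFly itself is written in Go; the class and function names here are illustrative, not the real ones), the batch processor key can be thought of as a tuple of those five dimensions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Message:
    msg_type: str          # e.g. "broadcast" or "private"
    tx_type: str           # e.g. "contract_invoke_pin"
    author_did: str
    signing_key: str
    group_hash: Optional[str]  # only set for private messages

def dispatch_key(msg: Message) -> tuple:
    # One batch processor exists per unique key; every message sharing
    # a key is funneled through the same (potentially blocked) processor.
    return (msg.msg_type, msg.tx_type, msg.author_did,
            msg.signing_key, msg.group_hash)
```

Two messages that differ in any dimension land on different processors; two that match on all five share one, which is why a single stuck batch can block such a wide scope of traffic.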

In the "blocked" scenario described, no more messages for the blocked processor will be processed. Since this is limited to the new transaction type contract_invoke_pin, effective blockage scopes might be:

  • all contract_invoke_pin broadcasts from a particular identity
  • all contract_invoke_pin private messages from a particular identity to a particular privacy group

This is a very large scope of blockage - notably it is well beyond the scope of a single topic or ordering context.

In addition, for private messages, each message will claim a nonce/pin for each of its contexts (where a context is a unique combination of a topic and a group hash). If such a pin is assigned but never successfully dispatched, this will effectively kill that topic, as pins need to be processed sequentially, and a gap of this kind means no further message from that author will be honored.
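A toy Python model of why such a gap is fatal (the hashing and names are illustrative, not FireFly's actual implementation): a recipient advances through an author's pins strictly in nonce order within a context, so a nonce that was claimed but never successfully dispatched stalls everything after it.

```python
import hashlib
from typing import Dict, List

def context_hash(topic: str, group_hash: str) -> str:
    # A "context" is a unique (topic, group hash) combination;
    # the exact hashing scheme here is illustrative only.
    return hashlib.sha256(f"{topic}|{group_hash}".encode()).hexdigest()

def next_deliverable(expected_nonce: int, received: Dict[int, str]) -> List[str]:
    # Pins must be processed strictly in nonce order. If a nonce was
    # claimed by a batch that never landed on-chain, the resulting gap
    # stalls every later message on that context from the same author.
    delivered = []
    while expected_nonce in received:
        delivered.append(received[expected_nonce])
        expected_nonce += 1
    return delivered
```

For example, if nonces 0 and 1 arrived but nonce 2 was spent by a batch that never dispatched, a message at nonce 3 can never be delivered on that topic.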


Before the next feature release, we need some investigation into how the scope of such a blockage can be limited, how a user can most effectively avoid getting into this situation, and how a user can recover if this does happen.

@awrichar awrichar added the enhancement New feature or request label Apr 3, 2023

awrichar commented Apr 3, 2023

Some ideas for specific things that could be investigated to solve these problems:

  1. A better event or API for detecting this kind of failure - ie could emit an event when a "batch pin" operation fails, or attempt to further flesh out the "transaction status" API in some way
  2. A manual API to cancel a contract_invoke_pin transaction, such that the batch processor will abandon it and move on
  3. A manual way to trigger a "gap fill" of nonces - ie to advertise to batch recipients that some nonces were spent but ultimately not dispatched - this is closely tied to (2) and could even be an automatic side-effect
  4. A policy to automatically give up on a contract_invoke_pin transaction after some specific failures of this nature, such that the batch processor will abandon it and move on - closely related to (2) and (3)
  5. A pre-flight check before sealing the batch, to see if the blockchain transaction is likely to succeed - ie performing a "call" or "estimate gas"
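Idea (5) might look something like the following hypothetical sketch (assuming a generic JSON-RPC call function for an EVM chain; the real blockchain connector API will differ), where an eth_estimateGas revert is taken as a strong signal that the on-chain transaction would also fail:

```python
def preflight_ok(rpc_call, contract_addr: str, calldata: str, signer: str) -> bool:
    # Hypothetical pre-flight: run eth_estimateGas against the node
    # before sealing the batch. A revert here strongly suggests the real
    # transaction would fail non-recoverably, so the batch processor
    # could reject the message up front instead of retrying forever.
    try:
        rpc_call("eth_estimateGas",
                 {"from": signer, "to": contract_addr, "data": calldata})
        return True
    except Exception:
        return False
```

Note this is only a heuristic: chain state can change between the pre-flight and the real submission, so it narrows the window for the blocked scenario rather than eliminating it.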

@peterbroadhurst
peterbroadhurst commented

One thing not on the list that I'd like your thoughts on, @awrichar:

  1. In the case of a batchOfOne, if submission fails X times in the retry loop, mark it as Failed and move on. This leaves the topics of any messages in the batch blocked until this is fixed, but allows other topics to progress.

This might go some way to solving (1) "detection API", and when combined with (3) "gap fill" API I wonder if this gives the narrower scope of failure that we need.
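A minimal sketch of this retry-cap idea (the function names and the exception-based failure signal are assumptions for illustration, not FireFly's actual operation model): cap the retry loop for a batch-of-one, mark it Failed, and let the processor move on.

```python
def submit_with_cap(submit, max_attempts: int) -> str:
    # Sketch of the suggestion above: for a batch-of-one, stop retrying
    # after max_attempts failures, mark the operation "Failed", and let
    # the batch processor move on to other work. The topics of messages
    # in the failed batch stay blocked until a gap fill, but everything
    # else can progress.
    for _attempt in range(max_attempts):
        try:
            submit()
            return "Succeeded"
        except Exception:
            continue
    return "Failed"
```

The key property is that failure is now terminal and visible, rather than an infinite retry loop holding the whole processor hostage.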

@awrichar

awrichar commented Apr 4, 2023

@peterbroadhurst yes I think that's (4) 🙂
