
Better error handling and recovery when using custom batch pin functionality #1255

Open
awrichar opened this issue Apr 3, 2023 · 3 comments
Labels
enhancement New feature or request

Comments


awrichar commented Apr 3, 2023

Part of FIR-17
Follow-on to #1213

One major issue with the functionality added in the PR above is that it's possible to get into a bad state that prevents FireFly from dispatching new messages within particular constraints. Specifically, if both of these criteria are met:

  • You submit an /invoke request that includes a message, using the newly added functionality
  • The invoke request fails in a non-recoverable way (such as when the parameters to the smart contract are malformed or unacceptable)

then the message batch will be built, sealed, and sent to the smart contract, but the blockchain transaction will fail. This will be retried forever, failing every time, and blocking the batch processor from processing more work (see below for details of the blockage).


FireFly creates "batch processors" to assemble and dispatch message batches. One is created for each unique combination of these dimensions:

  • message type
  • transaction type
  • author DID
  • blockchain signing key
  • group hash (for private messages)
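As a rough sketch of this dispatch model (FireFly itself is written in Go; the class and function names here are illustrative, not the real ones), the batch processor key can be thought of as a tuple of those five dimensions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Message:
    msg_type: str          # e.g. "broadcast" or "private"
    tx_type: str           # e.g. "contract_invoke_pin"
    author_did: str
    signing_key: str
    group_hash: Optional[str]  # only set for private messages

def dispatch_key(msg: Message) -> tuple:
    # One batch processor exists per unique key; every message sharing
    # a key is funneled through the same (potentially blocked) processor.
    return (msg.msg_type, msg.tx_type, msg.author_did,
            msg.signing_key, msg.group_hash)
```

Two messages that differ in any dimension land on different processors; two that match on all five share one, which is why a single stuck batch can block such a wide scope of traffic.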

In the "blocked" scenario described, no more messages for the blocked processor will be processed. Since this is limited to the new transaction type contract_invoke_pin, effective blockage scopes might be:

  • all contract_invoke_pin broadcasts from a particular identity
  • all contract_invoke_pin private messages from a particular identity to a particular privacy group

This is a very large scope of blockage - notably it is well beyond the scope of a single topic or ordering context.

In addition, for private messages, each message will claim a nonce/pin for each of its contexts (where a context is a unique combination of a topic and a group hash). If such a pin is assigned but never successfully dispatched, this will effectively kill that topic, as pins need to be processed sequentially, and a gap of this kind means no further message from that author will be honored.
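A toy Python model of why such a gap is fatal (the hashing and names are illustrative, not FireFly's actual implementation): a recipient advances through an author's pins strictly in nonce order within a context, so a nonce that was claimed but never successfully dispatched stalls everything after it.

```python
import hashlib
from typing import Dict, List

def context_hash(topic: str, group_hash: str) -> str:
    # A "context" is a unique (topic, group hash) combination;
    # the exact hashing scheme here is illustrative only.
    return hashlib.sha256(f"{topic}|{group_hash}".encode()).hexdigest()

def next_deliverable(expected_nonce: int, received: Dict[int, str]) -> List[str]:
    # Pins must be processed strictly in nonce order. If a nonce was
    # claimed by a batch that never landed on-chain, the resulting gap
    # stalls every later message on that context from the same author.
    delivered = []
    while expected_nonce in received:
        delivered.append(received[expected_nonce])
        expected_nonce += 1
    return delivered
```

For example, if nonces 0 and 1 arrived but nonce 2 was spent by a batch that never dispatched, a message at nonce 3 can never be delivered on that topic.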


Before the next feature release, we need some investigation into how the scope of such a blockage can be limited, how a user can most effectively avoid getting into this situation, and how a user can recover if this does happen.

@awrichar awrichar added the enhancement New feature or request label Apr 3, 2023

awrichar commented Apr 3, 2023

Some ideas for specific things that could be investigated to solve these problems:

  1. A better event or API for detecting this kind of failure - ie could emit an event when a "batch pin" operation fails, or attempt to further flesh out the "transaction status" API in some way
  2. A manual API to cancel a contract_invoke_pin transaction, such that the batch processor will abandon it and move on
  3. A manual way to trigger a "gap fill" of nonces - ie to advertise to batch recipients that some nonces were spent but ultimately not dispatched - this is closely tied to (2) and could even be an automatic side-effect
  4. A policy to automatically give up on a contract_invoke_pin transaction after some specific failures of this nature, such that the batch processor will abandon it and move on - closely related to (2) and (3)
  5. A pre-flight check before sealing the batch, to see if the blockchain transaction is likely to succeed - ie performing a "call" or "estimate gas"
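Idea (5) might look something like the following hypothetical sketch (assuming a generic JSON-RPC call function for an EVM chain; the real blockchain connector API will differ), where an eth_estimateGas revert is taken as a strong signal that the on-chain transaction would also fail:

```python
def preflight_ok(rpc_call, contract_addr: str, calldata: str, signer: str) -> bool:
    # Hypothetical pre-flight: run eth_estimateGas against the node
    # before sealing the batch. A revert here strongly suggests the real
    # transaction would fail non-recoverably, so the batch processor
    # could reject the message up front instead of retrying forever.
    try:
        rpc_call("eth_estimateGas",
                 {"from": signer, "to": contract_addr, "data": calldata})
        return True
    except Exception:
        return False
```

Note this is only a heuristic: chain state can change between the pre-flight and the real submission, so it narrows the window for the blocked scenario rather than eliminating it.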

@peterbroadhurst
peterbroadhurst commented

One thing not on the list that I'd like your thoughts on, @awrichar:

  1. In the case of a batchOfOne, if submission fails X times in the retry loop, mark it as Failed and move on. This leaves the topics of any messages in the batch blocked until this is fixed, but allows other topics to progress.

This might go some way to solving (1) "detection API", and when combined with (3) "gap fill" API I wonder if this gives the narrower scope of failure that we need.
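A minimal sketch of this retry-cap idea (the function names and the exception-based failure signal are assumptions for illustration, not FireFly's actual operation model): cap the retry loop for a batch-of-one, mark it Failed, and let the processor move on.

```python
def submit_with_cap(submit, max_attempts: int) -> str:
    # Sketch of the suggestion above: for a batch-of-one, stop retrying
    # after max_attempts failures, mark the operation "Failed", and let
    # the batch processor move on to other work. The topics of messages
    # in the failed batch stay blocked until a gap fill, but everything
    # else can progress.
    for _attempt in range(max_attempts):
        try:
            submit()
            return "Succeeded"
        except Exception:
            continue
    return "Failed"
```

The key property is that failure is now terminal and visible, rather than an infinite retry loop holding the whole processor hostage.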

@awrichar

awrichar commented Apr 4, 2023

@peterbroadhurst yes I think that's (4) 🙂
