All events fail to process on a node if the source event's plugin configuration is missing #366

Open
jebonfig opened this issue Dec 23, 2021 · 3 comments

@jebonfig
Contributor

Steps to reproduce:

  • Create FF node1 configured with the erc1155 tokens plugin
  • Create a token pool / mint some tokens on FF node1
  • Create and add FF node2 to the network configured without the erc1155 tokens plugin
  • FF node2 fails to process the token events and gets stuck in a rollback loop covering all event processing. This rollback appears to prevent other, non-token events from being processed, which in turn prevents the org from registering itself in the network

Expected behavior:
I would expect events received from a plugin that is not configured on a node to be ignored, or otherwise handled gracefully, until that plugin is configured on that node.

Log snippet from FF node2:

[]  INFO Node not yet registered pid=284
[]  INFO Confirming system broadcast 'ff_define_node' [3760c6f0-da92-4610-8da6-31efb3edbf7f] dbtx=urGDLcz7 pid=284 role=aggregator
[]  INFO ==> PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 breq=i2CSTj7B dx=https pid=284
[]  INFO <== PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 [200] (104.76ms) breq=i2CSTj7B dx=https pid=284
[]  INFO Emitting message_confirmed for message ff_system:3760c6f0-da92-4610-8da6-31efb3edbf7f dbtx=urGDLcz7 pid=284 role=aggregator
[]  INFO Confirming system broadcast 'ff_define_pool' [34583289-1072-4e65-ba77-b9c9726a3a82] dbtx=urGDLcz7 pid=284 role=aggregator
[] ERROR Failed to activate token pool 'd4b913a5-5482-41e2-9100-493abbd724f1': FF10272: Unknown tokens plugin 'erc1155' dbtx=urGDLcz7 pid=284 role=aggregator
[]  WARN SQL! transaction rollback dbtx=urGDLcz7 pid=284 role=aggregator
[] ERROR process events attempt 1: FF10272: Unknown tokens plugin 'erc1155' pid=284 role=ep[ff_system:ff_aggregator]
[]  INFO Confirming system broadcast 'ff_define_node' [3760c6f0-da92-4610-8da6-31efb3edbf7f] dbtx=BxMYSRuO pid=284 role=aggregator
[]  INFO ==> PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 breq=Y0MYmZ0_ dx=https pid=284
[]  INFO <== PUT http://localhost:4000/api/v1/peers/Kaleido-zzg7nwrkr3 [200] (73.50ms) breq=Y0MYmZ0_ dx=https pid=284
[]  INFO Emitting message_confirmed for message ff_system:3760c6f0-da92-4610-8da6-31efb3edbf7f dbtx=BxMYSRuO pid=284 role=aggregator
[]  INFO Confirming system broadcast 'ff_define_pool' [34583289-1072-4e65-ba77-b9c9726a3a82] dbtx=BxMYSRuO pid=284 role=aggregator
[] ERROR Failed to activate token pool 'd4b913a5-5482-41e2-9100-493abbd724f1': FF10272: Unknown tokens plugin 'erc1155' dbtx=BxMYSRuO pid=284 role=aggregator
[]  WARN SQL! transaction rollback dbtx=BxMYSRuO pid=284 role=aggregator
[] ERROR process events attempt 2: FF10272: Unknown tokens plugin 'erc1155' pid=284 role=ep[ff_system:ff_aggregator]
[]  INFO Node not yet registered pid=284
.... (reattempt sequence repeats)
[]  INFO <-- POST /api/v1/network/organizations/self [408] (120003.10ms): FF10260: The request with id '...' timed out after 119,298.88ms httpreq=lsvN1NrO pid=146 req=ho3P3xSP
@peterbroadhurst
Contributor

@jebonfig - I've assigned this over to @awrichar for some deeper thinking, but this is a hard one to solve.

FireFly's model is that every node processes broadcast events in the same order, so that all nodes build the same shared state. Ignoring an event is therefore a significant thing to do.

However, we do also have the concept of topics that designate separate streams of messages that must be blocked until their messages are complete. In this case, I think it would be valid to consider the lack of a suitable plugin as a reason to consider a message such as this incomplete, and rewind to it when the plugin configuration is available. Then it would be just that one topic that is blocked - rather than all broadcasts.

The problem is the complexity of detecting the addition of new token config as an event, and working out how to rewind to the blocked message (as I'm not sure there's any indexed field necessarily available to detect the situation). I'll leave @awrichar to consider the possibility, and the cost vs. benefit of this.
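
The topic-scoped blocking idea described above can be illustrated with a minimal Go sketch. It is not FireFly's aggregator code: the `Message` type, the `processInOrder` function, and the topic names `identity` and `tokens` are all hypothetical, chosen only to show how a message with a missing plugin could block its own topic while other topics keep flowing.

```go
package main

import "fmt"

// Message is a hypothetical, simplified broadcast message.
type Message struct {
	ID     string
	Topic  string
	Plugin string // tokens plugin needed to process the message, "" if none
}

// processInOrder walks messages in their shared sequence order. A message
// whose plugin is not configured blocks its own topic only; later messages on
// that topic are held back to preserve ordering, other topics carry on.
func processInOrder(msgs []Message, plugins map[string]bool) ([]string, map[string]string) {
	var processed []string
	blocked := map[string]string{} // topic -> ID of the first blocking message
	for _, m := range msgs {
		if _, isBlocked := blocked[m.Topic]; isBlocked {
			continue // topic already blocked: do not process out of order
		}
		if m.Plugin != "" && !plugins[m.Plugin] {
			blocked[m.Topic] = m.ID // remember where to rewind to later
			continue
		}
		processed = append(processed, m.ID)
	}
	return processed, blocked
}

func main() {
	msgs := []Message{
		{ID: "define_node", Topic: "identity"},
		{ID: "define_pool", Topic: "tokens", Plugin: "erc1155"},
		{ID: "pool_followup", Topic: "tokens"},
		{ID: "define_org", Topic: "identity"},
	}
	// No tokens plugin configured, as on node2 in this report.
	processed, blocked := processInOrder(msgs, map[string]bool{})
	fmt.Println("processed:", processed)
	fmt.Println("blocked:", blocked)
}
```

In this sketch the `blocked` map doubles as the rewind point per topic, which is the piece the comment above flags as hard to derive in the real data model.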

@awrichar
Copy link
Contributor

It is (or should be) an explicit requirement for all nodes to have the exact same token config. Operating under any other state is considered a malformed configuration with potentially undefined behavior. But we could give some more thought over how to gracefully handle this error scenario...

@awrichar
Contributor

awrichar commented Dec 27, 2021

A few notes for the record:

  • Token pool definitions happen on the same topic as datatype definitions, of the form ff_ns_{namespace}.
  • The error in question occurs when a token pool definition is received. The token pool is written to the database in "pending" state, then the manager attempts to locate the proper plugin and activate the pool. When no plugin can be found, the handler errors out and rolls back the database transaction.
  • This failure is flagged as SystemBroadcastAction.ActionRetry, so it will retry processing indefinitely and will not process any further events.
  • I assume (need to confirm) that adding the proper config and restarting the node would allow things to recover.
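
As a rough illustration of the notes above (not the actual aggregator code), the Go sketch below models why a single retryable failure stalls everything behind it. The `handler` type and the `processBatch` and `retryBatch` functions are hypothetical; the real loop retries indefinitely, whereas this sketch takes a `maxAttempts` bound so that it terminates.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnknownPlugin stands in for the FF10272 error seen in the logs.
var errUnknownPlugin = errors.New("unknown tokens plugin")

// handler is a hypothetical per-event handler; a non-nil error means "retry".
type handler func() error

// processBatch mimics a single database transaction: either every event in
// the batch is handled, or the first error rolls back the whole batch.
func processBatch(batch []handler) (int, error) {
	for i, h := range batch {
		if err := h(); err != nil {
			return 0, fmt.Errorf("event %d: %w", i, err) // rollback, nothing committed
		}
	}
	return len(batch), nil
}

// retryBatch re-runs the same batch until it succeeds or maxAttempts is hit.
// It returns the committed event count, the attempts used, and the last error.
func retryBatch(batch []handler, maxAttempts int) (int, int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		committed, err := processBatch(batch)
		if err == nil {
			return committed, attempt, nil
		}
		lastErr = err
	}
	return 0, maxAttempts, lastErr
}

func main() {
	ok := func() error { return nil }
	missingPlugin := func() error { return errUnknownPlugin }

	// Node definition, then a pool definition with no plugin, then another event.
	batch := []handler{ok, missingPlugin, ok}
	committed, attempts, err := retryBatch(batch, 3)
	fmt.Println(committed, attempts, err)
}
```

Because the successful first handler is rolled back together with the failing one, the node definition in the log snippet is confirmed again on every attempt and never committed.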

To flesh out Peter's suggestion above, I think these would be the needed steps:

  1. Move token pool definitions to a dedicated topic - I'd suggest one per plugin name, i.e. ff_token_{plugin}.
  2. Consider this a non-fatal error in the event handler - return SystemBroadcastAction.ActionWait and probably allow the token pool to be written to the database in "pending" state despite this (or consider a distinct state other than "pending" here).
  3. Find a way to re-process blocked token pool requests when the config changes. This implies one of two options:
    a. Cache all config and track changes between starts.
    b. On every start, scan for token pools in "pending" state, then check if their plugin name is found in the current config. Assume this may be a newly added config, so correlate from the token pool back to the message batch, and tell the aggregator to rewind and re-process that batch.
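
Steps 2 and 3b above could look roughly like the following Go sketch. Everything in it is an assumption made for illustration: the `TokenPool` fields (in particular `BatchID` as the link back to the message batch), the local `action` constants standing in for `SystemBroadcastAction` values, and the `erc20` plugin name. It does not reflect FireFly's actual types.

```go
package main

import "fmt"

// TokenPool is a hypothetical, trimmed-down pool record.
type TokenPool struct {
	ID      string
	Plugin  string
	State   string // "pending" or "confirmed"
	BatchID string // assumed link from the pool back to its message batch
}

// action is a local stand-in for the handler's return value.
type action int

const (
	actionConfirm action = iota
	actionWait
)

// handlePoolDefinition sketches step 2: a missing plugin is non-fatal, the
// pool is kept in "pending" state and the handler asks to wait, not retry.
func handlePoolDefinition(pool *TokenPool, configured map[string]bool) action {
	if !configured[pool.Plugin] {
		pool.State = "pending"
		return actionWait
	}
	pool.State = "confirmed"
	return actionConfirm
}

// batchesToRewind sketches step 3b: on startup, find pending pools whose
// plugin is now configured and return their batches, deduplicated, in
// first-seen order, so the aggregator could be told to re-process them.
func batchesToRewind(pools []TokenPool, configured map[string]bool) []string {
	seen := map[string]bool{}
	var batches []string
	for _, p := range pools {
		if p.State != "pending" || !configured[p.Plugin] {
			continue
		}
		if !seen[p.BatchID] {
			seen[p.BatchID] = true
			batches = append(batches, p.BatchID)
		}
	}
	return batches
}

func main() {
	pools := []TokenPool{
		{ID: "pool-a", Plugin: "erc1155", State: "pending", BatchID: "batch-1"},
		{ID: "pool-b", Plugin: "erc1155", State: "pending", BatchID: "batch-1"},
		{ID: "pool-c", Plugin: "erc20", State: "pending", BatchID: "batch-2"},
	}
	// The erc1155 plugin was added to the config since the last start.
	fmt.Println(batchesToRewind(pools, map[string]bool{"erc1155": true}))
}
```

The scan is cheap only if pending pools can be found by state and correlated to a batch, which is the open question raised earlier in this thread about indexed fields.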
