
Conversation

@jjezra (Contributor) commented Nov 6, 2025

When it catches an exception, the IndexingMerger attempts to detect whether it should lessen work and retry. Currently this relies on an include list of known "retryable" errors; if an error is not detected correctly, or was not anticipated, the merger will not retry.
Since merging is a background process, however, it is better and safer to retry after timeouts and after any FDB exception.

Resolves #3731

@jjezra added the enhancement label Nov 6, 2025
if (ex == null) {
    return false;
}
final Set<Integer> lessenWorkCodes = new HashSet<>(Arrays.asList(
Contributor:

This set can be declared as a static instance - no need to instantiate every time
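Roughly, the suggestion amounts to something like the following sketch (the field name is illustrative, and the codes shown would simply be the ones already in the list):

// Requires java.util.Arrays, java.util.HashSet, java.util.Set and com.apple.foundationdb.FDBError.
// Built once when the class is loaded rather than on every call.
private static final Set<Integer> LESSEN_WORK_CODES = new HashSet<>(Arrays.asList(
        FDBError.TRANSACTION_TOO_OLD.code()
        /* ...the rest of the existing "retryable" codes... */));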

Contributor Author:

Just wondering - wouldn't the JVM do the right thing here? It seems like an obvious optimization, and I know other languages perform this optimization (and much more).

Contributor:

From what I saw, the Java compiler and the JIT compiler do little in terms of optimizing final; it is mostly intended to convey design intent and to protect against unintentional change. Setting this to static final would probably be a good way to indicate that the set never changes, and it would prevent unneeded extra work.


@jjezra requested a review from ohadzeliger on November 7, 2025 21:03
// Expecting AsyncToSyncTimeoutException or an instance of TimeoutException. However, cannot
// refer to AsyncToSyncTimeoutException without creating a lucene dependency
// TODO: remove this kludge
if (e.getClass().getCanonicalName().contains("Timeout") ) {
Collaborator:

From the PR description:

Being a background process, however, it is better and safer to retry by default.

Why not just have this return false if there is any exception, except InterruptedException?
Here you abort on any exception that doesn't have Timeout in its name or an FDBException as its cause.

Either way, it seems like having a test with a mock IndexMaintainer that has various failure scenarios would be valuable.
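A minimal sketch of that alternative, assuming the surrounding predicate returns true when the merger should stop retrying (the exact semantics of the return value are an assumption here):

// Keep the existing null check; otherwise retry (return false) on any exception except an interrupt.
return e == null ||
        IndexingBase.findException(e, InterruptedException.class) != null;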

}
};
-for (int i = 0; i < 100; i++) {
+for (int i = 0; i < 20; i++) {
Contributor Author:

@ScottDugas, please verify:

  1. Now using the lower-level index maintainer's merge to avoid retries on timeout.
  2. With 100 repetitions the flakyMergeQuick test was reaching the 5-minute test limit. I had to take it down to 20. Is that acceptable?

Collaborator:

I think it would be better to disable all retries in the indexer used for merging.
The point of this test is to validate that if Lucene merge fails randomly in the middle it will still be usable, and not corrupted. Having retries in the IndexingMerger means that it could heal, but we want to make sure that requests coming in while the merge is ongoing don't get a corrupted view.

I'd also recommend adding whatever comments are necessary to make that clear.
The javadoc on the test is, apparently, too brief.

return e == null ||
        (IndexingBase.findException(e, RecordCoreTimeoutException.class) == null &&
                IndexingBase.findException(e, TimeoutException.class) == null &&
                IndexingBase.findException(e, FDBException.class) == null);
Collaborator:

Looking at the error codes, it seems like the following probably should not be retried:

  • batch_transaction_throttled -- this indicates the cluster is overloaded and trying not to do batch-level work. Retrying increases the load
  • tag_throttled -- Similar, but for a tag (I'm not sure if record layer supports tags, but if we don't know, we may in the future)

There's definitely a long list of errors in there that would be pointless to retry, which is why the original code had a list of error codes where it should lessen work, rather than a list where it shouldn't.
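If the exclude-list direction is taken, it could look roughly like this (a sketch only; it assumes the FDBError enum exposes these throttling entries, otherwise the raw codes would be used, and the helper name is hypothetical):

// Codes that indicate the cluster is shedding load; retrying would only add pressure.
private static final Set<Integer> NON_RETRYABLE_FDB_CODES = new HashSet<>(Arrays.asList(
        FDBError.BATCH_TRANSACTION_THROTTLED.code(),
        FDBError.TAG_THROTTLED.code()));

private static boolean shouldRetryFdbError(final FDBException fdbException) {
    return !NON_RETRYABLE_FDB_CODES.contains(fdbException.getCode());
}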

future.completeExceptionally(
        customTimeout ?
                new CustomOperationTimeoutException(exceptionMessage) :
                new java.util.concurrent.TimeoutException(exceptionMessage));
Collaborator:

This should not be fully qualified; it should use the import.

Suggested change:
-        new java.util.concurrent.TimeoutException(exceptionMessage));
+        new TimeoutException(exceptionMessage));

try (OnlineIndexer indexer = OnlineIndexer.newBuilder()
        .setRecordStoreBuilder(storeBuilder)
        .setTargetIndexesByName(List.of(INDEX_NAME))
        .setMaxAttempts(5)
Collaborator:

Why are you setting maxAttempts to 5 here?

assertTrue(1 < attemptCount.get());
}

private static void adjustMergeControl(final IndexMaintainerState state) {
Collaborator:

Nit: This findCause and the CustomOperationTimeoutException seem arbitrarily placed, as there are methods above and below that reference them. Generally, in the record layer, nested classes go at the end. Helper methods are a bit more flexible; either at the end, or at the end of the tests, would make sense, I think.

}
}

private static <T extends Throwable> T findCause(Throwable throwable, Class<T> causeType) {
Collaborator:

Perhaps this could be useful in TestHelpers.
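For reference, a cause-chain walk along these lines would be easy to host in TestHelpers (a sketch; the PR's actual implementation may differ):

@Nullable
public static <T extends Throwable> T findCause(@Nullable Throwable throwable, @Nonnull Class<T> causeType) {
    // Walk the cause chain until a throwable of the requested type is found.
    Throwable current = throwable;
    while (current != null) {
        if (causeType.isInstance(current)) {
            return causeType.cast(current);
        }
        current = current.getCause();
    }
    return null;
}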


// Wrap FDBException in multiple layers to test deep unwrapping
final CompletableFuture<Void> future = new CompletableFuture<>();
FDBException fdbEx = new FDBException("Transaction too old", FDBError.TRANSACTION_TOO_OLD.code());
Collaborator:

This is an error code that was explicitly marked as a lessen-work code before. Should this test be parameterized to cover:

  1. An error code that the runner would retry
  2. An error code that was in shouldLessen
  3. An error code that was not, and shouldn't be retried (e.g. no_cluster_file_found)
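As a rough sketch, the parameterization could use JUnit 5 along these lines (the test name and helper are illustrative, and it assumes the FDBError enum exposes the listed entries):

// Requires org.junit.jupiter.params.ParameterizedTest, org.junit.jupiter.params.provider.MethodSource,
// and java.util.stream.Stream.
static Stream<Integer> errorCodesToWrap() {
    return Stream.of(
            FDBError.NOT_COMMITTED.code(),           // an error the runner would retry
            FDBError.TRANSACTION_TOO_OLD.code(),     // previously in the lessen-work list
            FDBError.NO_CLUSTER_FILE_FOUND.code());  // should not be retried
}

@ParameterizedTest
@MethodSource("errorCodesToWrap")
void deeplyWrappedFdbExceptionIsHandled(int errorCode) {
    final FDBException fdbEx = new FDBException("injected failure", errorCode);
    // ...wrap fdbEx in multiple layers, as above, and assert the expected merger behavior for this code...
}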

* An exception that is thrown when the async to sync operation times out.
*/
-public static class AsyncToSyncTimeoutException extends RecordCoreException {
+public static class AsyncToSyncTimeoutException extends RecordCoreTimeoutException {
Collaborator:

Other than backwards compatibility, is there any reason to keep this class around? Should we mark it as API.Status.DEPRECATED and then, in the future, have Lucene throw RecordCoreTimeoutException instead?
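If that route is taken, the marking could look something like this sketch (the javadoc wording is illustrative):

/**
 * An exception that is thrown when the async to sync operation times out.
 * @deprecated use {@link RecordCoreTimeoutException} directly instead
 */
@API(API.Status.DEPRECATED)
@Deprecated
public static class AsyncToSyncTimeoutException extends RecordCoreTimeoutException {
    // existing constructors unchanged
}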


try {
    LOGGER.info(KeyValueLogMessage.of("Merge started",
            "iteration", i));
    explicitMergeIndex(dataModel.index, contextProps, dataModel.schemaSetup);
Collaborator:

Why not use explicitMergeIndex anymore? A comment in the code would probably be valuable indicating why we're not using the helper method.

AtomicInteger attemptCount = new AtomicInteger(0);

TestFactory.register(indexType, state -> {
    adjustMergeControl(state);
Collaborator:

The code does not retry any errors while opening the store. Are you intentionally leaving that as-is?
Are you intending to change that in a followup?
