Alexeyk/gitlab cache metadata recovery #11515

Open
AlexeyKuznetsov-DD wants to merge 7 commits into master from alexeyk/gitlab-cache-metadata-recovery

@AlexeyKuznetsov-DD
Contributor

What Does This Do

A workaround for cache restores failing on GitLab with FATAL: unexpected EOF.

Motivation

Since May 22, quite a lot of jobs have been failing randomly on GitLab with the error FATAL: unexpected EOF.
After some investigation I found a way to work around the issue by clearing the corrupted metadata.bin files.
That lets the cache be restored as part of the build.
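To illustrate the clearing step in isolation, here is a minimal sketch. It deliberately swaps in a much cruder check than the one this PR uses: it treats a workspace as damaged only when its metadata.bin is missing or zero bytes, whereas the PR relies on Gradle's own metadata reader. The function name `drop_damaged_workspaces`, the directory layout, and the demo tree are all hypothetical.

```shell
#!/usr/bin/env bash
set -u

# Hypothetical helper (not part of the PR): remove workspace dirs under "$1"
# whose metadata.bin is missing or empty.
# Assumption: each immediate subdirectory of "$1" is one workspace.
drop_damaged_workspaces() {
  local root="$1" ws
  for ws in "$root"/*/; do
    [ -d "$ws" ] || continue        # also covers the "glob matched nothing" case
    ws="${ws%/}"
    # -s is true only for an existing, non-empty file.
    if [ ! -s "$ws/metadata.bin" ]; then
      echo "dropping $ws"
      rm -rf "$ws"
    fi
  done
}

# Demo on a throwaway tree: one intact, one zero-byte, one missing metadata.bin.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/good" "$demo_root/truncated" "$demo_root/missing"
printf 'some-metadata-bytes' > "$demo_root/good/metadata.bin"
: > "$demo_root/truncated/metadata.bin"

drop_damaged_workspaces "$demo_root"
ls "$demo_root"
```

A size check like this would miss a file that is cut off partway through, which is presumably why the actual change validates each file with a real parser instead.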

Additional Notes

The real issue should be fixed in GitLab. This PR is a way to unblock merging to master.

AlexeyKuznetsov-DD requested a review from bric3 June 1, 2026 03:49
AlexeyKuznetsov-DD self-assigned this Jun 1, 2026
AlexeyKuznetsov-DD added the labels "type: bug" (Bug report and fix), "tag: no release notes" (Changes to exclude from release notes), and "comp: tooling" (Build & Tooling) Jun 1, 2026
AlexeyKuznetsov-DD marked this pull request as ready for review June 1, 2026 03:49
AlexeyKuznetsov-DD requested a review from a team as a code owner June 1, 2026 03:49
AlexeyKuznetsov-DD requested a review from a team as a code owner June 1, 2026 14:53
AlexeyKuznetsov-DD requested review from erikayasuda and removed request for a team June 1, 2026 14:53
@dd-octo-sts
Contributor

dd-octo-sts Bot commented Jun 1, 2026

🟢 Java Benchmark SLOs — All performance SLOs passed

| Suite | Status |
|---|---|
| Startup | 🟢 pass |

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results
| Scenario | Candidate | master | Δ (95% CI of mean) |
|---|---|---|---|
| startup:insecure-bank:iast:Agent | 13.96 s | 13.98 s | [-1.4%; +1.0%] (no difference) |
| startup:insecure-bank:tracing:Agent | 12.95 s | 12.93 s | [-0.8%; +1.1%] (no difference) |
| startup:petclinic:appsec:Agent | 16.66 s | 16.30 s | [+0.7%; +3.6%] (maybe worse) |
| startup:petclinic:iast:Agent | 16.56 s | 16.67 s | [-1.8%; +0.6%] (no difference) |
| startup:petclinic:profiling:Agent | 16.53 s | 16.62 s | [-2.1%; +1.1%] (no difference) |
| startup:petclinic:tracing:Agent | 15.74 s | 16.01 s | [-2.9%; -0.5%] (maybe better) |

Commit: c27a3208 · CI Pipeline · Benchmarking Platform UI


Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

Comment thread .gitlab-ci.yml
Comment on lines +252 to +304
# A partial/aborted GitLab cache extraction (the runner logs "FATAL: unexpected EOF" /
# "Failed to extract cache" then continues anyway) can leave a Gradle immutable-workspace
# (dependencies-accessors, groovy-dsl, kotlin-dsl, transforms) whose metadata.bin is either
# TRUNCATED (EOFException) or entirely MISSING (FileNotFoundException). Either way Gradle
# hard-fails during configuration with "Could not read workspace metadata" and does NOT
# self-heal. Capture evidence (good + damaged) for diagnosis, then use Gradle's own metadata
# reader to identify and drop only the damaged workspace dirs so Gradle regenerates them (only
# the entries that would fail anyway are removed). See gradle/gradle#28974.
- |
  GRADLE_METADATA_EVIDENCE_DIR="gradle-cache-metadata"
  mkdir -p "$GRADLE_METADATA_EVIDENCE_DIR"
  # Manifest of every metadata.bin with its byte size (damaged ones sort to the top at 0/near-0).
  find .gradle/caches -type f -name metadata.bin -printf '%10s %p\n' 2>/dev/null \
    | sort -n > "$GRADLE_METADATA_EVIDENCE_DIR/metadata-manifest.txt" || true
  damaged=""
  # ValidateGradleMetadata enumerates the immutable-workspace dirs, skips Gradle temporary
  # workspaces, and prints damaged ones as "<size>\t<reason>\t<workspace>" (exit 1 via wrapper).
  validator_output="$GRADLE_METADATA_EVIDENCE_DIR/validator-output.txt"
  validator_error="$GRADLE_METADATA_EVIDENCE_DIR/validator-error.txt"
  validator_status=0
  .gitlab/validate_gradle_metadata.sh "$GRADLE_VERSION" \
    > "$validator_output" 2> "$validator_error" || validator_status=$?
  if [ "$validator_status" -eq 1 ]; then
    while IFS=$'\t' read -r size reason ws; do
      [ -n "$ws" ] || continue
      [ -d "$ws" ] || continue
      meta="${ws}/metadata.bin"
      if [ -z "$damaged" ]; then
        echo -e "${TEXT_BOLD}${TEXT_YELLOW}[WARNING] Damaged Gradle metadata found, fixed:${TEXT_CLEAR}"
        damaged="yes"
      fi
      dest="$GRADLE_METADATA_EVIDENCE_DIR/corrupt/${ws}"
      mkdir -p "$dest" || true
      if [ -f "$meta" ]; then cp -p "$meta" "$dest/metadata.bin" 2>/dev/null || true; fi
      ls -la "$ws" > "$dest/dir-listing.txt" 2>/dev/null || true
      echo "  - ${ws} (metadata.bin size=${size}; ${reason})"
      rm -rf "$ws" || true
    done < "$validator_output"
  elif [ "$validator_status" -ne 0 ]; then
    echo -e "${TEXT_BOLD}${TEXT_YELLOW}[WARNING] Gradle metadata validator unavailable; leaving cache unchanged.${TEXT_CLEAR}"
    cat "$validator_error" || true
  fi
  if [ -z "$damaged" ]; then echo "No damaged Gradle immutable-workspace metadata detected."; fi
  # Keep a few intact metadata.bin files for byte-level comparison with the damaged ones.
  # `find | head` makes find die with SIGPIPE once head closes the pipe after 10 lines; under
  # `set -o pipefail` (GitLab's bash default) that non-zero would propagate as this block's exit
  # code and, being the shared default before_script, fail every build job. This is best-effort
  # diagnostics and must never fail the job, so guard the pipeline with `|| true` like the rest.
  find .gradle/caches -type f -name metadata.bin -size +32c 2>/dev/null | head -n 10 | while IFS= read -r f; do
    dest="$GRADLE_METADATA_EVIDENCE_DIR/good/$(dirname "$f")"
    mkdir -p "$dest" || true
    cp -p "$f" "$dest/metadata.bin" 2>/dev/null || true
  done || true
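The SIGPIPE concern described in the last comment of that script can be reproduced outside CI. The sketch below is an illustration only, not code from this PR: it uses `yes` in place of `find` because `yes` never stops writing on its own, so the failure is deterministic rather than dependent on how many files exist. The variable names are made up for the demo.

```shell
#!/usr/bin/env bash
set -o pipefail

# `head` exits after one line; `yes` then fails on its next write
# (normally killed by SIGPIPE), and pipefail reports that non-zero status.
unguarded_status=0
yes | head -n 1 > /dev/null || unguarded_status=$?

# Same pipeline, guarded the way the before_script guards its diagnostics.
guarded_status=0
{ yes | head -n 1 > /dev/null || true; } || guarded_status=$?

echo "unguarded: $unguarded_status, guarded: $guarded_status"
```

On a typical Linux runner the unguarded status should be 141 (128 + SIGPIPE), though the exact value can differ if the environment ignores SIGPIPE; the point is only that it is non-zero, while the guarded form is always 0.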
Contributor

suggestion: Rather than having one part script and one part Java (.gitlab/validate_gradle_metadata.sh "$GRADLE_VERSION" ...), I would pack as much as possible into the Java program and name it something like CorruptGitlabGradleCacheExtractionMitigator.
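For context on what the script half currently has to do, here is a minimal sketch of the contract between the wrapper and the shell loop: tab-separated `<size>\t<reason>\t<workspace>` records on stdout plus exit status 1 when anything is damaged. `fake_validator` is a hypothetical stand-in for .gitlab/validate_gradle_metadata.sh, and the `-` placeholder for a missing file's size is an assumption made for the demo.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the validator wrapper: one tab-separated record per
# damaged workspace, exit status 1 when at least one was reported.
fake_validator() {
  printf '%s\t%s\t%s\n' "0" "EOFException" "caches/ws-a"
  printf '%s\t%s\t%s\n' "-" "FileNotFoundException" "caches/ws b"
  return 1
}

report="$(mktemp)"
status=0
fake_validator > "$report" || status=$?   # capture the status without aborting

damaged_count=0
if [ "$status" -eq 1 ]; then
  # Split on tabs only, so a workspace path containing spaces stays intact.
  while IFS=$'\t' read -r size reason ws; do
    [ -n "$ws" ] || continue
    damaged_count=$((damaged_count + 1))
    echo "damaged: ${ws} (size=${size}; ${reason})"
  done < "$report"
elif [ "$status" -ne 0 ]; then
  echo "validator unavailable; leaving cache unchanged"
fi
echo "damaged workspaces: $damaged_count"
rm -f "$report"
```

One detail that makes this format fragile in shell: tab counts as IFS whitespace in bash, so an empty field would collapse and shift the remaining columns. Emitting a placeholder for every field, or doing the parsing inside the Java program, avoids that.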

Comment on lines +93 to +102
private static void loadMetadataReader() {
  try {
    var storeClass =
        Class.forName("org.gradle.internal.execution.history.impl.DefaultImmutableWorkspaceMetadataStore");
    metadataStore = storeClass.getDeclaredConstructor().newInstance();
    metadataLoadMethod = storeClass.getMethod("loadWorkspaceMetadata", java.io.File.class);
  } catch (ReflectiveOperationException | LinkageError e) {
    throw new IllegalStateException("could not load Gradle metadata reader: " + summarize(e));
  }
}
Contributor

note: That looks a bit brittle, but I don't see another way. It might be worth requesting a feature from the Gradle folks (even if it takes years).

Contributor Author

Yep, Gradle really should not crash on that; it should just log a warning and recover.
