Alexeyk/gitlab cache metadata recovery#11515
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master results
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.
# A partial/aborted GitLab cache extraction (the runner logs "FATAL: unexpected EOF" /
# "Failed to extract cache" then continues anyway) can leave a Gradle immutable-workspace
# (dependencies-accessors, groovy-dsl, kotlin-dsl, transforms) whose metadata.bin is either
# TRUNCATED (EOFException) or entirely MISSING (FileNotFoundException). Either way Gradle
# hard-fails during configuration with "Could not read workspace metadata" and does NOT
# self-heal. Capture evidence (good + damaged) for diagnosis, then use Gradle's own metadata
# reader to identify and drop only the damaged workspace dirs so Gradle regenerates them (only
# the entries that would fail anyway are removed). See gradle/gradle#28974.
- |
  GRADLE_METADATA_EVIDENCE_DIR="gradle-cache-metadata"
  mkdir -p "$GRADLE_METADATA_EVIDENCE_DIR"
  # Manifest of every metadata.bin with its byte size (damaged ones sort to the top at 0/near-0).
  find .gradle/caches -type f -name metadata.bin -printf '%10s %p\n' 2>/dev/null \
    | sort -n > "$GRADLE_METADATA_EVIDENCE_DIR/metadata-manifest.txt" || true
  damaged=""
  # ValidateGradleMetadata enumerates the immutable-workspace dirs, skips Gradle temporary
  # workspaces, and prints damaged ones as "<size>\t<reason>\t<workspace>" (exit 1 via wrapper).
  validator_output="$GRADLE_METADATA_EVIDENCE_DIR/validator-output.txt"
  validator_error="$GRADLE_METADATA_EVIDENCE_DIR/validator-error.txt"
  validator_status=0
  .gitlab/validate_gradle_metadata.sh "$GRADLE_VERSION" \
    > "$validator_output" 2> "$validator_error" || validator_status=$?
  if [ "$validator_status" -eq 1 ]; then
    while IFS=$'\t' read -r size reason ws; do
      [ -n "$ws" ] || continue
      [ -d "$ws" ] || continue
      meta="${ws}/metadata.bin"
      if [ -z "$damaged" ]; then
        echo -e "${TEXT_BOLD}${TEXT_YELLOW}[WARNING] Damaged Gradle metadata found, fixed:${TEXT_CLEAR}"
        damaged="yes"
      fi
      dest="$GRADLE_METADATA_EVIDENCE_DIR/corrupt/${ws}"
      mkdir -p "$dest" || true
      if [ -f "$meta" ]; then cp -p "$meta" "$dest/metadata.bin" 2>/dev/null || true; fi
      ls -la "$ws" > "$dest/dir-listing.txt" 2>/dev/null || true
      echo " - ${ws} (metadata.bin size=${size}; ${reason})"
      rm -rf "$ws" || true
    done < "$validator_output"
  elif [ "$validator_status" -ne 0 ]; then
    echo -e "${TEXT_BOLD}${TEXT_YELLOW}[WARNING] Gradle metadata validator unavailable; leaving cache unchanged.${TEXT_CLEAR}"
    cat "$validator_error" || true
  fi
  if [ -z "$damaged" ]; then echo "No damaged Gradle immutable-workspace metadata detected."; fi
  # Keep a few intact metadata.bin files for byte-level comparison with the damaged ones.
  # `find | head` makes find die with SIGPIPE once head closes the pipe after 10 lines; under
  # `set -o pipefail` (GitLab's bash default) that non-zero would propagate as this block's exit
  # code and, being the shared default before_script, fail every build job. This is best-effort
  # diagnostics and must never fail the job, so guard the pipeline with `|| true` like the rest.
  find .gradle/caches -type f -name metadata.bin -size +32c 2>/dev/null | head -n 10 | while IFS= read -r f; do
    dest="$GRADLE_METADATA_EVIDENCE_DIR/good/$(dirname "$f")"
    mkdir -p "$dest" || true
    cp -p "$f" "$dest/metadata.bin" 2>/dev/null || true
  done || true
suggestion: Rather than splitting this between a shell script and Java (.gitlab/validate_gradle_metadata.sh "$GRADLE_VERSION" ...), I would pack most of it into the Java program and name it something like CorruptGitlabGradleCacheExtractionMitigator.
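A rough sketch of what the shell side might look like if it moved into Java, assuming the validator keeps reporting damaged workspaces as "<size>\t<reason>\t<workspace>" lines. The class name DamagedWorkspaceCleaner and all of its methods are hypothetical and not part of this PR; evidence capture and logging are left out to keep it short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch only; names and structure are assumptions, not the PR's code. */
final class DamagedWorkspaceCleaner {

  /** Lists every metadata.bin under root, smallest first (the Java analogue of find | sort -n). */
  static List<Path> manifest(Path root) throws IOException {
    try (Stream<Path> files = Files.walk(root)) {
      return files
          .filter(p -> Files.isRegularFile(p) && "metadata.bin".equals(p.getFileName().toString()))
          .sorted(Comparator.comparingLong(DamagedWorkspaceCleaner::sizeOf))
          .collect(Collectors.toList());
    }
  }

  private static long sizeOf(Path p) {
    try {
      return Files.size(p);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Deletes dir and everything below it; reverse path order removes children before parents. */
  static void deleteRecursively(Path dir) throws IOException {
    List<Path> tree;
    try (Stream<Path> walk = Files.walk(dir)) {
      tree = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (Path p : tree) {
      Files.deleteIfExists(p);
    }
  }

  /** Parses "<size>\t<reason>\t<workspace>" lines and removes each workspace dir that exists. */
  static List<Path> dropDamaged(List<String> validatorLines) throws IOException {
    List<Path> removed = new ArrayList<>();
    for (String line : validatorLines) {
      String[] fields = line.split("\t", 3);
      if (fields.length < 3 || fields[2].isEmpty()) {
        continue; // malformed or empty line: skip it, as the shell loop does
      }
      Path workspace = Paths.get(fields[2]);
      if (!Files.isDirectory(workspace)) {
        continue;
      }
      deleteRecursively(workspace);
      removed.add(workspace);
    }
    return removed;
  }

  public static void main(String[] args) throws IOException {
    Path root = Paths.get(args.length > 0 ? args[0] : ".gradle/caches");
    if (!Files.isDirectory(root)) {
      System.out.println("No cache directory at " + root);
      return;
    }
    for (Path p : manifest(root)) {
      System.out.println(Files.size(p) + "\t" + p);
    }
  }
}
```

Keeping everything in one program would also remove the dependence on GNU find's -printf and on the pipefail guard, at the cost of a slightly larger Java entry point.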
private static void loadMetadataReader() {
  try {
    var storeClass =
        Class.forName("org.gradle.internal.execution.history.impl.DefaultImmutableWorkspaceMetadataStore");
    metadataStore = storeClass.getDeclaredConstructor().newInstance();
    metadataLoadMethod = storeClass.getMethod("loadWorkspaceMetadata", java.io.File.class);
  } catch (ReflectiveOperationException | LinkageError e) {
    throw new IllegalStateException("could not load Gradle metadata reader: " + summarize(e));
  }
}
note: That looks a bit brittle, but I don't see another way. It might be worth requesting a feature from the Gradle folks (even if it takes years).
Yep, Gradle actually should not crash on that, just log a warning and recover
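On the brittleness: one way to contain it might be to keep "the internal class could not be loaded" (validator unavailable, leave the cache alone) strictly separate from "the reader ran and rejected this workspace" (damaged, safe to drop). The sketch below illustrates that split. ReflectiveMetadataProbe and StandInStore are hypothetical names; StandInStore only imitates the two failure modes described in the script comment (EOFException, FileNotFoundException) and is not Gradle's reader, whose real exceptions may be wrapped differently.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

/** Stand-in for the internal Gradle store (an assumption for illustration only). */
final class StandInStore {
  public Object loadWorkspaceMetadata(File workspace) throws IOException {
    try (DataInputStream in =
        new DataInputStream(new FileInputStream(new File(workspace, "metadata.bin")))) {
      return in.readInt(); // EOFException if fewer than 4 bytes are present
    }
  }
}

/** Hypothetical sketch: loads a reader by class name and probes workspaces with it. */
final class ReflectiveMetadataProbe {
  private final Object store;
  private final Method load;

  /** Fails with IllegalStateException if the reader is unavailable (setup problem, not damage). */
  ReflectiveMetadataProbe(String storeClassName) {
    try {
      Class<?> storeClass = Class.forName(storeClassName);
      store = storeClass.getDeclaredConstructor().newInstance();
      load = storeClass.getMethod("loadWorkspaceMetadata", File.class);
    } catch (ReflectiveOperationException | LinkageError e) {
      throw new IllegalStateException("could not load metadata reader " + storeClassName, e);
    }
  }

  /** Returns null if the reader accepted the workspace, otherwise a short reason string. */
  String probe(File workspace) {
    try {
      load.invoke(store, workspace);
      return null;
    } catch (InvocationTargetException e) {
      // The reader itself threw: report the innermost cause as the damage reason.
      Throwable root = e.getCause();
      while (root.getCause() != null && root.getCause() != root) {
        root = root.getCause();
      }
      return root.getClass().getSimpleName() + ": " + root.getMessage();
    } catch (IllegalAccessException e) {
      throw new IllegalStateException("metadata reader is not accessible", e);
    }
  }

  public static void main(String[] args) {
    ReflectiveMetadataProbe probe = new ReflectiveMetadataProbe(StandInStore.class.getName());
    for (String dir : args) {
      String reason = probe.probe(new File(dir));
      System.out.println(dir + "\t" + (reason == null ? "ok" : reason));
    }
  }
}
```

Passing the original exception as the cause of the IllegalStateException keeps the full stack trace available in the job log, which may help when a Gradle upgrade renames or moves the internal class.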
What Does This Do
A fix for when the cache restore fails on GitLab:
FATAL: unexpected EOF
Motivation
Since May 22, quite a lot of jobs have been failing randomly on GitLab with the error
FATAL: unexpected EOF
After some investigation I found a way to work around the issue by clearing the corrupted metadata.bin files.
That allows the cache to be restored as part of the build.
Additional Notes
The real issue should be fixed in GitLab. This PR is a way to unblock merging to master.