
Conversation

@alpe alpe commented Nov 27, 2025

Overview

Resolves #2609

The basic idea is to store an additional DAHeightHint field within the p2p store.

As SignedHeader and Data are used in other places too, I added a DAHeightHintContainer type to wrap the tuple for the store only.
The DA height hint is added by the da_submitter and read in the syncer to fetch the missing DA header/data for the most recent block as required.

Please note: this is a breaking change to the p2p network and store


claude bot commented Nov 27, 2025

Claude finished @alpe's task in 4m 5s


Comprehensive Code Review

I've completed a thorough review of the DA hints implementation. This is a well-designed feature that adds DA height hints to P2P messages to enable efficient targeted retrieval of DA data. Here are my findings:


Architecture & Design ✅

Strengths:

  • Binary compatibility approach is excellent (types/p2p_envelope.go:92-153): Using optional proto fields (optional uint64 da_height_hint) enables backward compatibility with existing nodes that don't understand hints. Tests confirm this works both ways (p2p_envelope_test.go:92-153).
  • Clean separation of concerns: The P2PEnvelope[H] generic wrapper keeps DA hints isolated to p2p layer without polluting core types.
  • Async DA retriever pattern (block/internal/syncing/async_da_retriever.go): Worker pool with deduplication is a solid design for non-blocking DA fetches.

Concerns:

  1. Proto mirror types create maintenance burden (types/p2p_envelope.go:86-153)

    • MarshalBinary/UnmarshalBinary manually construct pb.P2PData/pb.P2PSignedHeader from existing proto types
    • Any proto schema change requires updating 4 places (proto file, marshal, unmarshal, tests)
    • Consider: Can the P2PEnvelope types directly use the proto definitions instead of wrapping existing types?
  2. Type switch limitations (types/p2p_envelope.go:89, 116)

    • Only supports *Data and *SignedHeader - will panic on other types
    • The generic H header.Header[H] signature promises broader support than delivered
    • Recommendation: Either restrict generic parameter to known types, or add runtime type registry
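The first recommendation above — restricting the generic parameter — could be sketched with a type-union constraint, so the compiler rejects unsupported payloads and the marshal-time type switch can no longer panic. The types and names below are illustrative stand-ins, not the PR's actual definitions:

```go
package main

import "fmt"

// Stand-ins for the real *types.Data and *types.SignedHeader.
type Data struct{ Txs int }
type SignedHeader struct{ Height uint64 }

// p2pPayload is a hypothetical union constraint: only the two supported
// payload types can instantiate the envelope, making the type switch
// exhaustive at compile time.
type p2pPayload interface {
	*Data | *SignedHeader
}

type P2PEnvelope[H p2pPayload] struct {
	Payload      H
	DAHeightHint uint64
}

func describe[H p2pPayload](e P2PEnvelope[H]) string {
	switch any(e.Payload).(type) {
	case *Data:
		return "data"
	case *SignedHeader:
		return "header"
	}
	return "unreachable" // cannot happen: the constraint is exhaustive
}

func main() {
	fmt.Println(describe(P2PEnvelope[*Data]{Payload: &Data{}}))
	fmt.Println(describe(P2PEnvelope[*SignedHeader]{Payload: &SignedHeader{}}))
}
```

The trade-off is that the generic no longer satisfies a broad `header.Header[H]` bound, so this only works if the envelope really is meant for exactly these two types.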

Concurrency & Goroutine Safety ⚠️

Critical Issue - Resource Leak:

Problem: AsyncDARetriever.Stop() never gets called (block/internal/syncing/async_da_retriever.go:50-57)

// In syncer.go:193-194
s.asyncDARetriever = NewAsyncDARetriever(s.daRetriever, s.heightInCh, s.logger)
s.asyncDARetriever.Start(s.ctx)

// But in syncer.go:222-229, Stop() method:
func (s *Syncer) Stop() error {
    if s.cancel != nil {
        s.cancel()  // This cancels context
    }
    s.cancelP2PWait(0)
    s.wg.Wait()  // Waits for syncer goroutines
    // ❌ Missing: s.asyncDARetriever.Stop()
    return nil
}

Impact:

  • 5 worker goroutines leak on shutdown
  • workCh channel never closes, workers block forever on <-r.workCh
  • Context cancellation helps (workers will exit on ctx.Done), but AsyncDARetriever.wg.Wait() never completes
  • Any code calling asyncDARetriever.Stop() would deadlock waiting for workers that are blocked

Fix Required:

func (s *Syncer) Stop() error {
    if s.cancel != nil {
        s.cancel()
    }
    s.asyncDARetriever.Stop()  // Add this line
    s.cancelP2PWait(0)
    s.wg.Wait()
    return nil
}

Other Concurrency Observations:

Good patterns:

  • P2PHandler.processedHeight uses atomic CAS loop correctly (p2p_handler.go:62-72)
  • AsyncDARetriever.inFlight map properly protected by mutex
  • Request deduplication in RequestRetrieval prevents duplicate work (async_da_retriever.go:61-76)

⚠️ Potential issues:

  • AsyncDARetriever.RequestRetrieval drops requests when buffer full (async_da_retriever.go:74) - this is logged but may be surprising behavior. Consider if this is acceptable or if backpressure should propagate.
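One low-cost alternative to silently dropping on a full buffer is returning the outcome to the caller, who can then decide whether to back off. A sketch (the function and channel are illustrative, not the PR's API):

```go
package main

import "fmt"

// requestRetrieval attempts a non-blocking enqueue and reports whether the
// request was accepted, making the drop visible to the caller.
func requestRetrieval(workCh chan uint64, height uint64) bool {
	select {
	case workCh <- height:
		return true
	default:
		return false // buffer full: caller can retry, log, or apply backpressure
	}
}

func main() {
	workCh := make(chan uint64, 1)
	fmt.Println(requestRetrieval(workCh, 10)) // true: slot available
	fmt.Println(requestRetrieval(workCh, 11)) // false: buffer full
}
```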

Error Handling & Resilience ⚠️

Issues:

  1. Silent failures in async retrieval (async_da_retriever.go:98-102)

    events, err := r.retriever.RetrieveFromDA(r.ctx, height)
    if err != nil {
        r.logger.Debug().Err(err).Uint64("height", height).Msg("async DA retrieval failed")
        return  // ❌ Request dropped silently
    }
    • Failed retrievals are logged at DEBUG level and discarded
    • No retry mechanism
    • No visibility to caller that retrieval failed
    • Recommendation: Consider retry with exponential backoff, or at minimum log at WARN/ERROR level
  2. Ignored errors in DA hint appending (da_submitter.go:201-204)

    if err := s.headerDAHintAppender.AppendDAHint(ctx, res.Height, hashes...); err != nil {
        s.logger.Error().Err(err).Msg("failed to append da height hint in header p2p store")
        // ignoring error here, since we don't want to block the block submission'
    }
    • Comment has typo ("block the block submission'") and unclear reasoning
    • If hints don't persist, the feature degrades silently
    • Recommendation: Clarify if this is acceptable graceful degradation or if metrics/alerts should track this
  3. Partial failure handling (sync_service.go:184-198)

    • AppendDAHint silently skips missing entries with header.ErrNotFound
    • Could lead to incomplete hint storage
    • Consider logging when entries are skipped

Testing Coverage 📊

Strengths:

  • Binary compatibility tests are excellent (p2p_envelope_test.go:92-153)
  • Async DA retriever has good unit tests covering deduplication and worker pool limits
  • Integration tests cover the full flow (sync_service_test.go:200-252)

Gaps:

  1. No test for asyncDARetriever.Stop() cleanup

    • Tests in async_da_retriever_test.go call defer asyncRetriever.Stop() but don't verify workers actually stopped
    • Add test that verifies Stop() completes within reasonable timeout
  2. Error path coverage missing

    • No test for DA retrieval failures in async retriever
    • No test for AppendDAHint failures
    • No test for malformed proto messages with invalid hints
  3. Race condition testing

    • Consider adding tests with -race flag for concurrent hint updates
    • Test concurrent RequestRetrieval calls for same height
  4. Migration testing

    • Old nodes reading new messages: covered by unit tests
    • New nodes handling old messages without hints: covered by unit tests
    • ✅ Binary compatibility tests cover these scenarios well

Security Considerations 🔒

Good:

  • DA height hints are advisory only - always validated against actual DA data
  • No trust in p2p-provided hints (hints just optimize retrieval path)
  • Protobuf schema changes are backward compatible

Potential concerns:

  • DoS via hint spam: Malicious node could send many blocks with invalid DA hints, causing wasteful DA queries

    • Mitigated by: async retriever has bounded worker pool (5 workers) and buffered channel (100)
    • Consider: Add rate limiting or reputation tracking for peers sending bad hints
  • Unbounded memory in inFlight map: If DA queries hang, inFlight map could grow without bound

    • Current: Map entries only removed when query completes
    • Recommendation: Add timeout cleanup for stale entries

Performance & Resource Efficiency ⚠️

Good:

  • Worker pool limits concurrent DA queries (5 workers)
  • Buffered channels prevent blocking (workCh: 100, heightInCh: 100)
  • Deduplication prevents redundant DA fetches

Concerns:

  1. Magic numbers (async_da_retriever.go:33, 43)

    workCh:    make(chan uint64, 100), // Buffer size 100
    // Start 5 workers
    for i := 0; i < 5; i++ {
    • Buffer size 100 and 5 workers are hardcoded
    • No justification or configuration option
    • Recommendation: Make these configurable or document sizing rationale
  2. Channel sends can block (async_da_retriever.go:105-109)

    for _, event := range events {
        select {
        case r.resultCh <- event:  // Can block if downstream is slow
        case <-r.ctx.Done():
            return
        }
    }
    • Workers can block if heightInCh consumer is slow
    • Consider adding timeout or non-blocking send with warning
  3. P2P store writes on DA sync path (syncer.go:584-599)

    • Every DA-sourced block triggers p2p store writes
    • Uses errgroup but errors only logged, not handled
    • Could create backpressure during bulk sync
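The magic numbers flagged in concern 1 above could be lifted into a config struct with the current values as defaults, so sizing stays tunable without touching call sites. Field names and defaults here are illustrative:

```go
package main

import "fmt"

// RetrieverConfig replaces the hardcoded pool sizes with tunable fields.
type RetrieverConfig struct {
	Workers    int // concurrent DA fetch workers (was hardcoded 5)
	BufferSize int // work channel capacity (was hardcoded 100)
}

// withDefaults fills zero values with the previously hardcoded numbers.
func (c RetrieverConfig) withDefaults() RetrieverConfig {
	if c.Workers <= 0 {
		c.Workers = 5
	}
	if c.BufferSize <= 0 {
		c.BufferSize = 100
	}
	return c
}

// newWorkChannel shows the config flowing into construction instead of a
// magic number at the make() call site.
func newWorkChannel(cfg RetrieverConfig) chan uint64 {
	return make(chan uint64, cfg.withDefaults().BufferSize)
}

func main() {
	cfg := RetrieverConfig{Workers: 8}.withDefaults()
	fmt.Println(cfg.Workers, cfg.BufferSize)          // 8 100: override plus default
	fmt.Println(cap(newWorkChannel(RetrieverConfig{}))) // 100: full defaults
}
```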

Code Quality & Maintainability 🔧

Good:

  • Clear documentation and comments
  • Consistent naming conventions
  • Good use of generics for type safety

Issues:

  1. TODO comment should be removed (syncer.go:584)

    if event.Source == common.SourceDA { // TODO(@julienrbrt): To be reverted once DA Hints are merged (https://github.com/evstack/ev-node/pull/2891)
  2. Test helper duplication

    • TestDAHintStorageHeader and TestDAHintStorageData have near-identical setup code
    • Consider extracting common test fixtures
  3. Comment typos:

    • "block the block submission'" (da_submitter.go:203)
    • Extra apostrophe should be removed

Migration Strategy 🔄

Current approach (per PR comments):

  • New P2PSignedHeader and P2PData proto types with optional da_height_hint field
  • Binary compatible with old SignedHeader and Data types
  • Old nodes: ignore unknown fields (proto3 default)
  • New nodes: treat missing hints as zero value

Strengths:

  • No network flag day required
  • Gradual rollout possible
  • In-flight messages compatible

Remaining concerns:

  • P2P store migration not addressed: Existing stores have SignedHeader/Data, new code expects P2PSignedHeader/P2PData
    • What happens when node with existing store upgrades?
    • Are old entries readable by new code? (Tests suggest yes via binary compatibility)
    • Recommendation: Add migration guide or automated migration in store initialization

Summary & Recommendations

Critical (Must Fix):

  1. Call asyncDARetriever.Stop() in syncer shutdown (block/internal/syncing/syncer.go:222)
  2. ⚠️ Remove outdated TODO comment (syncer.go:584)

High Priority (Should Fix):

  1. ⚠️ Improve async DA retrieval error handling - retry logic or escalate log level
  2. ⚠️ Make worker pool size and buffer sizes configurable
  3. ⚠️ Add timeout cleanup for stale inFlight map entries
  4. ⚠️ Add tests for Stop() cleanup and error paths

Medium Priority (Consider):

  1. 💭 Document or reconsider type switch limitations in P2PEnvelope
  2. 💭 Add rate limiting for invalid DA hints from peers
  3. 💭 Consider backpressure propagation when request queue full
  4. 💭 Document P2P store migration strategy

Low Priority (Nice to Have):

  1. Fix typo in da_submitter.go:203 comment
  2. Extract common test fixtures

Overall Assessment

This is a well-designed feature that adds valuable optimization to DA data retrieval. The binary compatibility approach is excellent and the async retrieval pattern is solid. However, the critical resource leak from missing Stop() call must be fixed before merge.

With the critical issue fixed and error handling improved, this PR will be production-ready. The architecture is sound and the implementation demonstrates good Go practices.

Recommendation: Request changes for critical issues, then approve after fixes.


Reviewed Files

  • types/p2p_envelope.go
  • types/p2p_envelope_test.go
  • block/internal/syncing/async_da_retriever.go
  • block/internal/syncing/async_da_retriever_test.go
  • block/internal/syncing/syncer.go
  • block/internal/syncing/p2p_handler.go
  • block/internal/submitting/da_submitter.go
  • pkg/sync/sync_service.go
  • pkg/sync/sync_service_test.go
  • proto/evnode/v1/evnode.proto
  • block/internal/common/event.go
  • block/internal/common/expected_interfaces.go


github-actions bot commented Nov 27, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Dec 22, 2025, 11:25 AM

@alpe alpe changed the title Alex/hints feat: DA hints in p2p Nov 28, 2025

codecov bot commented Nov 28, 2025

Codecov Report

❌ Patch coverage is 70.04405% with 68 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.54%. Comparing base (d386df5) to head (993c2b1).

Files with missing lines                       Patch %   Lines
types/p2p_envelope.go                          53.42%    29 Missing and 5 partials ⚠️
block/internal/syncing/syncer.go               58.62%    12 Missing ⚠️
pkg/sync/sync_service.go                       69.44%    7 Missing and 4 partials ⚠️
block/internal/syncing/async_da_retriever.go   86.53%    6 Missing and 1 partial ⚠️
block/internal/submitting/da_submitter.go      80.95%    2 Missing and 2 partials ⚠️
Additional details and impacted files
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2891      +/-   ##
==========================================
+ Coverage   59.18%   59.54%   +0.36%     
==========================================
  Files          90       92       +2     
  Lines        8627     8813     +186     
==========================================
+ Hits         5106     5248     +142     
- Misses       2940     2975      +35     
- Partials      581      590       +9     
Flag       Coverage Δ
combined   59.54% <70.04%> (+0.36%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


alpe added 3 commits November 28, 2025 17:20
* main:
  refactor: omit unnecessary reassignment (#2892)
  build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  fix: ensure consistent network ID usage in P2P subscriber (#2884)
cache.SetHeaderDAIncluded(headerHash.String(), res.Height, header.Height())
hashes[i] = headerHash
}
if err := s.headerDAHintAppender.AppendDAHint(ctx, res.Height, hashes...); err != nil {

This is where the DA height is passed to the sync service to update the p2p store

Msg("P2P event with DA height hint, triggering targeted DA retrieval")

// Trigger targeted DA retrieval in background via worker pool
s.asyncDARetriever.RequestRetrieval(daHeightHint)

This is where the "fetch from DA" is triggered for the current block event height

type SignedHeaderWithDAHint = DAHeightHintContainer[*types.SignedHeader]
type DataWithDAHint = DAHeightHintContainer[*types.Data]

type DAHeightHintContainer[H header.Header[H]] struct {
@alpe alpe Dec 1, 2025


This is a data container to persist the DA hint together with the block header or data.
types.SignedHeader and types.Data are used all over the place, so I did not modify them but introduced this type for the p2p store and transfer only.

It may make sense to make this a proto type. WDYT?

return nil
}

func (s *SyncService[V]) AppendDAHint(ctx context.Context, daHeight uint64, hashes ...types.Hash) error {

Stores the DA height hints

@alpe alpe marked this pull request as ready for review December 1, 2025 09:32
@tac0turtle

if da hint is not in the proto how do other nodes get knowledge of the hint?

also how would an existing network handle using this feature? it's breaking, so is it safe to upgrade?

"github.com/evstack/ev-node/block/internal/cache"
"github.com/evstack/ev-node/block/internal/common"
"github.com/evstack/ev-node/block/internal/da"
coreda "github.com/evstack/ev-node/core/da"

nit: gci linter

@julienrbrt julienrbrt left a comment


Nice! It really makes sense.

I share the same concern as @tac0turtle however about the upgrade strategy given it is p2p breaking.

julienrbrt previously approved these changes Dec 2, 2025

alpe commented Dec 2, 2025

if da hint is not in the proto how do other nodes get knowledge of the hint?

The sync_service wraps the header/data payload in a DAHeightHintContainer object that is passed upstream to the p2p layer. When the DA height is known, the store is updated.

also how would an existing network handle using this feature? its breaking so is it safe to upgrade?

It is a breaking change. Instead of signed header or data types, the p2p network exchanges DAHeightHintContainer. This would be incompatible. Also the existing p2p stores would need migration to work.


julienrbrt commented Dec 4, 2025

Could we broadcast both until every network is updated? Then in a final step we can discard the previous one.


alpe commented Dec 5, 2025

fyi: This PR is missing a migration strategy for the p2p store (and ideally the network)

* main:
  refactor(sequencers): persist prepended batch (#2907)
  feat(evm): add force inclusion command (#2888)
  feat: DA client, remove interface part 1: copy subset of types needed for the client using blob rpc. (#2905)
  feat: forced inclusion (#2797)
  fix: fix and cleanup metrics (sequencers + block) (#2904)
  build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  refactor(block): centralize timeout in client (#2903)
  build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  chore: bump default timeout (#2902)
  fix: revert default db (#2897)
  refactor: remove obsolete // +build tag (#2899)
  fix:da visualiser namespace  (#2895)
alpe added 3 commits December 15, 2025 10:52
* main:
  chore: execute goimports to format the code (#2924)
  refactor(block)!: remove GetLastState from components (#2923)
  feat(syncing): add grace period for missing force txs inclusion (#2915)
  chore: minor improvement for docs (#2918)
  feat: DA Client remove interface part 2,  add client for celestia blob api   (#2909)
  chore: update rust deps (#2917)
  feat(sequencers/based): add based batch time (#2911)
  build(deps): Bump golangci/golangci-lint-action from 9.1.0 to 9.2.0 (#2914)
  refactor(sequencers): implement batch position persistance (#2908)
github-merge-queue bot pushed a commit that referenced this pull request Dec 15, 2025

## Overview

Temporary fix until #2891.
After #2891 the verification for p2p blocks will be done in the
background.

ref: #2906


alpe commented Dec 15, 2025

I have added 2 new types for the p2p store that are binary compatible with types.Data and SignedHeader. With this, we should be able to roll this out without breaking the in-flight p2p data and store.

alpe added 3 commits December 15, 2025 14:49
* main:
  fix(syncing): skip forced txs checks for p2p blocks (#2922)
  build(deps): Bump the all-go group across 5 directories with 5 updates (#2919)
  chore: loosen syncer state check (#2927)
@alpe alpe requested a review from julienrbrt December 15, 2025 15:00
julienrbrt previously approved these changes Dec 15, 2025
@julienrbrt julienrbrt left a comment


lgtm! I can see how useful the async retriever will be for force inclusion verification as well. We should have @auricom verify if p2p still works with Eden.


This is going to be really useful for force inclusion checks as well.

* main:
  build(deps): Bump actions/cache from 4 to 5 (#2934)
  build(deps): Bump actions/download-artifact from 6 to 7 (#2933)
  build(deps): Bump actions/upload-artifact from 5 to 6 (#2932)
  feat: DA Client remove interface part 3, replace types with new code (#2910)
  DA Client remove interface: Part 2.5, create e2e test to validate that a blob is posted in DA layer. (#2920)
julienrbrt previously approved these changes Dec 16, 2025
alpe added 3 commits December 19, 2025 17:00
* main:
  feat: use DA timestamp (#2939)
  chore: improve code comments clarity (#2943)
  build(deps): bump libp2p (#2937)
(cherry picked from commit ad3e21b)
julienrbrt previously approved these changes Dec 19, 2025
* main:
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)


Development

Successfully merging this pull request may close these issues.

sync: P2P should provide da inclusion hints

4 participants