Skip to content

fetch: pass transport to post-fetch connectivity check#2123

Open
spkrka wants to merge 1 commit into
gitgitgadget:masterfrom
spkrka:fetch-transport-fix
Open

fetch: pass transport to post-fetch connectivity check#2123
spkrka wants to merge 1 commit into
gitgitgadget:masterfrom
spkrka:fetch-transport-fix

Conversation

@spkrka
Copy link
Copy Markdown

@spkrka spkrka commented May 24, 2026

We're working on reducing git fetch times on a large monorepo (2.4M
commits, 374K files, 10.9K local refs). Profiling showed the post-fetch
connectivity check (rev-list --objects --stdin --not --all) dominating
wall time when there are new objects.

While investigating, I noticed that check_connected() already has a
fast path for self-contained packs — it uses find_pack_entry_one() to
skip refs whose objects are in the new pack. builtin/clone.c passes
the transport to enable this, but store_updated_refs() in
builtin/fetch.c does not, making the optimization dead code for
fetches.

The fix is a three-line change to thread the transport through.

cc: Kristofer Karlsson krka@spotify.com

When fetching with a transport that sets `self_contained_and_connected`
(as index-pack does for self-contained packs), check_connected() can
use find_pack_entry_one() to skip connectivity verification for refs
whose objects exist in the new pack. This avoids sending those OIDs to
the rev-list child process.

However, store_updated_refs() never passed the transport to
check_connected(), so opt.transport was always NULL and this
optimization was dead code for post-fetch connectivity checks.

Thread the transport parameter through store_updated_refs() and set
opt.transport so that check_connected() can take advantage of
self-contained packs.

On a large repository (2.4M commits, 374K files, 10.9K local refs),
fetching 200 new commits:

  Before: rev-list connectivity check  22s,  total fetch  36s
  After:  rev-list connectivity check   5s,  total fetch  14s

The remaining 5s is spent verifying refs not contained in the new pack.

Signed-off-by: Kristofer Karlsson <krka@spotify.com>
@spkrka spkrka marked this pull request as ready for review May 24, 2026 12:27
@spkrka
Copy link
Copy Markdown
Author

spkrka commented May 24, 2026

/submit

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

Submitted as pull.2123.git.1779625693328.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-2123/spkrka/fetch-transport-fix-v1

To fetch this version to local tag pr-2123/spkrka/fetch-transport-fix-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-2123/spkrka/fetch-transport-fix-v1

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

Junio C Hamano wrote on the Git mailing list (how to reply to this email):

"Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> From: Kristofer Karlsson <krka@spotify.com>
>
> When fetching with a transport that sets `self_contained_and_connected`
> (as index-pack does for self-contained packs), check_connected() can
> use find_pack_entry_one() to skip connectivity verification for refs
> whose objects exist in the new pack. This avoids sending those OIDs to
> the rev-list child process.
>
> However, store_updated_refs() never passed the transport to
> check_connected(), so opt.transport was always NULL and this
> optimization was dead code for post-fetch connectivity checks.
>
> Thread the transport parameter through store_updated_refs() and set
> opt.transport so that check_connected() can take advantage of
> self-contained packs.
>
> On a large repository (2.4M commits, 374K files, 10.9K local refs),
> fetching 200 new commits:
>
>   Before: rev-list connectivity check  22s,  total fetch  36s
>   After:  rev-list connectivity check   5s,  total fetch  14s
>
> The remaining 5s is spent verifying refs not contained in the new pack.

Impressive.

The check_connected() function itself is a battle tested helper
function, with the optimization that originates in c6807a40 (clone:
open a shortcut for connectivity check, 2013-05-26), and then
polished in 26b974b3 (check_connected(): delay opening new_pack,
2026-03-05), allowing available "transport" to be taken into account
does make very good sense.

The other call to check_connected() that appear in builtin/fetch.c
does not pass opt.transport, either, but this one checks before we
even fetch any packs over any transport, so a tweak similar to this
patch would not help that code path, I guess.  In fact, many calls
to check_connected() elsewhere use opt that is often local to the
scope, that do not have transport at all.  I wonder if there are
some of them that benefit from a similar tweak?

Thanks.


>
> Signed-off-by: Kristofer Karlsson <krka@spotify.com>
> ---
>     fetch: pass transport to post-fetch connectivity check
>     
>     We're working on reducing git fetch times on a large monorepo (2.4M
>     commits, 374K files, 10.9K local refs). Profiling showed the post-fetch
>     connectivity check (rev-list --objects --stdin --not --all) dominating
>     wall time when there are new objects.
>     
>     While investigating, I noticed that check_connected() already has a fast
>     path for self-contained packs — it uses find_pack_entry_one() to skip
>     refs whose objects are in the new pack. builtin/clone.c passes the
>     transport to enable this, but store_updated_refs() in builtin/fetch.c
>     does not, making the optimization dead code for fetches.
>     
>     The fix is a three-line change to thread the transport through.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2123%2Fspkrka%2Ffetch-transport-fix-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2123/spkrka/fetch-transport-fix-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/2123
>
>  builtin/fetch.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index a22c319467..647fd1c30c 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -1213,6 +1213,7 @@ N_("it took %.2f seconds to check forced updates; you can use\n"
>     "to avoid this check\n");
>  
>  static int store_updated_refs(struct display_state *display_state,
> +			      struct transport *transport,
>  			      int connectivity_checked,
>  			      struct ref_transaction *transaction, struct ref *ref_map,
>  			      struct fetch_head *fetch_head,
> @@ -1228,6 +1229,7 @@ static int store_updated_refs(struct display_state *display_state,
>  	if (!connectivity_checked) {
>  		struct check_connected_options opt = CHECK_CONNECTED_INIT;
>  
> +		opt.transport = transport;
>  		opt.exclude_hidden_refs_section = "fetch";
>  		rm = ref_map;
>  		if (check_connected(iterate_ref_map, &rm, &opt)) {
> @@ -1432,7 +1434,7 @@ static int fetch_and_consume_refs(struct display_state *display_state,
>  	}
>  
>  	trace2_region_enter("fetch", "consume_refs", the_repository);
> -	ret = store_updated_refs(display_state, connectivity_checked,
> +	ret = store_updated_refs(display_state, transport, connectivity_checked,
>  				 transaction, ref_map, fetch_head, config,
>  				 display_array);
>  	trace2_region_leave("fetch", "consume_refs", the_repository);
>
> base-commit: 6a4418c36d6bad69a599044b3cf49dcbd049cb45

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

Kristofer Karlsson wrote on the Git mailing list (how to reply to this email):

Good catch! After finding this case, I looked into the other related
call sites but found that they are already correct as-is:
- builtin/clone.c - already passes opt.transport (this is where I
copied it from)
- builtin/receive-pack.c (3 calls) - no transport object available to propagate
- fetch-pack.c - only used for the --deepen path, which sets
connectivity_checked when it passes,
  so the store_updated_refs() check is skipped entirely and transport
is not needed
- bundle.c - no need for transport

I am not 100% sure, but I suppose it's always possible to follow up
with more reuse of this later.

- Kristofer

On Sun, 24 May 2026 at 14:53, Junio C Hamano <gitster@pobox.com> wrote:
>
> "Kristofer Karlsson via GitGitGadget" <gitgitgadget@gmail.com>
> writes:
>
> > From: Kristofer Karlsson <krka@spotify.com>
> >
> > When fetching with a transport that sets `self_contained_and_connected`
> > (as index-pack does for self-contained packs), check_connected() can
> > use find_pack_entry_one() to skip connectivity verification for refs
> > whose objects exist in the new pack. This avoids sending those OIDs to
> > the rev-list child process.
> >
> > However, store_updated_refs() never passed the transport to
> > check_connected(), so opt.transport was always NULL and this
> > optimization was dead code for post-fetch connectivity checks.
> >
> > Thread the transport parameter through store_updated_refs() and set
> > opt.transport so that check_connected() can take advantage of
> > self-contained packs.
> >
> > On a large repository (2.4M commits, 374K files, 10.9K local refs),
> > fetching 200 new commits:
> >
> >   Before: rev-list connectivity check  22s,  total fetch  36s
> >   After:  rev-list connectivity check   5s,  total fetch  14s
> >
> > The remaining 5s is spent verifying refs not contained in the new pack.
>
> Impressive.
>
> The check_connected() function itself is a battle tested helper
> function, with the optimization that originates in c6807a40 (clone:
> open a shortcut for connectivity check, 2013-05-26), and then
> polished in 26b974b3 (check_connected(): delay opening new_pack,
> 2026-03-05), allowing available "transport" to be taken into account
> does make very good sense.
>
> The other call to check_connected() that appear in builtin/fetch.c
> does not pass opt.transport, either, but this one checks before we
> even fetch any packs over any transport, so a tweak similar to this
> patch would not help that code path, I guess.  In fact, many calls
> to check_connected() elsewhere use opt that is often local to the
> scope, that do not have transport at all.  I wonder if there are
> some of them that benefit from a similar tweak?
>
> Thanks.
>
>
> >
> > Signed-off-by: Kristofer Karlsson <krka@spotify.com>
> > ---
> >     fetch: pass transport to post-fetch connectivity check
> >
> >     We're working on reducing git fetch times on a large monorepo (2.4M
> >     commits, 374K files, 10.9K local refs). Profiling showed the post-fetch
> >     connectivity check (rev-list --objects --stdin --not --all) dominating
> >     wall time when there are new objects.
> >
> >     While investigating, I noticed that check_connected() already has a fast
> >     path for self-contained packs — it uses find_pack_entry_one() to skip
> >     refs whose objects are in the new pack. builtin/clone.c passes the
> >     transport to enable this, but store_updated_refs() in builtin/fetch.c
> >     does not, making the optimization dead code for fetches.
> >
> >     The fix is a three-line change to thread the transport through.
> >
> > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-2123%2Fspkrka%2Ffetch-transport-fix-v1
> > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-2123/spkrka/fetch-transport-fix-v1
> > Pull-Request: https://github.com/gitgitgadget/git/pull/2123
> >
> >  builtin/fetch.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/builtin/fetch.c b/builtin/fetch.c
> > index a22c319467..647fd1c30c 100644
> > --- a/builtin/fetch.c
> > +++ b/builtin/fetch.c
> > @@ -1213,6 +1213,7 @@ N_("it took %.2f seconds to check forced updates; you can use\n"
> >     "to avoid this check\n");
> >
> >  static int store_updated_refs(struct display_state *display_state,
> > +                           struct transport *transport,
> >                             int connectivity_checked,
> >                             struct ref_transaction *transaction, struct ref *ref_map,
> >                             struct fetch_head *fetch_head,
> > @@ -1228,6 +1229,7 @@ static int store_updated_refs(struct display_state *display_state,
> >       if (!connectivity_checked) {
> >               struct check_connected_options opt = CHECK_CONNECTED_INIT;
> >
> > +             opt.transport = transport;
> >               opt.exclude_hidden_refs_section = "fetch";
> >               rm = ref_map;
> >               if (check_connected(iterate_ref_map, &rm, &opt)) {
> > @@ -1432,7 +1434,7 @@ static int fetch_and_consume_refs(struct display_state *display_state,
> >       }
> >
> >       trace2_region_enter("fetch", "consume_refs", the_repository);
> > -     ret = store_updated_refs(display_state, connectivity_checked,
> > +     ret = store_updated_refs(display_state, transport, connectivity_checked,
> >                                transaction, ref_map, fetch_head, config,
> >                                display_array);
> >       trace2_region_leave("fetch", "consume_refs", the_repository);
> >
> > base-commit: 6a4418c36d6bad69a599044b3cf49dcbd049cb45

@gitgitgadget
Copy link
Copy Markdown

gitgitgadget Bot commented May 24, 2026

User Kristofer Karlsson <krka@spotify.com> has been added to the cc: list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant