From 3656f842789d25d75da41c6c029470052a573b54 Mon Sep 17 00:00:00 2001 From: Elijah Newren Date: Sat, 4 Dec 2021 05:35:52 +0000 Subject: name-rev: prefer shorter names over following merges MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit name-rev has a MERGE_TRAVERSAL_WEIGHT to say that traversing a second or later parent of a merge should be 65535 times more expensive than a first-parent traversal, as per ac076c29ae8d (name-rev: Fix non-shortest description, 2007-08-27). The point of this weight is to prefer names like v2.32.0~1471^2 over names like v2.32.0~43^2~15^2~11^2~20^2~31^2 which are two equally valid names in git.git for the same commit. Note that the first follows 1472 parent traversals compared to a mere 125 for the second. Weighting all traversals equally would clearly prefer the second name since it has fewer parent traversals, but humans aren't going to be traversing commits and they tend to have an easier time digesting names with fewer segments. The fact that the former only has two segments (~1471, ^2) makes it much simpler than the latter which has six segments (~43, ^2, ~15, etc.). Since name-rev is meant to "find symbolic names suitable for human digestion", we prefer fewer segments. However, the particular rule implemented in name-rev would actually prefer v2.33.0-rc0~11^2~1 over v2.33.0-rc0~20^2 because both have precisely one second parent traversal, and it gives the tie breaker to shortest number of total parent traversals. Fewer segments is more important for human consumption than number of hops, so we'd rather see the latter which has one fewer segment. Include the generation in is_better_name() and use a new effective_distance() calculation so that we prefer fewer segments in the printed name over fewer total parent traversals performed to get the answer. == Side-note on tie-breakers == When there are the same number of segments for two different names, we actually use the name of an ancestor commit as a tie-breaker as well. For example, for the commit cbdca289fb in the git.git repository, we prefer the name v2.33.0-rc0~112^2~1 over v2.33.0-rc0~57^2~5. This is because: * cbdca289fb is the parent of 25e65b6dd5, which implies the name for cbdca289fb should be the first parent of the preferred name for 25e65b6dd5 * 25e65b6dd5 could be named either v2.33.0-rc0~112^2 or v2.33.0-rc0~57^2~4, but the former is preferred over the latter due to fewer segments * combine the two previous facts, and the name we get for cbdca289fb is "v2.33.0-rc0~112^2~1" rather than "v2.33.0-rc0~57^2~5". Technically, we get this for free out of the implementation since we only keep track of one name for each commit as we walk history (and re-add parents to the queue if we find a better name for those parents), but the first bullet point above ensures users get results that feel more consistent. == Alternative Ideas and Meanings Discussed == One suggestion that came up during review was that shortest string-length might be easiest for users to consume. However, such a scheme would be rather computationally expensive (we'd have to track all names for each commit as we traversed the graph) and would additionally come with the possibly perplexing result that on a linear segment of history we could rapidly swap back and forth on names: MYTAG~3^2 would be preferred over MYTAG~9998 MYTAG~3^2~1 would NOT be preferred over MYTAG~9999 MYTAG~3^2~2 might be preferred over MYTAG~10000 Another item that came up was possible auxiliary semantic meanings for name-rev results either before or after this patch. The basic answer was that the previous implementation had no known useful auxiliary semantics, but that for many repositories (most in my experience), the new scheme does. In particular, the new name-rev output can often be used to answer the question, "How or when did this commit get merged?" Since that usefulness depends on how merges happen within the repository and thus isn't universally applicable, details are omitted here but you can see them at [1]. [1] https://lore.kernel.org/git/CABPp-BEeUM+3NLKDVdak90_UUeNghYCx=Dgir6=8ixvYmvyq3Q@mail.gmail.com/ Finally, it was noted that the algorithm could be improved by just explicitly tracking the number of segments and using both it and distance in the comparison, instead of giving a magic number that tries to blend the two (and which therefore might give suboptimal results in repositories with really huge numbers of commits that periodically merge older code). However, "[this patch] seems to give us a much better results than the current code, so let's take it and leave further futzing outside the scope." Signed-off-by: Elijah Newren Acked-by: Ævar Arnfjörð Bjarmason Acked-by: Johannes Schindelin Signed-off-by: Junio C Hamano --- builtin/name-rev.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/builtin/name-rev.c b/builtin/name-rev.c index b221d30014..27f60153a6 100644 --- a/builtin/name-rev.c +++ b/builtin/name-rev.c @@ -44,11 +44,20 @@ static struct rev_name *get_commit_rev_name(const struct commit *commit) return is_valid_rev_name(name) ? name : NULL; } +static int effective_distance(int distance, int generation) +{ + return distance + (generation > 0 ? MERGE_TRAVERSAL_WEIGHT : 0); +} + static int is_better_name(struct rev_name *name, timestamp_t taggerdate, + int generation, int distance, int from_tag) { + int name_distance = effective_distance(name->distance, name->generation); + int new_distance = effective_distance(distance, generation); + /* * When comparing names based on tags, prefer names * based on the older tag, even if it is farther away. @@ -56,7 +65,7 @@ static int is_better_name(struct rev_name *name, if (from_tag && name->from_tag) return (name->taggerdate > taggerdate || (name->taggerdate == taggerdate && - name->distance > distance)); + name_distance > new_distance)); /* * We know that at least one of them is a non-tag at this point. @@ -69,8 +78,8 @@ static int is_better_name(struct rev_name *name, * We are now looking at two non-tags. Tiebreak to favor * shorter hops. */ - if (name->distance != distance) - return name->distance > distance; + if (name_distance != new_distance) + return name_distance > new_distance; /* ... or tiebreak to favor older date */ if (name->taggerdate != taggerdate) @@ -88,7 +97,7 @@ static struct rev_name *create_or_update_name(struct commit *commit, struct rev_name *name = commit_rev_name_at(&rev_names, commit); if (is_valid_rev_name(name)) { - if (!is_better_name(name, taggerdate, distance, from_tag)) + if (!is_better_name(name, taggerdate, generation, distance, from_tag)) return NULL; /* -- cgit v1.2.3