This post was originally written on Codeforces; relevant discussion can be found here.

This post is not about the binary GCD algorithm, but rather about exploring theoretical guarantees for an optimization on top of the standard Euclidean algorithm. The STL version is faster than both of the algorithms below because it uses the binary GCD algorithm, but these algorithms are still interesting from a theoretical standpoint.

In particular, my conjecture is that the second algorithm never takes more iterations than the first one. If this is true, it'd be a pretty surprising property, given how hard it is to bound Euclidean algorithm runtimes, so it would be really cool if someone could prove it.

UPD: thanks to VArtem for confirming that this is indeed the case and pointing to some relevant references.

The usual gcd algorithm looks something like this:

auto gcd = []<std::integral T>(T a, T b) -> T {
    while (b) {
        a = std::exchange(b, a % b);
    }
    return std::abs(a);
};

Consider the following implementation, which greedily makes the second argument smaller. Note that replacing the remainder r by b - r is valid here, since gcd(b, r) = gcd(b, b - r).

auto gcd2 = []<std::integral T>(T a, T b) -> T {
    a = std::abs(a);
    b = std::abs(b);
    while (b) {
        T r = a % b;
        if (r > (b >> 1)) r = b - r;
        a = std::exchange(b, r); // a = std::exchange(b, std::min(a % b, b - a % b)); also works fine
    }
    return a;
};

On random data, gcd2 seems to be around 30% faster than gcd on my machine, for int32_t and int64_t arguments.

It doesn't seem immediately obvious, but on all inputs I brute-forced, gcd2 never took more iterations than gcd. A proof is outlined in the comments.
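
For instance, for (a, b) = (8, 5), gcd goes (8, 5) -> (5, 3) -> (3, 2) -> (2, 1) -> (1, 0), while gcd2 goes (8, 5) -> (5, 2) -> (2, 1) -> (1, 0). As a rough sketch of what the brute force looks like (the step-counting helpers and the LIMIT bound here are purely illustrative, and not part of the benchmark below):

#include <cstdint>
#include <iostream>
#include <utility>

// iteration count of the usual Euclidean loop
int steps_gcd(int64_t a, int64_t b) {
    int steps = 0;
    while (b) {
        a = std::exchange(b, a % b);
        ++steps;
    }
    return steps;
}

// iteration count when the remainder r is greedily replaced by b - r whenever r > b / 2
int steps_gcd2(int64_t a, int64_t b) {
    int steps = 0;
    while (b) {
        int64_t r = a % b;
        if (r > (b >> 1)) r = b - r;
        a = std::exchange(b, r);
        ++steps;
    }
    return steps;
}

int main() {
    constexpr int64_t LIMIT = 2000; // small illustrative bound
    for (int64_t a = 0; a < LIMIT; a++)
        for (int64_t b = 0; b < LIMIT; b++)
            if (steps_gcd2(a, b) > steps_gcd(a, b))
                std::cout << "counterexample: " << a << ' ' << b << '\n';
    std::cout << "checked all pairs below " << LIMIT << '\n';
}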

Here’s the benchmarking code I used:

Benchmarking code
#include "bits/stdc++.h"

template <typename C = std::chrono::steady_clock, typename T1 = std::chrono::nanoseconds, typename T2 = std::chrono::milliseconds>
struct Stopwatch {
    std::string_view name;
    std::chrono::time_point<C> last_played;
    T1 elapsed_time;
    bool running{};
    Stopwatch(std::string_view s) : name{s}, running{true} { reset(); }
    Stopwatch() : Stopwatch("Time") {}
    void reset() {
        last_played = C::now();
        elapsed_time = T1::zero();
    }
    void pause() {
        if (!running) return;
        running = false;
        elapsed_time += std::chrono::duration_cast<T1>(C::now() - last_played);
    }
    void play() {
        if (running) return;
        running = true;
        last_played = C::now();
    }
    int64_t elapsed() const {
        return std::chrono::duration_cast<T2>(elapsed_time + (running ? std::chrono::duration_cast<T1>(C::now() - last_played) : T1::zero())).count();
    }
    void print() const { std::cerr << name << ": " << elapsed() << " ms\n"; }
    ~Stopwatch() { print(); }
};

auto gcd = []<std::integral T>(T a, T b) -> T {
    while (b) {
        a = std::exchange(b, a % b);
    }
    return std::abs(a);
};

auto gcd2 = []<std::integral T>(T a, T b) -> T {
    a = std::abs(a);
    b = std::abs(b);
    while (b) {
        T r = a % b;
        if (r > (b >> 1)) r = b - r;
        a = std::exchange(b, r);
    }
    return a;
};

template <typename T, typename F>
void run(const std::vector<std::pair<T, T>>& queries, F&& f, std::string_view name) {
    Stopwatch sw{name};
    T ans = 0;
    for (auto& [a, b] : queries) ans ^= f(a, b);
    std::cout << ans << '\n';
}

template <typename T>
auto get_random(size_t n) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<T> dist(0, std::numeric_limits<T>::max());
    std::vector<std::pair<T, T>> v;
    while (n--) {
        auto n1 = dist(gen), n2 = dist(gen);
        v.emplace_back(n1, n2);
    }
    return v;
}

int main() {
    constexpr int N = 1e8;
    auto a = get_random<int32_t>(N);
    auto b = get_random<int64_t>(N);
    run(a, gcd, "gcd_32");
    run(a, gcd2, "gcd2_32");
    run(b, gcd, "gcd_64");
    run(b, gcd2, "gcd2_64");
}

And the timing results I get are:

190237720
gcd_32: 14054 ms
190237720
gcd2_32: 11064 ms
13318666
gcd_64: 47618 ms
13318666
gcd2_64: 36981 ms

When I compare this new algorithm with std::gcd on similar data, gcd2 is barely any slower for 32-bit ints, but the cost of the % operator makes it much slower for 64-bit ints.
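
For reference, plugging std::gcd into the same harness only needs two more run calls inside main of the benchmark above (a sketch; the "std_gcd_*" labels are just illustrative, and std::gcd itself comes from <numeric>, which bits/stdc++.h already includes):

    run(a, [](auto x, auto y) { return std::gcd(x, y); }, "std_gcd_32");
    run(b, [](auto x, auto y) { return std::gcd(x, y); }, "std_gcd_64");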

Do let me know what you think about it in the comments, and a proof/counterexample would be appreciated!