Thursday, July 24, 2025

F43 Change Proposal: Hardlink identical files in packages by default (self-contained)

Wiki - https://fedoraproject.org/wiki/Changes/Hardlink_identical_files_in_packages_by_default
Discussion thread -
https://discussion.fedoraproject.org/t/f43-change-proposal-hardlink-identical-files-in-packages-by-default-self-contained/160769

This is a proposed Change for Fedora Linux.
This document represents a proposed Change. As part of the Changes
process, proposals are publicly announced in order to receive
community feedback. This proposal will only be implemented if approved
by the Fedora Engineering Steering Committee.

== Summary ==
A post-build step is added to the package build macros to
automatically hardlink all identical files under `/usr`. Previously,
this was done only in some packages; now it is done everywhere by
default.

== Owner ==
* Name: [[User:zbyszek|Zbigniew Jędrzejewski-Szmek]]
* Email: zbyszek at in.waw.pl


== Detailed Description ==
Files can be hardlinked at the end of the `%install` step in package
builds. rpm supports this and will preserve those links in the binary
rpm and during installation. This makes the installation a bit more
efficient. Hardlinking of read-only files is generally transparent to
the user, but has some small benefits: the files are not duplicated in
the file system; backup, copy, and search programs will usually make
use of the link information and not process the same inode twice.
Thus, it's good to hardlink as many packaged files as possible.
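
To make the "same inode" point concrete, here is a minimal Python sketch (not part of the proposal; the file names are made up for illustration). It creates a file, adds a hard link to it, and checks that both names resolve to one inode with a link count of 2, which is the information backup and copy tools can use to skip the second name.

```python
import os
import tempfile

def demo_hardlink(directory):
    """Create a file plus a hard link to it and return both stat results."""
    original = os.path.join(directory, "a.txt")   # hypothetical file name
    link = os.path.join(directory, "b.txt")       # hypothetical file name
    with open(original, "w") as f:
        f.write("identical contents\n")
    os.link(original, link)  # second directory entry for the same inode
    return os.stat(original), os.stat(link)

with tempfile.TemporaryDirectory() as tmp:
    st_a, st_b = demo_hardlink(tmp)
    # Two names refer to the same file when device and inode numbers match.
    same_inode = (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
    link_count = st_a.st_nlink
    print(same_inode, link_count)
```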

Previously, hardlinking was done automatically for a subset of files
in Python packages (via the `%__os_install_post_python` macro), and
explicitly in some packages with lots of similar files (usually via
the `hardlink` program).

The `%__os_install_post` macro is extended to automatically hardlink
all identical files under `%{buildroot}%{_prefix}`, i.e. the `/usr`
directory in packages. This calls a new helper binary (part of the
`add-determinism` package) that does the linking.

Hard links may be confusing if the file is ''modified''. In
particular, all links to the same inode share the same ownership and
permissions, and obviously the same contents. Thus, we want to apply
hardlinking only to files under `/usr`, which are generally read-only
in packages.

When files are hardlinked, mtime (the modification timestamp) is taken
into account. Only files with identical mtime, owner, group, and mode
are subject to linking. The new program written to do the linking
takes `$SOURCE_DATE_EPOCH` into account, and will clamp mtimes to it
before comparing.
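
The comparison rule above can be sketched in Python as follows. This is an illustration only and not the actual helper from `add-determinism`: the function names, the use of SHA-256 to compare contents, and the temporary-name suffix are assumptions made for the example. Files are grouped by clamped mtime, owner, group, mode, and contents, and every group is collapsed onto its first path.

```python
import hashlib
import os
import stat
from collections import defaultdict

def link_key(path, epoch):
    """Return the tuple that must match for two files to be linked."""
    st = os.lstat(path)
    mtime = min(int(st.st_mtime), epoch)  # clamp to SOURCE_DATE_EPOCH first
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()  # assumption: hash compare
    return (mtime, st.st_uid, st.st_gid, stat.S_IMODE(st.st_mode),
            st.st_size, digest)

def hardlink_identical(root, epoch):
    """Hardlink identical regular files under root; return the link count."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if stat.S_ISREG(os.lstat(path).st_mode):  # skip symlinks etc.
                groups[link_key(path, epoch)].append(path)
    linked = 0
    for key, paths in groups.items():
        if len(paths) < 2:
            continue
        paths.sort()              # deterministic choice of the surviving inode
        keeper = paths[0]
        os.utime(keeper, (key[0], key[0]))  # store the clamped mtime
        for dup in paths[1:]:
            if os.path.samefile(keeper, dup):
                continue          # already linked
            tmp = dup + ".hardlink-tmp"   # hypothetical temporary name
            os.link(keeper, tmp)
            os.replace(tmp, dup)  # atomically swap the duplicate for the link
            linked += 1
    return linked
```

In a build, `epoch` would come from `int(os.environ["SOURCE_DATE_EPOCH"])` and `root` would be the `/usr` tree inside the build root.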

Note: rpm correctly handles the case where a hardlink is between files
in two different subpackages. Thus, we can hardlink everything under
`%{buildroot}%{_prefix}` without regard to subpackage boundaries, and
rpm will store the files as hardlinked if they are in the same output
package, adjusting the hardlink counts as appropriate.

== Feedback ==

== Benefit to Fedora ==
As mentioned in the Detailed Description, hardlinking deduplicates the
data in rpms and in installations. Backup, copy, and search programs
will usually make use of the link information and not process the same
inode twice. Thus, by hardlinking files in the packages we make things
a bit more efficient. (The impact is small, because rpms generally
don't have large duplicated files.)

Hardlinking of files was previously done in some packages explicitly,
but it required adding a `BuildRequires` line and invoking a script,
so it wasn't done very often. By handling this automatically, we'll be
able to simplify those packages.

Another caveat that needs to be taken into account when doing
hardlinking as part of the package build is that newer `hardlink`
versions use reflinks instead of hardlinks by default. (With a
hardlink, one inode is connected to the file system tree in two or
more places. With a reflink, some blocks of an inode are shared with
another inode, ''inside'' of the file system, and the two inodes
retain their separate identities.) rpm has no knowledge of reflinks,
so those reflinks created during package build have no effect on the
binary package and the payload is duplicated. Invocations of
`hardlink` would have to be annotated with `--reflink=never` to retain
the intended effect. By removing that step from packages we avoid this
issue.

The [https://docs.fedoraproject.org/en-US/reproducible-builds/ Reproducible Builds] effort reported that some packages that use
hardlinking are not reproducible, see
[https://pagure.io/fedora-reproducible-builds/project/issue/22 irreproducibility#22]. When files are created in the package build,
depending on how fast the build machine is, some files might or might
not have identical timestamps. The tools that were used to compare
files for hardlinking were general tools that did not "know" that we'd
clamp the mtimes to `$SOURCE_DATE_EPOCH` in a subsequent step, so the
results of the mtime comparisons were unstable. The tool that is added
as part of this Change does the mtime clamping internally for
reproducible results. Fixing this issue was the initial motivation for
this change.
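
The instability described above can be shown with a few lines of Python. This is a toy model, not the helper's code; the function names and the timestamps are invented for the example. Two identical files are written within the same second on a fast builder and across a second boundary on a slow one. Comparing raw mtimes gives different answers on the two machines, while comparing after clamping to `$SOURCE_DATE_EPOCH` gives the same answer on both.

```python
def clamp(mtime, source_date_epoch):
    """Limit a timestamp to at most SOURCE_DATE_EPOCH."""
    return min(mtime, source_date_epoch)

def would_link(mtime_a, mtime_b, source_date_epoch=None):
    """Decide whether two otherwise identical files pass the mtime check."""
    if source_date_epoch is not None:
        mtime_a = clamp(mtime_a, source_date_epoch)
        mtime_b = clamp(mtime_b, source_date_epoch)
    return mtime_a == mtime_b

EPOCH = 1_700_000_000                       # made-up SOURCE_DATE_EPOCH
fast_builder = (EPOCH + 500, EPOCH + 500)   # both files written in one second
slow_builder = (EPOCH + 500, EPOCH + 501)   # second file lands a second later

# A general-purpose tool compares raw mtimes: the result depends on the machine.
naive = [would_link(*fast_builder), would_link(*slow_builder)]
# Clamping before comparing makes the decision the same on both machines.
clamped = [would_link(*fast_builder, EPOCH), would_link(*slow_builder, EPOCH)]

print(naive)    # [True, False]
print(clamped)  # [True, True]
```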

== Scope ==
* Proposal owners:
** extend the `add-determinism` package with a little helper that does
file comparisons and hardlinks identical files. The helper takes
`$SOURCE_DATE_EPOCH` into account.
** open pull request for `redhat-rpm-config` to insert a call to the
helper in `%__os_install_post`.
** open pull request for `python-srpm-macros` to drop their hardlinking step.
* Other developers:
** merge the pull requests
** report issues if the hardlinking has unforeseen consequences or
does not work correctly.
** drop explicit calls to `hardlink` in their packages.

* Release engineering:

* Policies and guidelines: not needed, AFAICT.

* Trademark approval: N/A (not needed for this Change)


* Alignment with the Fedora Strategy:


== Upgrade/compatibility impact ==
No impact.

== Early Testing (Optional) ==
Build packages with an invocation of the new helper.

== How To Test ==
Install packages rebuilt with the helper.
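
One possible way to check the result after installation is to list the groups of paths that share an inode. The Python sketch below is an assumption about how such a check could look, not an official test procedure; the function name `hardlink_groups` is made up for the example.

```python
import os
import stat
from collections import defaultdict

def hardlink_groups(root):
    """Return sorted groups of regular files under root that share an inode."""
    by_inode = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # Only regular files with more than one link are candidates.
            if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
                by_inode[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes for which at least two names were seen under root.
    return sorted(sorted(paths) for paths in by_inode.values() if len(paths) > 1)
```

Calling `hardlink_groups("/usr")` on a system with rebuilt packages should then report the files that were linked during the build, alongside any hardlinks that existed before.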

== User Experience ==
Not visible to users.

== Dependencies ==

== Contingency Plan ==
* Contingency mechanism:
** if hardlinking causes a problem in some specific packages, they can
be trivially modified to skip the hardlinking step by setting a macro.
** if there is a general problem, we can easily drop the macro in
`redhat-rpm-config`.
* Contingency deadline: any time, even after release. Any affected
packages would have to be rebuilt.
* Blocks release? No.

== Documentation ==
The invocation of the helper will be documented inline in the macros
files. Other documentation is not needed.

== Release Notes ==
Package builds automatically hardlink identical files. This reduces
the installation footprint a bit and also makes package builds more
reproducible.



--
Aoife Moloney

Fedora Operations Architect

Fedora Project

Matrix: @amoloney:fedora.im

IRC: amoloney

