Description
Hi,
I sometimes have the need to archive hundreds of millions of small files.
Lots of software fails on that with out-of-memory errors.
It would be fantastic if somewhere in the mkdwarfs
man page you could document its memory scaling behaviour.
For example:
- How does memory usage grow with the number of files and their sizes?
  - E.g. the expected RAM needed per 1 M files involved.
- Are there options that can affect this?
  - E.g. in some systems you can turn off deduplication to get constant, streaming memory usage.
  - Similarly for hardlink detection.
- Are there flags that can speed up the process?
  - For example, reading 100 files in parallel would help drastically on e.g. distributed, networked file systems backed by spinning disks, where a single IO might take 10 ms but many IOs can be in flight in parallel.
  - --num-scanner-workers looks like such an option; are there others recommended for the "many small files" use case? (A sketch of the invocation I have in mind follows this list.)
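For concreteness, the kind of invocation I have in mind is roughly the following; the paths are placeholders and the worker count is simply the value I experimented with, not a tuned recommendation:

```sh
# Pack a tree of many small files, using more scanner workers in the hope
# of hiding per-file IO latency on a networked filesystem.
# (placeholder paths; 100 workers is just the value I tried)
mkdwarfs -i /mnt/netfs/many-small-files \
         -o /tmp/many-small-files.dwarfs \
         --num-scanner-workers 100
```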
As a quick benchmark, 500 k small files took 2 GB maxresident RAM for me with default options and --num-scanner-workers 100, on a 32-core machine.
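In case the measurement method matters: I took the peak RSS reported by GNU time, roughly like this (input path is a placeholder):

```sh
# "Maximum resident set size (kbytes)" in the -v output of GNU time
# is the number I quoted above as maxresident.
/usr/bin/time -v mkdwarfs -i ./small-files -o out.dwarfs --num-scanner-workers 100
```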
Adding --file-hash=none --max-similarity-size=0 --window-size 0 --memory-limit 100M did not significantly reduce it, but maybe that changes at larger scale.
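The exact variant I tried was along these lines (again a sketch with placeholder paths; the flags are simply the ones that looked memory-related to me, not a systematic selection):

```sh
# Same input as above, but with the flags I hoped would trade the
# deduplication/similarity features for lower, more streaming-like memory use.
mkdwarfs -i /mnt/netfs/many-small-files \
         -o /tmp/many-small-files.dwarfs \
         --num-scanner-workers 100 \
         --file-hash=none \
         --max-similarity-size=0 \
         --window-size 0 \
         --memory-limit 100M
```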
But it is definitely curious that the memory used was 20x higher than the requested memory limit; the option's documentation does say "approximately", but this is a case that further motivates knowing what the other factors in memory consumption are.
It would be awesome if the scaling behaviour could be documented, so that one doesn't have to benchmark it to find out what would happen with 500 M files.
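For scale: naively extrapolating my measurement linearly (2 GB for 500 k files is roughly 4 KB of RAM per file) would put 500 M files at around 2 TB of RAM, but whether the scaling really is linear is exactly the kind of thing I'd hope the documentation could answer.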