Description
Hi,
I sometimes have the need to archive hundreds of millions of small files.
Lots of software fails on that with out-of-memory errors.
It would be fantastic if somewhere in the mkdwarfs
man page you could document its memory scaling behaviour.
For example:
- How does memory usage grow with the number of files and their sizes?
  - E.g. the expected RAM needed per 1 M files involved.
- Are there options that can affect this?
  - E.g. in some systems you can turn off deduplication to get constant, streaming memory usage.
  - Similarly for hardlink detection.
- Are there flags that can speed up the process?
  - For example, reading 100 files in parallel would help drastically on e.g. distributed, networked file systems backed by spinning disks, where a single IO might take 10 ms but many IOs can be in flight in parallel.
  - --num-scanner-workers looks like such an option; are there others recommended for the "many small files" use case? (A sketch of the invocation I have in mind follows this list.)
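For concreteness, the kind of invocation I have in mind is roughly the following; the paths are placeholders and the worker count is simply the value I experimented with, not a tuned recommendation:

```sh
# Pack a tree of many small files, using more scanner workers in the hope
# of hiding per-file IO latency on a networked filesystem.
# (placeholder paths; 100 workers is just the value I tried)
mkdwarfs -i /mnt/netfs/many-small-files \
         -o /tmp/many-small-files.dwarfs \
         --num-scanner-workers 100
```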
As a quick benchmark, 500 k small files took 2 GB maxresident RAM for me with default options and --num-scanner-workers 100, on a 32-core machine.
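In case the measurement method matters: I took the peak RSS reported by GNU time, roughly like this (input path is a placeholder):

```sh
# "Maximum resident set size (kbytes)" in the -v output of GNU time
# is the number I quoted above as maxresident.
/usr/bin/time -v mkdwarfs -i ./small-files -o out.dwarfs --num-scanner-workers 100
```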
Adding --file-hash=none --max-similarity-size=0 --window-size 0 --memory-limit 100M did not significantly reduce it, but maybe that changes at larger scale.
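The exact variant I tried was along these lines (again a sketch with placeholder paths; the flags are simply the ones that looked memory-related to me, not a systematic selection):

```sh
# Same input as above, but with the flags I hoped would trade the
# deduplication/similarity features for lower, more streaming-like memory use.
mkdwarfs -i /mnt/netfs/many-small-files \
         -o /tmp/many-small-files.dwarfs \
         --num-scanner-workers 100 \
         --file-hash=none \
         --max-similarity-size=0 \
         --window-size 0 \
         --memory-limit 100M
```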
But it is definitely curious that the memory used was 20x higher than the requested memory limit; the option's documentation does say "approximately", but this is a case that further motivates knowing what the other factors in memory consumption are.
It would be awesome if the scaling behaviour could be documented, so that one doesn't have to benchmark it to find out what would happen with 500 M files.
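For scale: naively extrapolating my measurement linearly (2 GB for 500 k files is roughly 4 KB of RAM per file) would put 500 M files at around 2 TB of RAM, but whether the scaling really is linear is exactly the kind of thing I'd hope the documentation could answer.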