Document memory behaviour and give tips for dealing with many files #226

@nh2

Hi,

I sometimes have the need to archive hundreds of millions of small files.

Lots of software fails on that with out-of-memory, for example:

It would be fantastic if somewhere in the mkdwarfs man page you could document its memory scaling behaviour.

For example:

  • How does memory usage grow with the number of files, and their size?
    • E.g. the expected RAM needed per 1M files
  • Are there options that can affect this?
    • E.g. in some systems, you can turn off deduplication to get constant, streaming memory usage.
    • Similar with hardlink detection.
  • Are there flags that can speed up the process?
    • For example, reading 100 files in parallel would help drastically on e.g. distributed, networked file systems backed by spinning disks, where a single IO might take 10 ms but many IOs can be in flight at once. --num-scanner-workers looks like such an option; are there others recommended for the "many small files" use case?
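
To illustrate why parallel reads matter here, a back-of-the-envelope sketch in Python. It assumes exactly one 10 ms IO per file and perfect parallelism, neither of which is guaranteed in practice; `scan_time_seconds` is a hypothetical helper for this estimate, not part of mkdwarfs.

```python
def scan_time_seconds(num_files: int, io_latency_s: float, parallel_ios: int) -> float:
    """Idealised wall-clock time to touch every file once.

    Assumptions (not measured): one IO per file, and the file system
    sustains `parallel_ios` requests in flight with no slowdown.
    """
    return num_files * io_latency_s / parallel_ios


if __name__ == "__main__":
    files = 500_000       # same order of magnitude as the benchmark in this issue
    latency = 0.010       # 10 ms per IO, as in the example above

    serial = scan_time_seconds(files, latency, parallel_ios=1)
    parallel = scan_time_seconds(files, latency, parallel_ios=100)

    print(f"1 IO at a time:    {serial / 60:.1f} minutes")
    print(f"100 IOs in flight: {parallel:.0f} seconds")
```

Under these assumptions the serial scan takes well over an hour while the parallel one takes under a minute, which is why the number of concurrent readers dominates on high-latency storage.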

As a quick benchmark, 500 k small files took 2 GB maxresident RAM for me with default options and --num-scanner-workers 100, on a 32-core machine.
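
For anyone who wants to reproduce such a measurement, here is a minimal sketch that runs a command and reports the peak resident set size of child processes using Python's standard `resource` module. `peak_rss_kib` is a hypothetical helper name; the value is in KiB on Linux (macOS reports bytes), and it is a high-water mark across all children waited for so far, so it is best used once per process.

```python
import resource
import subprocess
import sys


def peak_rss_kib(argv: list[str]) -> int:
    """Run `argv` to completion and return the peak RSS of child processes.

    Units are KiB on Linux. The counter covers every child this process
    has waited for, so call this once per interpreter for a clean number.
    """
    subprocess.run(argv, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real mkdwarfs invocation to measure an actual archiving run.
    demo_cmd = [sys.executable, "-c", "data = bytearray(10_000_000)"]
    print(f"peak RSS: {peak_rss_kib(demo_cmd)} KiB")
```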

--file-hash=none --max-similarity-size=0 --window-size 0 --memory-limit 100M did not significantly reduce it, but maybe that changes at higher scale.

But it is definitely curious that the memory used was 20x higher than the requested memory limit. The documentation does describe the limit as approximate, but this case further motivates knowing what the other factors in memory consumption are.

It would be awesome if the scaling behaviour could be documented, so that one doesn't have to benchmark it to find out what would happen for 500 M files.
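
To make the concern concrete, here is the naive extrapolation from the benchmark above, as a small Python sketch. It assumes memory grows strictly linearly with file count and that "GB" and "M" are decimal units; both are assumptions, and whether the linear model holds is exactly what documentation would need to answer.

```python
# Figures from the benchmark in this issue (decimal units assumed).
measured_files = 500_000
measured_rss_bytes = 2 * 10**9        # 2 GB max resident
memory_limit_bytes = 100 * 10**6      # --memory-limit 100M

# Naive model: all memory is per-file cost, with no fixed overhead.
bytes_per_file = measured_rss_bytes / measured_files
overshoot = measured_rss_bytes / memory_limit_bytes

target_files = 500_000_000
projected_bytes = bytes_per_file * target_files

print(f"per-file cost:        {bytes_per_file:.0f} bytes")
print(f"overshoot vs. limit:  {overshoot:.0f}x")
print(f"projected for 500 M:  {projected_bytes / 10**12:.1f} TB")
```

If part of the 2 GB is fixed overhead rather than per-file state, the real figure would be lower, but without documented scaling behaviour there is no way to tell short of running the benchmark.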
