
Regression: Adding a lot of files to MFS will slow ipfs down significantly #8694

Closed
3 tasks done
RubenKelevra opened this issue Jan 22, 2022 · 44 comments · Fixed by #10630
Labels
kind/bug A bug in existing code (including security flaws) need/analysis Needs further analysis before proceeding P1 High: Likely tackled by core team if no one steps up topic/MFS Topic MFS topic/sharding Topic about Sharding (HAMT etc)

Comments

@RubenKelevra
Contributor

RubenKelevra commented Jan 22, 2022

Checklist

Installation method

built from source

Version

go-ipfs version: 0.13.0-dev-2a871ef01
Repo version: 12
System version: amd64/linux
Golang version: go1.17.6

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
    "Announce": [],
    "AppendAnnounce": null,
    "Gateway": "/ip4/127.0.0.1/tcp/80",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/443",
      "/ip6/::/tcp/443",
      "/ip4/0.0.0.0/udp/443/quic",
      "/ip6/::/udp/443/quic"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": null
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": false,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "500GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "AcceleratedDHTClient": false,
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "xxx"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "96h",
    "RepublishPeriod": "",
    "ResolveCacheSize": 2048
  },
  "Migration": {
    "DownloadSources": null,
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {},
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": "gossipsub"
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Type": "dhtserver"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.0.0/ipcidr/29",
      "/ip4/192.0.0.8/ipcidr/32",
      "/ip4/192.0.0.170/ipcidr/32",
      "/ip4/192.0.0.171/ipcidr/32",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "3m",
      "HighWater": 700,
      "LowWater": 500,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {},
    "RelayService": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {
        "QUIC": false
      },
      "Security": {}
    }
  }
}

Description

I've been running 2a871ef, compiled with go 1.17.6 on Arch Linux, for some days on one of my servers.

I had trouble with my MFS datastore after updating (I couldn't delete a file). So I reset my datastore and started importing the data again.

I'm using a shell script that adds the files and folders individually. Because of #7532, I can't use ipfs files write but instead use ipfs add, followed by an ipfs files cp /ipfs/$cid /path/to/file and an ipfs pin rm $cid.

For the ipfs add I set size-65536 as the chunker and blake2b-256 as the hashing algorithm, and use raw-leaves.
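
For illustration, a minimal sketch of that per-file sequence (the destination path and the loop are placeholders, not the actual script):

for f in ./pkg/*.pkg.tar.zst; do
  cid=$(ipfs add -Q --chunker=size-65536 --hash=blake2b-256 --raw-leaves "$f")
  ipfs files cp "/ipfs/$cid" "/x86-64.archlinux.pkg.pacman.store/$(basename "$f")"
  ipfs pin rm "$cid"    # ipfs add pins by default, so drop the pin again
done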


After 3 days, there was basically no IO on the machine and ipfs was using around 1.6 cores pretty consistently without any real progress. At that time only this one script was running against the API, with no concurrency. The automatic garbage collector of ipfs is off.

There are no experimental settings activated and I'm using flatfs.

I did some debugging, all operations were still working, just extremely slow:

$ time /usr/sbin/ipfs --api=/ip4/127.0.0.1/tcp/5001 files stat --hash --offline /x86-64.archlinux.pkg.pacman.store/community
bafybeianfwoujqfauris6eci6nclgng72jttdp5xtyeygmkivzyss4xhum

real	0m59.164s
user	0m0.299s
sys	0m0.042s

and

$ time /usr/sbin/ipfs --api=/ip4/127.0.0.1/tcp/5001 files stat --hash --offline --with-local /x86-64.archlinux.pkg.pacman.store/community
bafybeie5kkzcg6ftmppbuauy3tgtx2f4gyp7nhfdfsveca7loopufbijxu
Local: 20 GB of 20 GB (100.00%)

real	4m55.298s
user	0m0.378s
sys	0m0.031s

This was while my script was still running against the API and waiting minutes for each response.

Here's my memory dump etc. while the issue occurred: /ipfs/QmPJ1ec2CywWLFeaHFaTeo6g56S5Bqi3g3MEF1a3JrL8zk

Here's a dump after I stopped the import of files and the CPU usage dropped down to like 0.3 cores: /ipfs/QmbotJhgzc2SBxuvGA9dsCFLbxd836QBNFYkLhdqTCZwrP

Here's what the memory looked like as the issue occurred (according to atop 1):

MEM
tot    31.4G
free    6.6G
cache   1.1G
dirty   0.1M
buff   48.9M
slab    7.1G
slrec   3.7G
shmem   2.0M
shrss   0.0M
vmbal   0.0M
zfarc  15.6G
hptot   0.0M
hpuse   0.0M

The machine has 10 dedicated cores from an AMD EPYC 7702 and 1 TB of SSD storage via NAS.

@RubenKelevra RubenKelevra added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jan 22, 2022
@RubenKelevra
Contributor Author

The shell script I'm using is open source, so you should be able to reproduce this:

git clone https://github.com/RubenKelevra/rsync2ipfs-cluster.git rsync2ipfs-cluster
cd rsync2ipfs-cluster
git reset --hard 1fd9712371f0315a35a80e9680340655ba751d7a
bash bin/rsync2cluster.sh --create --arch-config

This will rsync the Arch package mirror, loop over the files, and import them into the local MFS.

Just make sure you have enough space in ~ for the download (69 GB) and on the IPFS node to write it into storage.

@aschmahmann
Contributor

@RubenKelevra do you know which version caused a regression? Have you tried with v0.11.0? v0.12.0 is a very targeted release which should not have disturbed much so understanding when this issue emerged would be very helpful.

@RubenKelevra
Contributor Author

Hey @aschmahmann, I started the import on 0.11 yesterday. As soon as I'm home I can report if this is happening there too.

While an offline import works without slowdown, I still sometimes get errors back which look like the ipfs add returns too quickly and the subsequent ipfs files cp command can't yet access the CID.

This seems to be a separate issue which is probably not a regression, as I never tried importing offline before.

@RubenKelevra
Contributor Author

I can confirm this issue for 0.11 as well, so it's not a new thing.

$ ipfs version --all
go-ipfs version: 0.11.0-67220edaa
Repo version: 11
System version: amd64/linux
Golang version: go1.17.6

The next step for me is to try the binary from dist.ipfs.io to rule out any build issues.

@RubenKelevra
Contributor Author

The next step for me is to try the binary from dist.ipfs.io to rule out any build issues.

I can confirm the issue for the binary from dist.ipfs.io as well.

@aschmahmann
Contributor

aschmahmann commented Jan 30, 2022

I can confirm this issue for 0.11 as well, so it's not a new thing.

Thanks that's very helpful. Is this a v0.10.0 -> v0.11.0 thing? When was the last known version before the behavior started changing? In any event, having a more minimal reproduction would help (e.g. making a version of the script that works from a local folder rather than relying on rsync).

If this is v0.11.0 related then my suspicion is that you have directories that were small enough you could transfer them through go-ipfs previously, but large enough that MFS will now automatically shard them (could be confirmed by looking at your MFS via ipfs dag get and seeing if you have links like FF in your directories). IIRC I saw some HAMT checks in your profile dump which would support this.
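
For reference, a rough way to do that check from the shell (the directory path here is just an example from this issue):

cid=$(ipfs files stat --hash --offline /x86-64.archlinux.pkg.pacman.store/community)
ipfs dag get "$cid" | head -c 2000
# A HAMT-sharded directory lists link names prefixed with two hex characters
# (e.g. "FF<name>"), while a basic directory lists plain file names.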

If so then what exactly about sharded directories + MFS is causing the slow down should be looked at. Some things I'd start with investigating are:

  • The modifications of the sharded directories are more expensive for repeated MFS updates
    • Since you have to modify multiple blocks at a time
    • The limit checks for automatic sharding/unsharding are too expensive for repeated MFS modifications
    • Bulking up writes and flushing would likely help here, although if going down this road I'd be careful. My suspicion is that MFS flush has not been extensively tested and probably even more so with sharded directories

@RubenKelevra
Contributor Author

Thanks that's very helpful. Is this a v0.10.0 -> v0.11.0 thing? When was the last known version before the behavior started changing?

I think the last time I ran a full import I was on 0.9.1.

I just started the import to make sure that's correct.

In any event, having a more minimal reproduction would help (e.g. making a version of the script that works from a local folder rather than relying on rsync).

Sure, if you want to avoid any rsync, just comment out L87. I think that should work.

The script will still expect a repository directory like from Manjaro or Arch to work properly, but you can just reuse the same repository without having to update it between each try.

If so then what exactly about sharded directories + MFS is causing the slow down should be looked at. Some things I'd start with investigating are:

  • The modifications of the sharded directories are more expensive for repeated MFS updates

    • Since you have to modify multiple blocks at a time
    • The limit checks for automatic sharding/unsharding are too expensive for repeated MFS modifications
    • Bulking up writes and flushing would likely help here, although if going down this road I'd be careful. My suspicion is that MFS flush has not been extensively tested and probably even more so with sharded directories

Sounds like a reasonable suspicion, but on the other hand, this shouldn't lead to minutes in response time for simple operations.

I feel like we're dealing with some kind of locked operation which gets "overwritten" by new data fed into ipfs while it's running, so tasks pile up behind a lock.

This would explain why it starts fast and gets slower and slower until it's basically down to a crawl.

@RubenKelevra
Contributor Author

RubenKelevra commented Feb 5, 2022

Ah, and additionally: I used sharding previously just for testing, but decided against it. So the import was running fine with sharding before (around 0.4 or something).

Previously, there was no need for sharding, which makes me wonder why IPFS would do sharding if it's not necessary.

@RubenKelevra
Contributor Author

@aschmahmann I've installed 0.9.1 from dist.ipfs.io and I can confirm, the bug is not present in this version.

@aschmahmann
Contributor

Ok, so to clarify, your performance/testing looks like:

  • v0.12.0-rc1 ❌
  • v0.11.0 ❌ (includes automatic UnixFS sharding)
  • v0.10.0 ❓ (includes a bunch of code moving to the newer IPLD libraries)
  • v0.9.1 ✔️

Previously there was no need for sharding, which makes me wonder why IPFS would do sharding if it's not necessary.

TLDR: Two reasons. 1) Serializing the block to check if it exceeds the limit before re-encoding it is expensive, so having some conservative estimate is reasonable 2) Maxing out the block size isn't necessarily optimal. For example, if you keep writing blocks up to 1MB in size then every time you add an entry you create a duplicate block of similar size which can lead to a whole bunch of wasted space that you may/may not want to GC depending on how accessible you want your history to be. #7022 (comment)


Thanks for your testing work so far. If you're able to keep going here, understanding if v0.10.0 is ✔️ or ❌ would be helpful. Additionally/alternatively, you could try v0.11.0 and jack up the internal variable controlling the auto-sharding threshold to effectively turn it off by doing ipfs config --json Internal.UnixFSShardingSizeThreshold "\"1GB\"" (1GB is obviously huge and will create blocks too big to transfer, but will make it easy to identify if this is what's causing the performance issue).
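
Spelled out, that would look roughly like this (reading the value back and restarting the daemon are my additions, not part of the suggestion above):

ipfs config --json Internal.UnixFSShardingSizeThreshold "\"1GB\""
ipfs config Internal.UnixFSShardingSizeThreshold    # verify the value was set
# restart the daemon so the new threshold is picked up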

I also realized this internal flag was missing from the docs 🤦 so I put up #8723

@BigLep BigLep moved this to 🥞 Todo in IPFS Shipyard Team Mar 2, 2022
@BigLep BigLep removed the status in IPFS Shipyard Team Mar 2, 2022
@BigLep BigLep removed the need/triage Needs initial labeling and prioritization label Mar 3, 2022
@BigLep BigLep added the need/author-input Needs input from the original author label Mar 3, 2022
@BigLep
Contributor

BigLep commented Mar 25, 2022

We're going to close this because we don't have additional info to dig in further. Feel free to reopen with the requested info if this is still an issue. Thanks.

@BigLep BigLep closed this as completed Mar 25, 2022
@RubenKelevra
Contributor Author

@aschmahmann was this fixed? I updated to 0.13.0-rc1 and ran into serious performance issues again.

Have you tried adding many files to the MFS with a simple ipfs add / ipfs files cp / ipfs pin rm loop, or tried my script yet?

@RubenKelevra

This comment was marked as resolved.

@RubenKelevra

This comment was marked as off-topic.

@RubenKelevra
Contributor Author

@aschmahmann I set ipfs config --json Internal.UnixFSShardingSizeThreshold "\"1MB\"", so 1 MB rather than 1 GB, since this should work in theory.

But I still see 30 second delays for removing a single file in the MFS.

I think this was more due to large repinning operations by the cluster daemon, as the MFS folders need to be pinned locally on every change.

I created a ticket on the cluster project for this.

Furthermore, I see (at least with a few file changes) no large hangs when using 1 MB sharding.

But I haven't yet tested the full import I originally had trouble with, which is what this ticket is about.

@RubenKelevra
Contributor Author

RubenKelevra commented Jun 5, 2022

@aschmahmann I can confirm this issue with the suggested ipfs config --json Internal.UnixFSShardingSizeThreshold "\"1MB\"" with a current master (a72753b) as well as the current stable (0.12.2) from the ArchLinux repo.

(1 MB should never be exceeded on my datasets, as sharding wasn't necessary before to store the folders.)

The changes to the MFS grind to a halt after a lot of consecutive operations, where single ipfs files cp /ipfs/$CID /path/to/file commands take 1-2 minutes while the IPFS daemon is using 4-6 cores worth of CPU power.

All other MFS operations are blocked as well, so you get response times in the minutes for simple ls operations.

@BigLep please reopen as this isn't fixed and can be reproduced

@RubenKelevra
Contributor Author

I'll take my project pacman.store, with the package mirrors for Manjaro, Arch Linux, etc., down until this is solved. I don't want to run 0.9 anymore due to its age and would need to downgrade the whole server again.

I just cannot share packages that are days or even weeks old due to safety concerns, so I don't want to do any harm here.

The URLs will just return empty directories for now.

@lidel

This comment was marked as off-topic.

@lidel lidel added P1 High: Likely tackled by core team if no one steps up topic/MFS Topic MFS topic/sharding Topic about Sharding (HAMT etc) labels Jun 27, 2022
@lidel lidel moved this to 🥞 Todo in IPFS Shipyard Team Jun 27, 2022
@lidel lidel added the need/analysis Needs further analysis before proceeding label Jun 27, 2022
@RubenKelevra
Contributor Author

RubenKelevra commented Jun 27, 2022

@RubenKelevra you mean it is still broken, even after the switch to go-unixfs v0.4.0?

Yeah. CPU load piles up and a simple ipfs files cp /ipfs/$CID /path/in/mfs takes minutes to complete.

I think there's just something running concurrently, and somehow work needs to be done again and again to apply the change to the MFS, as other parts of it are still changing. But that's just a guess. Could be anything else, really.

@schomatis
Contributor

Asked them for specific commands that are executed just before the crash

@lidel We might be conflating different topics in the same issue here, let's have a new issue for that report when it comes and please ping me.

@lidel
Member

lidel commented Jun 27, 2022

@schomatis ack, moved panic investigation to #9063

@dhyaniarun1993

This comment was marked as off-topic.

@Jorropo
Contributor

Jorropo commented Jul 25, 2022

@dhyaniarun1993 I am confident that this is another issue (I couldn't find the existing issue, so if you want, open a new one even though we know what this is).
ipfs ls fetches blocks one by one, so it's a sequential process to read all the 40k files.

@dhyaniarun1993

This comment was marked as off-topic.

@Jorropo
Contributor

Jorropo commented Jul 25, 2022

@dhyaniarun1993 I don't want to spam this issue, so I'll mark our conversation off-topic. FYI, please open a new issue.

The resolution (walking from / to /xyz) is fine enough.
The issue is:

/xyz/0
/xyz/1
/xyz/2
/xyz/3
/xyz/4
/xyz/5

Kubo will fetch 0, 1, 2, 3, 4 and 5 one-by-one (instead of 32 by 32 in parallel for example).

EDIT: GitHub won't let me hide my own messages... :'(

@CMB

CMB commented Aug 4, 2022

I have this same issue. I'm maintaining a package mirror with approximately 400,000 files. ipfs files cp gets progressively slower as files are added to MFS.

Here's the question: I already have a table of names and CIDs of all of the files; I track them in a database. Is there a way I could create the directory structure in one fell swoop from my list, rather than adding links to MFS one at a time?

@BigLep
Contributor

BigLep commented Sep 1, 2022

@Jorropo : what are the next steps here?

@BigLep BigLep assigned Jorropo and unassigned aschmahmann Nov 17, 2022
@Jorropo Jorropo removed their assignment Mar 4, 2024
@hsanjuan
Contributor

Looking at this:

  • I created 50k files of 64KB
  • ipfs --offline daemon
  • ipfs add $file; ipfs files cp /ipfs/$cid /mfs/. I'm measuring how long the MFS part takes; I understand this should show the issue (see the sketch after this list).
  • At ~2000 files we are relatively stable at ~80ms.
  • At ~4000 files, still stable at ~80ms.
  • At some very specific point, around ~6000, it jumps to ~130ms.
  • Soon after, mfs cp starts taking 375 seconds.
  • CPU usage is at 200%
  • Only one table file in datastore folder.
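
A rough shell equivalent of the setup described in the list above (file count, sizes, and the MFS path are illustrative):

mkdir -p files
for i in $(seq 1 50000); do head -c 65536 /dev/urandom > "files/f$i"; done
ipfs daemon --offline &
sleep 5    # give the daemon time to start
ipfs files mkdir -p /bench
for i in $(seq 1 50000); do
  cid=$(ipfs add -Q "files/f$i")
  # time only the MFS copy, which is the part that degrades
  /usr/bin/time -f "%e s" ipfs files cp "/ipfs/$cid" "/bench/f$i"
done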

heap from pprof: [screenshot]

cpu from pprof: [screenshot]

block: [screenshot]

allocations: [screenshot]

It seems to me that this is related to sharding. Funnily enough, my computer froze while writing this and I lost the profile data, which I had in a RAM folder; it also showed a bunch of goroutines waiting on a leveldb select().

@RubenKelevra
Contributor Author

@hsanjuan Thanks for taking a look at this!

Would be cool to bring the mirror back online!

@hsanjuan
Contributor

Running the same test with pebbleds as the backend instead of leveldb:

  • At ~4000 we are stable at ~60ms per ipfs files cp.
  • At ~6000 things start taking ~95ms, then it hangs for 30 seconds on specific nodes, then continues.
  • After a bit it continues normally. At ~12400 it takes ~100ms per entry.
  • At ~14200 I observe some slow nodes again. I added some logging and they correspond to calls to unixfs.AddChild(), which is being called with a large number of files in the MFS folder. This happens every few nodes, and it seems it calls AddChild with more and more entries until it takes minutes...

And OK, this is the issue:

  • MFS has a caching layer that caches every node that is added.
  • There is a PinMFS service that runs every 30 seconds to pin the MFS root to remote services.
  • Even when none are configured, it calls FilesRoot.GetDirectory().GetNode() to get the MFS root.
  • GetNode() calls directory.Sync(), which ranges over all entries in an internal MFS cache to update the underlying MFS structure.
  • We end up calling AddChild() with many thousands of entries (all of the entries in the cache): https://github.com/ipfs/boxo/blob/7c459afd9e4b8c33780038bf08652b0f5a7d9c43/mfs/dir.go#L130
  • Meanwhile, the AddChild() calls from the ipfs files cp process are locked out.
  • The cache is never emptied: https://github.com/ipfs/boxo/blob/7c459afd9e4b8c33780038bf08652b0f5a7d9c43/mfs/dir.go#L385. This likely leaks memory. IIRC it could even be caching the actual contents of the nodes. I don't notice it so much because my files are all small.
  • Things work OK after a reboot because the cache is empty then. So you have 30 seconds to add things before the first sweep, and so on, until the sweep ends up taking 30 seconds itself and it becomes impossible to add anything.

I will work on a fix tomorrow, but doing export MFS_PIN_POLL_INTERVAL=99999999m should fix it.
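
To be explicit, the variable has to be set in the environment of the daemon process (not the client), for example:

export MFS_PIN_POLL_INTERVAL=99999999m    # effectively disables the remote-pin MFS poller
ipfs daemon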

hsanjuan added a commit that referenced this issue Dec 18, 2024
This is a mitigation for increased MFS memory usage in the course of many write operations.

The underlying issue is the unbounded growth of the mfs directory cache in
boxo. In the latest boxo version, this cache can be cleared by calling Flush()
on the folder. In order to trigger that, we call Flush() on the parent folder
of the file/folder where the write-operations are happening.

Not flushing the parent folder allows it to grow unbounded. Then, any read
operation on that folder or its parents (i.e. stat) will trigger a sync operation to match
the cache to the underlying unixfs structure (and obtain the correct node CID).

This sync operation must visit every item in the cache. When the cache has grown too much,
and the underlying unixfs-folder has switched into a HAMT, the operation can take minutes.

Thus, we should clear the cache often and the Flush flag is a good indicator
that we can let it go. Users can always run with --flush=false and flush at
regular intervals during their MFS writes if they want to extract some performance.

Fixes #8694, #10588.
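
As a practical illustration of the --flush=false pattern mentioned above, here is a minimal batching sketch (the paths, input format, and batch size are made up for the example):

ipfs files mkdir -p /mirror
i=0
while read -r name cid; do
  ipfs files cp --flush=false "/ipfs/$cid" "/mirror/$name"
  i=$((i + 1))
  # flush the folder every 1000 entries so accumulated changes are persisted
  [ $((i % 1000)) -eq 0 ] && ipfs files flush /mirror
done < files.tsv
ipfs files flush /mirror
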
@hsanjuan
Contributor

Ok, summary for everyone looking here:

  • The problem: adding many files to a single MFS folder causes memory growth and generally makes MFS super slow, eventually deadlocking on write operations.

  • The causes:

      1. MFS directories have an in-memory cache. This grows with every item added. Every time the directory node itself is read, the cache is "synced" so that we can read the right unixfs data, with the right CID, but the cache is not cleared. Writing many nodes to a folder results in keeping them in memory, and also in slowing down reads due to these "sync" operations.
      2. There is a worker that regularly pins the MFS root to remote pinning services. This triggers reading the directory node, which triggers the sync, every 30 seconds. If reading the directory node takes more than 30 seconds, the next iteration piles up. While reading the node, no writes can happen -> deadlocks.
      3. An additional contributor is unixfs HAMT for large directories. At ~5800 entries, a directory is converted from a "basic directory" (a node with links) into a HAMT directory. Adding entries to a HAMT, reading it, etc. is about 2x slower at the beginning, and it grows consistently as there are more branches (sometimes going up and sometimes going down depending on how the shape of the HAMT is forming).
  • The next release will address:

The HAMT slowdown part is something to keep in mind when working with MFS folders with a large number of files in them, but it is not a bug.
