Description
🚀 The feature
At some point, there was an InMemoryCacheHolder datapipe. However, it has been removed in the new node design.
This would be very useful for some expensive parts of the DAG that would gain from being stored in memory rather than recomputed each time.
Motivation, pitch
Some transforms are quite expensive, and I would like to avoid repeating them at each epoch. It would therefore be handy to have a cache mechanism that allows skipping expensive parts of the DAG if they have already been computed. The user could choose to cache in memory or on disk.
However, I'm not sure what the interface would look like. I feel like there would be 2 nodes needed, sharing the cache:
- One at the start of the DAG branch to skip (that would check if passing through the branch is needed)
- One at the end of the branch (that would store the result of the branch for it to be used later)
I can't really think of another way to make this work: you can't have just the first one (otherwise, how do you store the result of the computation at the end of the branch?), and you can't have just the last one (because how do you determine whether the item has been cached or not?).
As far as I understand nodes, they are executed in a bottom-up manner, with the last node requiring the result of the previous node, itself requiring the result of the previous one, all the way up to the first node. However, this design makes it difficult to deal with a cache as you need to decide which branch to take from the bottom. This would be easier with a top-down design, with the data coming from the first node, up to the entrance of the cache, which would be able to make a decision on the branch to choose to continue.
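To make the two-node idea concrete under a pull-based model, here is a minimal sketch in plain Python iterators. It does not use the torchdata API: `CacheEntrance`, `cache_exit`, `advance` and `key_fn` are hypothetical names invented for illustration. It also assumes the branch is strictly one-in, one-out with no prefetching, and that items (or their keys) are hashable. The exit node drives the decision: it asks the entrance for the next key, and only pulls through the branch on a cache miss.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Iterator


class CacheEntrance:
    """Hypothetical node at the start of the branch to skip."""

    def __init__(
        self,
        source: Iterable[Any],
        cache: Dict[Hashable, Any],
        key_fn: Callable[[Any], Hashable] = lambda item: item,
    ) -> None:
        self._source = iter(source)
        self._key_fn = key_fn
        self.cache = cache  # shared with the exit node
        self._pending: Any = None

    def advance(self) -> Hashable:
        """Pull the next source item and return its cache key."""
        self._pending = next(self._source)  # StopIteration signals the end
        return self._key_fn(self._pending)

    def __iter__(self) -> "CacheEntrance":
        return self

    def __next__(self) -> Any:
        # Only reached on a cache miss, when the branch pulls its input.
        return self._pending


def cache_exit(entrance: CacheEntrance, branch: Iterator[Any]) -> Iterator[Any]:
    """Hypothetical node at the end of the branch; stores results on a miss."""
    while True:
        try:
            key = entrance.advance()
        except StopIteration:
            return
        if key not in entrance.cache:
            entrance.cache[key] = next(branch)  # runs the expensive branch
        yield entrance.cache[key]


# Usage: the cache outlives a single epoch, so the second pass skips the branch.
calls = []


def expensive(x: int) -> int:
    calls.append(x)
    return x * x


shared_cache: Dict[Hashable, Any] = {}


def run_epoch(data: Iterable[int]) -> list:
    entrance = CacheEntrance(data, shared_cache)
    branch = map(expensive, entrance)  # stands in for the nodes in between
    return list(cache_exit(entrance, branch))


first = run_epoch([1, 2, 3])
second = run_epoch([1, 2, 3])
```

The sketch breaks down as soon as a node inside the branch batches, filters, or prefetches, since the exit can then no longer assume that one pull on the branch corresponds to the pending item.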
Maybe having some sort of CacheWrapper that wraps a single node would be the solution? But then it would be cumbersome to cache entire branches of the DAG.
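For comparison, a wrapper-style cache could look roughly like the sketch below. Again this is plain Python rather than the torchdata node API, and `CacheWrapper`, `step` and `key_fn` are hypothetical names. It assumes the wrapped step is a per-item callable and that the source can be iterated once per epoch. A whole branch can be cached only if it is first composed into a single callable, which is the cumbersome part.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Iterator


class CacheWrapper:
    """Hypothetical wrapper that caches the output of one per-item step."""

    def __init__(
        self,
        source: Iterable[Any],
        step: Callable[[Any], Any],
        key_fn: Callable[[Any], Hashable] = lambda item: item,
    ) -> None:
        self._source = source
        self._step = step
        self._key_fn = key_fn
        self._cache: Dict[Hashable, Any] = {}  # in-memory store

    def __iter__(self) -> Iterator[Any]:
        for item in self._source:
            key = self._key_fn(item)
            if key not in self._cache:
                # Miss: run the wrapped step once and remember the result.
                self._cache[key] = self._step(item)
            yield self._cache[key]


# Usage: two steps are composed by hand so they can be cached as one unit.
step_calls = []


def decode(x: int) -> int:
    step_calls.append(x)
    return x + 1


def scale(x: int) -> int:
    return x * 10


pipeline = CacheWrapper([1, 2, 3], lambda x: scale(decode(x)))
epoch_1 = list(pipeline)
epoch_2 = list(pipeline)  # served from the cache, decode is not called again
```

Swapping the dictionary for a disk-backed mapping would give the on-disk variant, at the cost of requiring picklable results.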