ML Experiments Management with Git (github.com/iterative)
112 points by shcheklein on Nov 2, 2023 | 30 comments


Another option that manages versioning of your computational graph and its results, and provides extremely elegant, query-able memoization, is Mandala: https://github.com/amakelov/mandala

It is a much simpler and much more magical piece of software that truly expanded how I think about writing, exploring, and experimenting with code. Even if you never use it, you probably would really enjoy reading the blog posts the author wrote about the design of the tool https://amakelov.github.io/blog/pl/
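For flavor, the core trick (memoizing function calls by the content of their inputs, so re-running an experiment script recomputes only what changed) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not mandala's actual API, which is decorator-based and also tracks the call graph so results can be queried later:

    import functools, hashlib, pickle

    _results = {}  # maps a hash of the call signature to the saved result

    def memo(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            if key not in _results:          # recompute only on unseen inputs
                _results[key] = fn(*args, **kwargs)
            return _results[key]
        return wrapper

    @memo
    def train(lr, epochs):
        return {"val_acc": 0.9}  # stand-in for an expensive training run

Calling train(0.01, 10) twice runs the body once; changing an argument triggers a fresh computation.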



Wow. What a quality tool. Thanks for sharing!


Author here - thank you for the compliment! Always happy to answer any questions (the docs are admittedly quite sparse) and hear about what people like/dislike about the library!


One of the maintainers here. I posted this link specifically to emphasize the experiment management aspect of DVC. Historically, because of its name (Data Version Control), users perceived it as a pure replacement for LFS scenarios, while in reality it has always had pipelines, metrics, and more.

I 100% agree that managing large datasets by moving them around is not practical, and definitely not in the LFS/DVC style. There should be a level of indirection if reproducibility is needed (pointers to files are versioned, not the data directly; the data should stay in the cloud).

Here, I would love to once more mention some other cool features that DVC has: e.g. the `dvc exp` set of commands, which creates custom git refs to snapshot experiments, or the DVCLive logger that helps capture metrics, plots, etc. There is also a VS Code extension [1] that provides quite a nice experience for the experiments workflow inside VS Code.
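For anyone who hasn't seen DVCLive, logging from a training loop looks roughly like this (a minimal sketch; the parameter values and metric names are made up, see the DVCLive docs for the full API):

    from dvclive import Live

    def train_one_epoch(epoch):
        return 1.0 / (epoch + 1)  # placeholder for real training code

    with Live() as live:
        live.log_param("lr", 0.001)
        for epoch in range(3):
            live.log_metric("train/loss", train_one_epoch(epoch))
            live.next_step()  # advance the step counter for the next round of metrics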

The point here is that for DVC, the ability to capture large files and directories (that do not fit into Git) was always a low-level mechanism to support higher-level scenarios (e.g. you need to save a model somewhere as the output of an experiment).

[1] https://marketplace.visualstudio.com/items?itemName=Iterativ...


> I 100% agree that managing large datasets by moving them around is not practical, and definitely not in the LFS/DVC style. There should be a level of indirection if reproducibility is needed (pointers to files are versioned, not the data directly; the data should stay in the cloud).

I am not sure I understand that correctly. Are you saying that LFS/DVC manage the data suboptimally because they do not use some kind of pointer?

I only have some experience with DataLad[0], not with DVC or LFS. DataLad is built on git-annex, which does a pointer indirection through symlinks or pointer files in git. You basically manage the directory structure in git and can "get" and "drop" specific files as you need them. git-annex keeps track of where the data is (e.g. on what remote system, which could be anything from an HTTP server, to S3, to a Nextcloud instance via WebDAV, and more) and how it can be fetched. I always thought DVC did something similar.

[0] https://www.datalad.org/


I work at XetHub and we're taking a different approach here to managing ML experiments in git.

Instead of trying to store data and ML models in one place (like S3) and code, models, and documentation in another place (like GitHub), we are scaling Git so you can version everything in a single system. You just use git; you don't need to learn a new tool or set of commands.

This way, you can start with a simple experiment-tracking approach of folders inside the same branch, then gradually evolve to multiple branches with long-running experiments.

We're about to release our GitHub integration, so data and ML teams can take advantage of this inside their existing GitHub repos. If anyone wants a tour or wants to chat, my email's in my HN profile.

If you're curious about our tech:

- Here's an example 3.3 TB Git repo: https://xethub.com/XetHub/RedPajama-Data-1T

- We wrote a paper on our solution to scale git to 100 terabytes: https://about.xethub.com/blog/git-is-for-data-published-in-c...

- We created a Rust library to mount large repos to machines with limited storage space: https://news.ycombinator.com/item?id=37573679


I have used DVC (specifically pipelines and experiments) for a little while now, and I have found it to be a great tool for creating a standardized process for training ML models. Workflows in so many teams consist of a bunch of notebooks that aren't versioned and live only on developers' local machines, with no reproducibility or standardization. DVC is a great lightweight tool that is easy to set up and use, and customizable to whatever hardware or architecture you are using. Most teams I have seen keep data and models on local machines and do not version them whatsoever. DVC has been great for creating reproducible models, which has always been the biggest focus point for me. Overall I think it is a great tool and does a whole lot more than just data version control; things like experiments and DVCLive are super great.


I gave up on dvc and instead switched to huggingface and wandb because of the way it handled large files and the large local cache it downloaded.


I haven't used it, but git seems like the wrong framework for managing experiments. Even programmers don't especially like git, let alone data scientists. You can't meaningfully version control the data; it's usually binary, and in an object store. The code already lives in a git repository. So what does that leave, the metadata around the experiment?

A relational database like Dolt seems like a better fit. You want to be able to query by experiment name, date, test results, and other metadata.
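Even without Dolt, the shape of that workflow is easy to see with plain SQLite (a hypothetical schema, just to illustrate the kind of querying that's awkward to do over git history):

    import sqlite3

    con = sqlite3.connect("experiments.db")
    con.execute("""CREATE TABLE IF NOT EXISTS experiments
                   (name TEXT, commit_sha TEXT, started_at TEXT,
                    val_acc REAL, notes TEXT)""")
    con.execute("INSERT INTO experiments VALUES (?, ?, datetime('now'), ?, ?)",
                ("baseline-resnet50", "a1b2c3d", 0.87, "lr=3e-4, bs=32"))
    con.commit()

    # query by metric, date, or name
    for row in con.execute("SELECT name, val_acc FROM experiments "
                           "WHERE val_acc > 0.8 ORDER BY started_at DESC"):
        print(row)

Dolt layers branching and diffs on top of tables like this, which is where it goes beyond a plain database.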

Let me know if I'm doing it wrong! What's your use case?


> You can't meaningfully version control the data; it's usually binary, and in an object store. The code already lives in a git repository. So what does that leave, the metadata around the experiment?

Out of the box you can't. We're taking a different approach at work (XetHub). GitHub sees pointer files, but we embed rendered views and diffs (supporting more file types incrementally) using a GitHub app from our service.


DVC stores metadata files in your git repo and the actual files somewhere else (local, S3, etc.). It handles swapping the files around so they match what you've checked out. It does a lot more, but that's the relevant part for your main question.
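The mechanism is roughly content-addressed storage plus tiny pointer files. A toy sketch of the idea (not DVC's actual file format or cache layout):

    import hashlib, os, shutil

    CACHE = ".cache"  # stand-in for DVC's local cache or an S3 remote

    def add(path):
        # move the real bytes into the cache, keyed by content hash,
        # and leave a tiny pointer file behind for git to track
        digest = hashlib.md5(open(path, "rb").read()).hexdigest()
        os.makedirs(CACHE, exist_ok=True)
        shutil.move(path, os.path.join(CACHE, digest))
        with open(path + ".ptr", "w") as f:
            f.write(digest)

    def checkout(path):
        # restore whichever version the current pointer refers to
        digest = open(path + ".ptr").read().strip()
        shutil.copy(os.path.join(CACHE, digest), path)

Switching git branches swaps the small pointer files, and a checkout step then swaps the big files to match.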


Pachyderm is another alternative. Here are a couple of articles that compare the two:

https://www.pachyderm.com/blog/data-versioning-comparing-dvc...

https://www.dolthub.com/blog/2022-04-27-data-version-control...

As far as I know, DVC is better than Pachyderm for small datasets, but Pachyderm scales way better.


I built an ML pipeline in pachyderm years ago, maybe 2016, and loved using it then. But my data was small-to-medium, and I stuck with CSVs because back then you couldn't get many of the benefits with a binary format.

I think it's a shame it took Airflow and similar tools several more years to realize that "each step is a Docker container" is the right way to build a DAG. It's not clear to me why Pachyderm was left behind while Prefect and Dagster became serious contenders, and Airflow/Astronomer started recommending that everyone use it just like Pachyderm did (container per step).


I have a similar harness going for my recent experiments, except instead of hosting with huggingface I have a dataframe with pointers to the files on S3 and then just download them during local preprocessing.
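Roughly this pattern (a sketch with made-up bucket, index file, and column names):

    import os
    import boto3
    import pandas as pd

    s3 = boto3.client("s3")
    index = pd.read_parquet("dataset_index.parquet")  # one row per file, with its S3 key
    os.makedirs("data", exist_ok=True)

    for key in index["s3_key"]:
        s3.download_file("my-training-bucket", key,
                         os.path.join("data", os.path.basename(key)))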

Every time I see DVC mentioned, I feel like the idea was so close (and the intuition to use git for everything was perhaps right), but the execution had just enough friction that I looked elsewhere. Small DX improvements really do cascade pretty far.


DVC is great for medium-scale projects in small teams, but that's where I'd stop with it. It only really makes sense for work that you're doing on your own machine, or an old-school Linux server type of setup, not something you'd use for modern-day ML work in a cloud environment.

Also, I always thought using Git branches to track experiments was a bad idea. I would never want to have only one experiment "active" at a time. Even if I'm only running one process at a time, I still want to be able to look at outputs and such side-by-side. Maybe there's some magic tooling they created that makes it workable.


FYI, you can use git worktrees [1] to work on multiple branches simultaneously.

[1] https://git-scm.com/docs/git-worktree


Yeah, I know and love that feature for software projects, especially if I need to switch over to a bugfix while I'm deep in a topic branch.

But for a data project it would be a big pain to have separate worktrees just to work around what IMO is a usage anti-pattern to begin with!


DVC has `dvc exp`, which doesn't require creating commits or branches. It uses custom git references (technical details in [1]), and experiments can be visualized in the CLI or in VS Code [2].

[1] https://iterative.ai/blog/experiment-refs

[2] https://marketplace.visualstudio.com/items?itemName=Iterativ...


Thanks! I've been using DVC solely for tracking data, and had basically ignored all of its other features.

I'll have to take a look at this. Most/all of my projects use small or medium scale data, and I consider DVC indispensable for tracking data therein. I wouldn't mind having a good system for tracking experiment results, although admittedly I find that a spreadsheet or text file does a pretty good job for what I need to do.


I've really liked the idea of scidataflow in this context: https://github.com/vsbuffalo/scidataflow

It's neat for research as it stores the data on scientific data repositories like Zenodo and you get DOIs.


How do you track code changes between model iterations with this setup, though? From looking at wandb, it seems like it does something similar to MLFlow, so it only logs meta-parameters, right?


It can log anything you want: model config and evaluation metrics, along with the git commit hash of the code that was used. You can also store pointers to datasets. It's easy to compare experiments and choose a model, or to repeat experiments.
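A minimal wandb sketch of that (the project name, config values, and metric names are made up; wandb.init also records the current git commit automatically when run inside a repo):

    import wandb

    run = wandb.init(project="my-experiments",
                     config={"lr": 3e-4, "batch_size": 32,
                             "dataset": "s3://my-bucket/train-v2"})  # pointer to the data

    for epoch in range(3):
        wandb.log({"epoch": epoch, "val/accuracy": 0.80 + 0.01 * epoch})

    run.finish()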


Which things do you do with HF, and which with WandB?

HF seems like "GitHub for models and datasets"; it has a cool brand and everyone in ML uses it in some capacity. But when it comes to _company_ needs like private datasets/models, experiment tracking, CI integrations, etc., it seems WandB is a superset of HF.

HF has an enterprise offering, but it seems to be de-prioritized, and I think you'd still need WandB or MLFlow for experiment tracking?


Exactly! Ironically, dvc doesn't really scale well and is limited to small-ish files.


Hey, sorry to hear that. Could you share more details, please? Were you using it specifically for data management?


I'm using DVC for managing experiments as well as data versioning for hundreds of thousands of files. Its git-like interface is great, but it does have scaling issues, especially with hooks taking tens of minutes on every commit. It also does not support parallel stage execution yet.


Use Determined if you want a nice UI: https://github.com/determined-ai/determined#readme


I'm a developer advocate at Determined AI, here to help anyone who's interested in trying us out. There's a Slack community as well; the link is on the homepage: https://www.determined.ai/


Just to clarify: you still use git as usual to manage your code versions, but Determined has a nice web UI to track the experiments you submit with that code.




