I found that bit slightly ironic, because it always seems to produce somewhat cringy Go code for me that might get the job done but skips over some of the usual design philosophies, like the use of interfaces, channels, and context. But for the most part, yeah, I've been very satisfied with Go code gen.
of course, it is not there yet. same happens for me. AIs do not get the full project view, nor the dynamics of classes, behavior, domains... but that is probably coming soon
for me it works well for small-scope, isolated subsystems or trivial code: unit tests, "given this example: A -> B, complete C -> ?" style transformations of classes (e.g. repositories, caches, etc.)
That's a great question. Diffing is one area we've thought a bit about but still need to dedicate more cycles to. One thing I would be curious about is: what are you doing in these notebooks to check? For what it's worth, you could possibly have an intermediate Python model that does some calculation to look at differences and materializes the results to a table, which you could then query directly for further insight.
One thing we do have is support for "expectations": model-like Python steps that check data quality and can flag it if the pipeline violates them.
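To make that concrete, here is the kind of check such a step could run. This is only an illustrative sketch (the function and column names are made up, not a specific API; it simply operates on a pyarrow table):

    import pyarrow as pa

    # Illustrative data-quality check of the kind an "expectation" step would run
    # between pipeline stages; the names here are hypothetical.
    def expect_no_null_order_ids(table: pa.Table) -> pa.Table:
        nulls = table["order_id"].null_count  # hypothetical key column
        assert nulls == 0, f"{nulls} null order_id values violate the expectation"
        return table  # pass the data through unchanged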
> For what it's worth, you could possibly have an intermediate Python model that...materializes the results to a table
I think this is kind of the answer I was looking for, and in other systems I've actually manually implemented things like this with a "temp materialize" operator that's enabled by a "debug_run=True" flag. With the NB thing I'm basically trying to "step inside" the data pipeline, like how an IDE debugger might run a script line by line until an error hits and then drop you into a REPL located 'within' the code state. In the notebook I'll typically try to replicate (as close as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes. I think the dream for me would be to have something that can say "the delta on line N is responsible for X% of the variance in the output", although I recognize that's probably not a well-defined calculation in many cases. But either way, at a high level my goal is to understand why my data changes, so I can be confident that those changes are legit and not an artifact of some error in the pipeline.
Asserting that a set of expectations is met at multiple pipeline stages also gets pretty close, although I do think it's not entirely the same. It seems loosely analogous to the difference between unit and integration/E2E tests. Obviously I'm not going to land something with failing unit tests, but even if tests are passing, the delta may include more subtle (logical) changes which violate the assumptions of my users or integrated systems (e.g. that their understanding of the business logic is aligned with what was implemented in the pipeline).
"In the notebook I'll typically try to replicate (as close as possible) the state of the data inside some intermediate step, and will then manually mutate the pipeline between the original and branch versions to determine how the pipeline changes relate to the data changes."
You can automate many changes / tests by materializing the parent(s) of the target table and using the SDK to produce variations of a pipeline programmatically. If your pipeline has a free parameter (say top-k=5 for some algos), you could just write a Python for loop and do something like:
client.create_branch()
client.run()
for each variation, materializing k versions at the end that you can inspect (client.query("SELECT MAX ...")).
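As a rough sketch of that loop, assuming the bauplan Python SDK exposes a Client with the create_branch, run, and query calls used above (the keyword arguments, the parameters= option, and the table/column names are assumptions for illustration):

    import bauplan

    client = bauplan.Client()

    for k in (3, 5, 10):                                   # sweep the free parameter
        branch = f"sweep.top_k_{k}"                        # hypothetical branch name
        client.create_branch(branch, from_ref="main")      # zero-copy branch of the catalog
        # re-run the same pipeline code on the branch; passing run-time
        # parameters this way is an assumption about the SDK
        client.run(project_dir=".", ref=branch, parameters={"top_k": k})
        # inspect the materialized output of this variation
        result = client.query(
            "SELECT MAX(score) AS max_score FROM my_output_table",  # made-up table/column
            ref=branch,
        )
        print(k, result.to_pydict())                       # results come back as an Arrow table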
The broader concept is that every operation in the lake is immutably stored with an ID, so every run can be replicated with the exact same data sources and the exact same code (even if it's not committed to GitHub). That also means you can run the same code while varying the data source, or run different code on the same data: all zero-copy, all in production.
As for the semantics of merges and other conflicts, we will be publishing some new research by the end of the summer: look out for a new blog post and paper if you like this space!
Wow that's super cool, thanks for explaining! I will definitely keep an eye out; having that level of certifiability and replayability at a pipeline level is something my team has been having too much angst about. It would be incredible to be able to abstract it away as cleanly as we do code management.
Awesome, you can write me anytime to geek out (jacopo.tagliabue@bauplanlabs.com) or follow us for more community sharing (papers, deep tech blog posts: https://www.linkedin.com/in/jacopotagliabue/).
The full auditability is already here today, though, so we are always happy to hear your feedback on our public sandbox (you can join for free from our website, and reach out to us at any time with good and bad feedback!)
The code you execute on your data currently runs in a per-customer AWS account managed by us. We leave the door open for BYOC based on the architecture we’ve designed, but due to lean startup life, that’s not an option yet. We’d definitely be down to chat about it
1. Great Python support. Piping something from a structured data catalog into Python is trivial, and so is persisting results. With materialization, you never need to recompute something in Python twice if you don’t want to — you can store it in your data catalog forever.
Also, you can request any Python package you want, and even use different Python versions and packages in different workflow steps.
2. Catalog integration. Safely make changes and run experiments in branches.
3. Efficient caching and data re-use. We do a ton of tricks behind the scenes to avoid recomputing or rescanning things that have already been done, and we pass data between steps as zero-copy Arrow tables. This means your DAGs run a lot faster, because the amount of time spent shuffling bytes around is minimal.
RE: workflow orchestrators. You can use the Bauplan SDK to query, launch jobs, and get results from within your existing platform; we don't want to replace it entirely if that doesn't fit for you, just to augment it.
RE: DuckDB and Polars. It literally uses DuckDB under the hood, but with two huge upgrades: one, we plug into your data catalog for really efficient scanning, even on massive data lakehouses, before anything hits the DuckDB step. Two, we do efficient data caching: query results, intermediate scans, and so on can be reused across runs.
As for Polars, you can easily use Polars itself within your Python models by specifying it in a pip decorator. We install all requested packages for each Python model.
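For example, here is a minimal sketch of a Polars model, under the assumption that the decorators look roughly like bauplan.model / bauplan.python and that models receive and return Arrow tables (the table and column names are made up, and exact signatures may differ):

    import bauplan

    @bauplan.model(materialization_strategy="REPLACE")     # persist the output to the catalog
    @bauplan.python("3.11", pip={"polars": "1.5.0"})        # per-step Python version and packages
    def top_pickup_zones(trips=bauplan.Model("my_trips")):  # "my_trips" is a made-up source table
        import polars as pl

        df = pl.from_arrow(trips)                           # the input arrives as an Arrow table
        out = (
            df.group_by("pickup_zone")                      # made-up column
              .agg(pl.len().alias("n_trips"))
              .sort("n_trips", descending=True)
              .head(10)
        )
        return out.to_arrow()                               # hand back Arrow for downstream steps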
> or is this a platform that maybe runs on k8s and provides its own serverless compute resources?
This one, although it's a custom orchestration system, not Kubernetes. (There are some similarities, but our system is really optimized for data workloads.)
We manage Iceberg for easy data versioning, take care of data caching and Python modules, etc., and you just write some Python and SQL and execute it over your data catalog, without having to worry about Docker and all the infra stuff.
> In the end I kind of understand it as similar to sqlmesh, but with a "BYO compute" approach? So where sqlmesh wants to run on a Data Warehouse platform that provides compute, and only really supports Iceberg via Trino, bauplan is focused solely on Iceberg and defining/providing your own compute resources?
Philosophically, yes. In practice so far we manage the machines in separate AWS accounts _for_ the customers, in a sort of hybrid approach, but the idea is not dissimilar.
> Should I understand therefore that this is only usable with an account from bauplanlabs.com ?
Yep. We’d help you get started and use our demo team. Send jacopo.tagliabue@bauplanlabs.com an email
RE: pricing. Good question. Early startup stage bespoke at the moment. Contact your friendly neighborhood Bauplan founder to learn more :)
Wrt deployment: the system has a control plane (on Bauplan's AWS; it never sees any data, just auth and metadata) and data planes for customers (single tenant, private link, SOC 2 compliant and all that).
If by hosting you mean "move the data plane to my cloud", that is entirely possible, but not as recommended as the managed offering: in the end, the only dependency we have is off-the-shelf VMs on which we install our binary - and your bucket of course, but that is yours.
If you mean "installing the control plane on my cloud", that is not in the cards at the moment, unless a very special deployment is needed.
My suggestion - before complex deployment discussion - is always super simple: try it for free on public datasets and decide if you like it; running the quick start takes three minutes, just send over your email for access.
If you do like it, we can have a discussion on deployment, which has never been a blocker before.
so like the Lambda funcs in the examples - do I deploy those myself to my own infra? or do they have to be defined using the Serverless framework and get deployed to Bauplan-controlled infra? are they in the control plane or the data plane?
Just trying to understand how it all fits together
It can be your laptop, an Airflow task, a Prefect flow, a Step Function, or a cron job on a VM - it's the "host" process. (For the data product we picked Lambda because it's the easiest way for people to "run small Python stuff every 5 minutes"; this is a Prefect example: https://www.prefect.io/blog/prefect-on-the-lakehouse-write-a...)
When you interact with the Bauplan lakehouse, all the compute happens on Bauplan; nothing happens in the Lambda. Think of launching a Snowflake query from a Lambda: the client is in the Lambda, but all the work is done in the SF cloud. Unlike many (all?) other lakehouses, Bauplan is code-first, so you can program entire branching and merging patterns with a few lines of code, offloading the runtime to the platform.
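To make the thin-client point concrete, here is a minimal sketch of what such a Lambda handler could look like, assuming the bauplan Python SDK (the branch name, project path, and exact method signatures such as merge_branch are assumptions):

    # lambda_function.py: the Lambda is only the "host"; compute runs on Bauplan
    import bauplan

    def handler(event, context):
        client = bauplan.Client()                          # credentials from the environment (assumption)
        branch = "etl.five_minute_refresh"                 # hypothetical branch name
        client.create_branch(branch, from_ref="main")      # branch the catalog, zero-copy
        client.run(project_dir="./pipeline", ref=branch)   # the pipeline executes on Bauplan, not here
        client.merge_branch(branch, into_branch="main")    # publish results if the run succeeded
        return {"status": "ok"}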
The platform itself runs on standard EC2 instances, which contain the dockerized functions needed for execution. Typically we manage the EC2 in a single-tenant, private-link, SOC 2 compliant account we own, for simplicity, but nothing prevents the VMs from being somewhere else (given connectivity is OK, etc.). It is our philosophy that you should not have to worry about the infra part of it, so even in the case of BYOC we will be in charge of managing that.
Nah, you definitely need calls. The idea that any product sells itself to the degree a venture-backed startup needs is laughable. Lots of potential customers are clueless but excited, and in order to book large contracts, you need someone to be a steward to work the contract through the byzantine maze of leadership and procurement.
Salespeople harangue you for calls because it's an objective fact that it works to bring more dollars in, and the idea that they say some magic words and then the customer suddenly wants to buy is childish. They identify and address needs and pain points.
> Lots of potential customers are clueless but excited and in order to book large contracts, you need someone to be a steward to work the contract through the byzantine maze of leadership and procurement.
That's called exploitation, not stewardship.
It is what it is, but let's not pretend that the relationship here is anything but adversarial. The incentives are such that dishonesty and malice bring in more sales, so honest salespeople quickly get outcompeted by their dishonest co-workers, and companies with honest business models get outcompeted by those with dishonest ones. Buyers are in no position to change this, but that doesn't mean they have to pretend it's fine, or play along.
Success like he had is a filter. To be there you have to go through a lot of “sure you’ve done well so far, but you’ll never make it to the next level” conversations in your life. By the time you’re Steve Jobs, you’ve been right in the face of doubt thousands of times. Type of thing that makes someone think they can cure easily treatable pancreatic cancer with crystals or whatever.
Give it another two years of high interest rates washing people out into non-tech sales, finance, trades, whatever... but there are plenty out there, the bros are just far louder.