Hey guys, first off I just wanted to say: amazing work on this library. It's really impressive, and I'm excited both to try things out myself and to see what everyone else ends up doing with it 🙂
Anyway, this isn't an ISSUE so much as a QUESTION, just something for my own curiosity, so no worries if you don't get back to it for a while, or at all -- I just figured I'd ask since I think it's interesting, hope that's ok.
So I wanted to play around a bit in the shell, and I uploaded a Parquet file (a common one, a subset of the NYC taxi ride dataset, about 500MB in total) to a public S3 bucket and set up CORS so that it could be read by shell.duckdb.org. Then I went to try out a few queries on the remote data file:
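For reference, the queries were roughly of this shape -- the bucket URL, column names, and exact aggregation below are illustrative stand-ins, not the literal statements from my shell session:

```sql
-- Illustrative only: the URL below is a placeholder for the public S3 object
-- I uploaded, and the columns are just typical NYC taxi fields.
SELECT passenger_count,
       count(*)          AS rides,
       avg(total_amount) AS avg_fare
FROM parquet_scan('https://my-bucket.s3.amazonaws.com/nyc-taxi-subset.parquet')
GROUP BY passenger_count
ORDER BY passenger_count;
```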
First of all, I want to say again that this is just amazing -- it pulled less than 30MB to answer an analytical query like this on a 500MB dataset, which just blows my mind 💯
But secondly -- it's kinda slow! That query took over a minute to come back, and most of that time seemed to be network. To confirm, I opened up the devtools and checked the network tab, and saw a whole bunch of tiny sequential range requests that seemed to be making sure we couldn't get too much work done at once:
So, I guess the TL;DR is that I want to know more about these! I'm guessing they can't "just" be fully parallelized, i.e. we need information from the previous response in order to know what the next request should look like. But maybe we could make fewer of these calls if we increased the byte range size when the Parquet file is large? Or implemented some kind of request batching? Since this was the only hitch I ran into in an otherwise really cool experiment, I just wanted to hear your thoughts or plans.
Thanks a lot in advance, both for looking at my question and for building this in the first place!