Parquet/Arrow is a great format for pandas - fast, and nicely compact in terms of file size for those of us who don't have the luxury of directly attached NVMe SSDs and for whom I/O bottlenecks are a consideration.
Even after issues with its inability to map datetime64 values properly, I was reasonably happy with my design choice.
I became less happy on discovering that it's very weak as an interchange format for cross-language work.
In my case I wanted to use some existing JVM-based tooling. This caused huge pain. The JVM/Java library/API is a complete mess, sorry, and if people are complaining about the Python/C++ documentation, there's basically nothing for the Java library.
It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
And the API is barely above exposing the file format. Nothing like "load this parquet file" into some object which you can then query for its contents - you're dealing with blocks and sections and other file-format-level entities.
The other issue is caused by its flexibility - for example, pandas dataframes are written with what's effectively a bunch of "extension metadata", which means it works great for reading and writing pandas from Python, but don't expect anything to be able to work with the files out-of-the-box in other languages.
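For the curious, that extension metadata is easy to see from the Python side; a minimal sketch (the file name is just an example, and it assumes pandas is writing through the pyarrow engine):

```python
# Inspect the pandas-specific metadata that pandas/pyarrow embed in a Parquet file.
import json
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"ts": pd.date_range("2020-01-01", periods=3), "x": [1, 2, 3]})
df.to_parquet("example.parquet")  # uses the pyarrow engine when it is installed

schema = pq.read_schema("example.parquet")
pandas_meta = json.loads(schema.metadata[b"pandas"])  # the "extension metadata"
print(pandas_meta["index_columns"], pandas_meta["pandas_version"])
```

Readers in other languages only see this as opaque key/value metadata, which is exactly the interop gap described above.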
In the end, the only way I could get reliable reading and writing from the JVM was to store only numeric and string data from the Python side. Even then it feels flaky - with a bunch of Hadoop warnings and deprecation warnings. I know the JVM has little appreciation in the data science world, which is maybe a reason for the sorry state of the Java library.
Edit: to be specific, I am talking about my experiences with Arrow/Parquet.
What you've written sounds like a criticism of the JVM data analytics ecosystem (the Java Parquet library in particular) and not Apache Arrow itself. Parquet for Java is an independent open source project and developer community. For example, you said
> It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
For C++ / Python / R many of the developers for both Apache Arrow and Apache Parquet are the same and we currently develop the Parquet codebase out of the Arrow source tree.
So, I'm not sure what to tell you, we Arrow developers cannot take it upon ourselves to fix up the whole JVM data ecosystem.
I'm not expecting anything really and I do appreciate your work and effort. And it's a specific use case for arrow, I guess.
But on your landing page it's claimed that "Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs" and that "Libraries are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust." This certainly gave me the impression that more than just Python, C++ and R would be well supported.
The JVM isn't completely irrelevant in data science given the position of Spark/Scala. This also raised my expectations of arrow/parquet, because Parquet seems to be the de-facto standard for table storage on that JVM platform. And I experienced no issues on that platform.
To be clear, I'm not blaming you for my design decision (I'm a software engineer, not a data scientist, btw), and I still think parquet/arrow rocks for Python, but in my experience it doesn't really deliver a usable "cross-language" file format at the moment.
Again, I have to object to your use of “arrow/parquet”. These are not the same open source projects and while people use them together it isn’t fair to the developers of each project for you to discuss them like a single project.
FWIW, while the JVM isn't completely irrelevant in data, I will say, even as a big user of Spark via Scala, that JVM languages are quickly becoming irrelevant in data. Spark's Scala API is simultaneously the core of the platform, and also very much a second-class citizen that lacks a lot of important features that the Python API has. Easy interop with a good math library, for example.
Similarly, the reference implementation of Parquet may be in Java, but consuming it from a Java language, outside of a Spark cluster, is still a royal pain. Whereas doing it from Python isn't too bad.
Long story short, I think that expecting a project that's just trying to implement a columnar memory format to also muck out the world's filthiest elephant pen is perhaps asking too much. Though perhaps a project like Arrow could serve as the cornerstone of an effort to douse it all with kerosene and make a fresh start.
I spent a couple of years doing consultancy for life sciences research labs, most people were just using Excel and Tableau, plugged into OLAP, SQL servers, alongside Java and .NET based stores.
Stuff like Arrow doesn't even come onto the radar of IT.
You do raise a very important point. At my organisation, Apache Avro was selected by the Java devs due to the "cross-platform" marketing. However, they found out too late that the C/C++ implementations were too buggy/incomplete to effectively interoperate with the Java versions.
Keep in mind that Arrow Java<->C++/Python interop has been in production use in Apache Spark and elsewhere for multiple years now. We have avoided some of the mistakes of past projects by really emphasizing protocol integration tests across the implementations.
That could be improved without fixing the whole JVM data ecosystem, but that's mostly up to JVM developers. It's unfortunate if the Spark developers using Arrow aren't contributing in this area (especially since many of them are being paid), but it's all open source and undoubtedly pull requests are welcome.
Congratulations on the 1.0 release, it's only going to keep getting better! Really exciting to be able to share data in-memory across languages.
I can't speak to the relative merits of Julia, but I am honestly interested in anything that seeks to produce a less memory-hungry alternative to Spark.
Rust, to me, seems like a natural enough choice. It is easy to mate it to other languages, including all the major data science ones, so it would theoretically work well as the basis for a distributed compute engine that has good support for all of them as client languages. Would the same work for Julia? IIRC, it's a bytecode compiled language, which I imagine would make it difficult to link Julia libraries from other technology stacks.
I experienced all of these issues with JVM<->Python interop with Parquet. It does not surprise me that they extend to Arrow as well. It’s incredibly frustrating to say the least.
> And the API is barely above exposing the file format. Nothing like "load this parquet file" into some object which you can then query for its contents - you're dealing with blocks and sections and other file-format-level entities.
Yes, it's just a building block. The easiest way to use Parquet on Java is to use Spark's integration with it, because it provides the query engine for you. But, if I'm not mistaken, it's much bigger than Pandas.
I also strongly dislike the Hadoop coupling (hey, it's called parquet-mr for a reason...), but it's more or less an invisible annoyance if you use Spark in an environment like Amazon EMR.
I think you'll get a lot of confused responses from people if you want to try to build an application that reads Parquet directly. It's the disk format for a bunch of distributed database engines. They'll wonder how you plan on querying it.
I found parquet files to be slower than just serializing and compressing the pandas data frames. Haven't looked back since. Of course this was an older version of pandas so the DF to parquet functionality may be much improved.
A serialization format for Pandas isn't really the core use case of Parquet. Pandas will slurp the whole thing into memory, which doesn't take advantage of the columnar format or pushdown filtering features. It gets more interesting if you use it with a tool that uses them to avoid some large fraction of disk I/O when performing a selective query.
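For example, a selective read with pyarrow touches only the requested columns and can skip whole row groups via the filters argument - a rough sketch (file and column names are invented):

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "events.parquet",
    columns=["user_id", "amount"],                 # column pruning: only these are read
    filters=[("event_date", ">=", "2020-01-01")],  # row-group skipping via column statistics
)                                                  # assumes event_date is a sortable string/date column
df = table.to_pandas()
```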
Considering the number of segfaults I've witnessed that appear to originate in relatively recent versions of the Parquet library, I don't think your instincts are misplaced. I genuinely worry about data corruption.
I've been meaning to take a closer look at ORC. As a Spark user, I just sort of defaulted into Parquet. ORC is very similar, though, and seemingly gives every indication of being the more mature product.
On balance I find writing your own tools is useful when your use cases are narrow but becoming a wizard in other tools is useful otherwise. I'm lucky that I get to define my own use cases.
Hi, Wes (Apache Arrow co-creator and Python pandas creator) here! If you're wondering what this project is all about, my JupyterCon keynote (18 min long) from 3 years ago is a good summary, and the vision/scope for what we've been doing since 2016 has been pretty consistent.
I'm a big fan of Pandas, and didn't know about Arrow. I've been considering doing a talk advocating for a consistent data-frame API across languages since, IMHO, it's the next fundamental data structure that should have baked-in support everywhere. So it appears you've at least somewhat beaten me to the punch.
Since Arrow is more than an API to tabular data structures, what would you think about a Promises/A+-like specification for dataframes?
How much of the Arrow API do you think end users will wind up using, as opposed to being a lower-level framework that projects like pandas and dplyr wind up using behind the scenes?
Finally, do you think that Arrow has the potential to be the logical successor to pandas? If not, what is your long term strategy to address the shortcomings that you see in pandas?
I use pandas every day, thank you for that. Just watched this keynote and I really like the vision; I currently work with a bunch of guys who prefer dplyr.
Is Arrow just for in-memory analytics, or are there plans to support in-database analytics too?
Thanks Wes, pandas and Arrow are great projects. Is Feather now ready for long-term storage with the V2 release? And now that it's just a renaming of the Arrow IPC format, what's its future?
You can store them long-term if you want (and you'll still be able to read them 5 years from now) but we aren't optimizing the Arrow IPC format for the _needs_ of long-term storage.
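For anyone who hasn't tried it, the Feather V2 path from Python looks roughly like this (a minimal sketch; the file name and compression choice are illustrative):

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
feather.write_feather(df, "data.feather", compression="zstd")  # V2 (Arrow IPC) format
df2 = feather.read_feather("data.feather")
```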
I’m also very confused about the relationship of Arrow’s stability guarantees and that of the on-disk feather. Can we safely switch from parquet to feather for long-term data storage?
Same question here - why Feather, and why is it named differently?
Also, are Parquet and Arrow the same? df.to_parquet('df.parquet.gzip', compression='gzip') will not use Arrow, I presume? I have to use a separate library to save to Parquet using Arrow. A bit confused.
Graphistry uses parquet as a more stable and thus persistent storage format when folks save data, and arrow for ephemeral internal data where we're ok (and somewhat enjoy) version changes, as that just means code version upgrades. There are performance differences in practice such as parquet having more per-column compression modes built in, making it attractive for colder storage, and arrow for in-memory/rpc/streaming/etc for similar reasons.
RE: Feather -- Arrow itself isn't necessarily a full file format -- you can imagine memory buffers being all over the heap with giant gaps in between b/c different cols were generated at different times -- but in practice folks will indeed serialize consolidated buffers to disk (pa.Table -> write stream -> file). If we couldn't do that, RPC wouldn't work ;-) My understanding of Feather is that it standardizes ideas around this consolidation, but we are able to save to disk (within versions) without it. We found it more predictable to stick to ~Parquet for storage and Arrow buffer passing for streaming, but now that Feather networking APIs for accelerated bulk transfers may be stabilizing, there may be speed advantages to using it over manual buffer streaming (and still sticking w/ Parquet for persistent files).
Arrow<>Parquet conversion is super fast b/c of the co-design around similar concepts: both using record batches of dense binary column buffers means implementations can pointer-copy, memory map, use bulk copy primitives, etc. for zero-copy or at least highly accelerated interop. Python RAPIDS GPU kernels can therefore selectively stream in a few parquet columns across many parquet files through a single 900GB/s GPU, compute over them, and write back out to arrow or a new parquet.
Congrats! And a big thank you to Wes for being such a devoted community leader!
We've been on quite the journey here: https://www.graphistry.com/blog/graphistry-2-29-5-upload-100... . Think json -> protobuf -> arrow, and paralleled in our work on parallel js -> opencl+js -> python rapids/cuda for accelerated native compute over it. The blogpost demos the ~100X bigger & faster dataset result of supporting Arrow for our uploader & RAPIDS for our parser when you can't and want us to convert for you.
Something folks miss with Arrow, IMO, is that it is like Google protobufs for everyone else. Arrow is not just a nice binary format, but is also ready out-of-the-box for streaming, rich datatypes, and, for larger/longer-term projects, standardized & automatically self-describing schemas. If you've had to manually decipher, maintain, and update generated protobuf schemas (because you aren't Google with all the internal ~protobuf tooling / integrations / infra / etc.), that should sound pretty good ;-) A lot more to do, but already way ahead of most other things, esp. in aggregate.
given the sheer number and scope of Apache projects, it would be nice to have the title mention what it is (eg, "Apache Arrow - in memory analytics - 1.0.0") esp when the linked page doesn't directly say.
Normally I would agree but Apache Arrow already has significant name-recognition to people who are likely to use it -- data engineers/scientists etc. It's a library that provides an in-memory columnar format, and supplies a data engine currently used in Apache Spark, Pandas, and libraries like turbodbc, which helps these tools achieve high performance on operations on tabular data.
Having a single high-performance in-memory format means different programs can read/write from the same source without serializing/copying/deserializing. For instance, if you wanted to pass a huge table of data from R to Java to Python (because your tools span different languages), normally you'd have to copy and serde (Protobuf? JSON?) to pass the data, which equals huge overheads. With Arrow, each of those languages can directly interact with the same copy of in-memory data, in-process -- with the highest possible performance.
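A minimal sketch of that idea from Python, writing the Arrow IPC file format and then memory-mapping it so the reader's buffers reference the file rather than a copy (file name is illustrative):

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write the table as an Arrow IPC file.
with pa.OSFile("shared.arrow", "wb") as sink:
    writer = pa.ipc.new_file(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Memory-map it back: the resulting buffers point into the mapped file,
# so another process (or another language's Arrow library) can do the same.
with pa.memory_map("shared.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
```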
You also get the performance of columnar databases without implementing your own columnar data structure.
But of course, no harm adding a short description to the title to broaden its audience. Arrow is truly something amazing and the more people know about it the better.[1] Folks who program against traditional databases might not know about it, and I think they should, especially if they need to generate analytics (i.e. fast filtering/aggregation for dashboarding or for data pipeline tasks).
I spend 8+ hours a day working with/ingesting data. Had no clue. Checked with my data-using colleagues just to see if any of them had heard of Arrow - about 50% had. These are people who spend their entire day working with/ingesting data in various pipelines. So, while I won't comment on whether the subtitle would be useful, I'm pretty confident that the (perhaps not large) majority of people who would/could use Apache Arrow had not heard of it.
A lot of people just stay in their lane. Having a solid description, like the one you provided, would be super useful. Perhaps a tool-tip feature of HN, even.
This is just a mismatch with how HN posts pages and how projects expect their release pages to be read. Release pages are for users to know what changed. But HN likes to link to original sources and not articles about the source. In cases like this, we end up pretending the release page is meant to be read by a wider audience than the writers ever intended it for - not Apache's fault! Apache Arrow's homepage does have that one-sentence description on it ('A cross-language development platform for in-memory analytics').
A blog post announcing version 1.0 is a good opportunity for a one-liner to explain what the heck your project is :) Just in case your audience happens to be expanding instead of contracting...
Apache Arrow? First thing off the top of my head (given that I had to work with this these past few days): the Python Snowflake connector depends on it if you intend to fetch the results from Snowflake as a pandas or Dask dataframe. In other words - lots of places.
(contributor here) At my company our proprietary database layout is based on Arrow. Our workload is analytical and we are using it for read mostly workloads.
Just to be clear, Arrow is not an "in memory analytics" solution but rather an "in memory columnar data store" that might be useful when doing analytics.
I am eternally indebted to you for Pandas. Many thanks for that.
Are you talking about there being support for multiple language libraries like PyArrow or about there being multiple Apache projects that utilize Arrow like Parquet and Spark?
If not, I'm not following what sub-projects you are speaking about. As far as I know, Arrow is principally the Arrow Columnar Format and Arrow Flight with some other potentially interesting interfaces for compute kernels and CUDA devices.
The Arrow project contains implementations in multiple languages. Some of these languages contain code that can evaluate expressions against Arrow data, or even execute full queries. The C++ and Rust implementations contain query capabilities, and the Java implementation contains the Gandiva library that can delegate to C++ via JNI to evaluate expressions, for example.
Is there some documentation for this on the Arrow website somewhere? I've been looking for info on the "compute engine" that's mentioned in this 1.0 announcement but haven't found much.
In general, where's the best place to learn more about Arrow? I've approached it several times, and can find a lot about how to integrate it into other products, but none of the tools like the query engines that I would find very useful.
I checked recently, and the Rust implementation of arrow — parquet as well — is sadly still not usable on projects using a stable version of the compiler because it relies on specialization.
There are some Jira issues on this, but there doesn't seem to be a consensus on the way forward. Does someone have more information, is the general idea to wait for specialization to stabilise, or is there a plan, or even an intention, to stop relying on it?
Last I checked, when I tried the library, the blocker on stable was packed_simd, which provides vector instructions in Rust. I can imagine the arrow/datafusion guy isn't too keen on dropping vectorized instructions, as that would be giving up a huge advantage, and I imagine it's used liberally throughout the code.
As for stabilizing packed_simd, it's completely unclear to me when that will land in stable Rust. I recently had a project where I just ended up calling out to C code to handle vectorization.
> I recently had a project where I just ended up calling out to C code to handle vectorization.
packed_simd provides a convenient platform independent API to some subset of common SIMD operations. Rust's standard library does have pretty much everything up through AVX2 on x86 stabilized though: https://doc.rust-lang.org/core/arch/index.html --- So if you need vectorization on x86, Rust should hopefully have you covered.
If you need other platforms or AVX-512 though, then yeah, using either unstable Rust, C or Assembly is required.
The main blocker is specialization, as it's being used in several places, including the parquet and arrow crates. As this feature is unlikely to be stabilized, we'll need to replace it with something else, but so far that has been challenging.
Thanks for clarifying that the plan is to stop relying on it :) Is there a specific place/issue/ticket to discuss this? If time allows, I would be interested in helping out.
We've been using Arrow for a few years in our startup (link in profile) - it's been great as a common format for passing tabular data between Python and Javascript.
Haven't used the in-memory/zero-copy features as much, but as a binary, high-performance, typed format for the data analytics world it can't be beat. And now that the Feather format is basically the Arrow format, I expect to see it really take off as a common interchange and even storage format for medium- to even long-term projects.
(Also nice to see the reduced Python wheel size - a little bonus for the 1.0 release :) )
We extensively use Apache Arrow to store data files as Parquet files on S3. This is a cheap way to store data that doesn't require the query speed of a relational (or non-relational) database. The main advantage of Arrow is that it is a columnar format, and it loses no information in transit, unlike the nightmare of CSV files.
I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.
Wes McKinney (Pandas creator and Arrow co-creator) will be giving an Arrow talk live this Thursday for those who want to learn more -> https://subsurfaceconf.com/summer2020
Apache Arrow is a nice in-memory data structure finding use in a wide variety of projects, especially in data science, thanks to its Feather format, which can be used to save a data frame from memory to disk.
This can be particularly useful for low-memory systems like ARM SBCs when conducting long-duration research. If you want to build Apache Arrow from source for ARM, I've written a How-To here[1].
Glad to see the 1.0.0 release. I'm hoping that it will help reduce resistance when I nudge people toward it. It's also good to see progress in the documentation. For a long time it's been very hard to dig into anything but the C++ or Python libraries.
As I try to move to Arrow/Flight over using JSON or binary formats like Protobuf, one thing I see missing is tooling or schema -> code generation. It would be nice to see some progress or roadmap in that direction as well.
Polyglot, zero-copy serialization protocols (Arrow, FlatBuffers, Cap'n Proto) are a trend I really appreciate.
I like to imagine a future where data is freed from the languages/tools used to operate on them. In-memory objects would then be views into data stored in shared memory (or disk), with the data easily read/manipulated from multiple languages.
It's basically a terrible joke at this point. There's no single Apache page helping you decide which one you want, and they all seem to have such large overlap. Most of them seem to have bad documentation, and give the appearance of not really being maintained.
This puts me off even trying to use them. If there's this much scope creep/NIH/reinventing the wheel happening across the board, I can't imagine how bad each product is individually.
When I first saw Arrow I thought the biggest benefit was a good file format. (Feather, I think, is the format.) However, it seems to be designed for passing tables between languages in the same process.
If I never use multiple languages in the same process can I safely ignore it or are there other benefits?
In Python, we use Arrow a lot for going between diff frameworks: file -> ( pandas <> cudf <> cugraph ). When we work with DB vendors, we now push them to provide Arrow instead of whatever proprietary in-house thing they've been pushing for the same, e.g., slow JSON / ODBC, untyped CSV, or some ad-hoc wire protocol. Also, we use it for going between our services within a server + across servers.
On the nodejs services + frontend JS side, there was no real equiv tool for interop -- traditional soln is slowly round-tripping through SQL/ORM or manually doing protobuf through some sort of pubsub -- so Arrow is part of how we have been taming that mess too. It'll be a longer journey for Arrow or something like it to get adopted in JS land, but I can see folks doing TypeScript and serverless wanting it for a cleaner solution to typed data, faster serialization & streaming, etc. (There is no true JS equiv of pandas nor a typed variant.) We were a bit early here b/c we wanted streaming through WebGL + OpenCL/CUDA, and while we are fans of typed data, found protobuf tooling to be too unintegrated and manual in practice.
How successful have you been in pushing database vendors to use arrow on the wire? We've started using turbodbc (to connect to SQL Server), but as I understand it the data on the wire is still in odbc's row-oriented wire format, and turbodbc is responsible for packing into arrow column oriented buffers.
As you mention in your second paragraph, arrow is perfect for getting typed query results from a database server directly to a web frontend with minimal overhead. With the Transferable interface, it's even possible to zero-copy transfer the arrow data from the network buffer to a webworker.
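For context, the turbodbc path mentioned above looks roughly like this from Python (the DSN and query are placeholders; as noted, the bytes still cross the wire in ODBC's row format before turbodbc packs them into Arrow buffers):

```python
from turbodbc import connect

connection = connect(dsn="my_sql_server")   # placeholder DSN
cursor = connection.cursor()
cursor.execute("SELECT user_id, amount FROM payments")
table = cursor.fetchallarrow()              # result set as a pyarrow.Table
df = table.to_pandas()
```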
HDF I've not used in any production workflow. Parquet I've really only used with Impala. If I'm working in that system I'm in Spark already, typically.
I wish Excel / Google Sheets supported Parquet export. I've worked with CSV files and Parquet files and found the latter a vast improvement. Chunking data is trivial, you have fixed-size data types, and existing types are well-supported across a multitude of frameworks.
In comparison, CSVs are pretty much unstructured text files with added suggestions.
More pipelines should natively support Parquet files, or something like it, and have a thin CSV to Parquet conversion layer on top.
The docs are dire. Lots and lots about internals and how it works, and after a few minutes of searching, I've yet to see what it actually does or what I'd want to use it for. The "use cases" are even overly technical things that are building blocks for ... something. What is that something?
If you are already doing Big Data / ML / deep learning work in Spark or Dask, Arrow is (with caveats) more or less a drop-in performance optimization (because it's optimized for in-memory vectors) and complexity reducer (because it does all the interop between JVM, Python, TF, etc)
So by itself it doesn't necessarily introduce new use cases that weren't there before, except where performance was a bottleneck.
But we're talking like 500x performance increase on workloads vs ODBC, so night and day for a lot of data scientist pipelines.
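As a rough illustration of the Spark side (the config key shown is the Spark 3.x name; older releases use spark.sql.execution.arrow.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sdf = spark.range(1_000_000)
pdf = sdf.toPandas()   # columnar transfer via Arrow instead of row-by-row pickling
```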
Judging from the faq (https://arrow.apache.org/faq/), it seems like it's a storage format for large structured datasets, designed specifically for use in-memory and for serialization across the network. It provides a spec for the data format, as well as implementations for several programming languages.
That seems pretty straightforward, what part was confusing for you?
It boils down to a standard columnar format usable from disk or memory that is useful for data interchange between languages or frameworks, as well as any use case where a columnar format alone has benefits.
judging from the comments here, it's an in-memory (in-process?) columnar database format without a DBMS attached. sounds useful when sqlite is too slow for your aggregations or if you do processing in X (C, Java) and visualization in Y (R, Python) - just guessing, haven't used it.
Can you talk about how you use it? At a cursory glance, it seems like it's being used to great effect in databases, or systems designed to query data. Is that what you're using it for?
From my experience, Parquet files are a better alternative to CSV or SQLite files for data at rest. I've observed that Parquet files tend to be smaller than Feather files (which are essentially the Arrow format in a file) for the same data. Arrow is great for in-memory data, but even with compression, the on-disk size of Feather files tends to be quite a bit bigger than Parquet files.
"for interchange" in data science might well mean data on-the-move rather than data at-rest: i.e. a pipeline in two stages, step A in python and step B in R. The data is exchanged (relatively) quickly via disk, and speed beats compression.
Even with fast SSDs, memcpy() is roughly an order of magnitude faster than writing to disk. Using a fast compression algorithm like lz4, zstd, etc can help reduce that gap. And this is invariant of whether the serialization format is Arrow or Parquet. It’s not either, or. If you’re writing to disk, with compression you can have both speed and reduction in the disk footprint.
Big fan of Arrow (and Parquet) and we use it at UrbanLogiq extensively. It's a wonderful toolset and great for data interop across multiple languages and environments!
I used Apache Arrow to store and process a relatively large amount of data, 2-3B records with ~10 fields each (still fits on one machine, but it's large enough that I have to think about it a little bit).
I was originally using a simple binary format for fast decoding, and switched to Arrow to be able to select only a few columns at a time. I was impressed by the speed gains, and the size benefits of storing data in column order.
The one thing I wish Arrow had was a way to attach some metadata to its files (if there is, I haven't found it). I originally tried to write a small header before starting to write the Arrow data, but that made it impossible to read back the data: as far as I can tell Arrow looks at the whole file and stores its column definition at the end of the file and computes data offsets based on the start of the file, meaning that there's nowhere left for me to store anything.
It's still a very promising library and I'll definitely be checking out the 1.0 release.
pyarrow's metadata in the header is for stuffing in uninterpreted bytes, so we put semantic stringified JSON data there for version numbers etc. that are beyond the data-rep typing. Super useful when maturing from localized notebook code to software services standardized on Arrow interop.
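A minimal sketch of that pattern with pyarrow (the keys and values are made up; schema metadata is plain bytes, so stringified JSON works fine):

```python
import json
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"x": [1.0, 2.0, 3.0]})
custom = {
    b"app_version": b"1.4.2",                            # made-up keys/values
    b"semantics": json.dumps({"x": "metres"}).encode(),
}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **custom})

feather.write_feather(table, "data.feather")
print(feather.read_table("data.feather").schema.metadata)  # custom keys come back
```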
HDF5 has no support for dataframes/tables; projects like PyTables have been built on top of it, but these are non-standard and don't work across languages.
So, maybe I'm stupid and never faced enough big data, but what's the advantage of this versus a custom script that queries A and inserts into B?
I do that kind of stuff all the time with Go and it's pretty fast with 20~40 million records, averaging 100 KB each. Are those tools oriented to billions, instead of millions? What are the benefits?
* Standardizes binary interop and "serialization" of large structured data, removing all conversions / serialization at ingest and export boundaries. This alone can mean > 2-100x performance improvement in an application that processes a lot of data
* The Arrow in-memory format is an ideal data structure to code analytical algorithms against.
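As a small illustration of the second point, newer pyarrow releases ship compute kernels that run directly against the columnar buffers, no pandas round-trip needed (column names here are invented):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"amount": [10.0, 25.0, 5.0], "region": ["eu", "us", "eu"]})
mask = pc.equal(table.column("region"), "eu")
eu_total = pc.sum(pc.filter(table.column("amount"), mask))
print(eu_total.as_py())  # 15.0
```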
Check out the post from the author of pandas on pandas, pandas2 and Arrow [1].
This might answer your question about the significance of Arrow, given that today pandas is kind of a basic ingredient in scientific computing, AI and ML.
I see Arrow as a "data frame server" and data frames have a higher level of abstraction for use in analysis. But you're right that raw processing can be done with most languages natively much faster. But Arrow can let a Python statistician and an R statistician compare apples to apples easily.
That's precisely it -- you don't have to use protobuf etc. because you don't have to serialize/deserialize at all.
Languages that use Arrow can interact with the source data directly in-memory, in-process. No need to move any data around. The fastest serde operation is one that you don't have to do at all.
Flatbuffers is a low-level building block for binary data serialization. It is not adapted to the representation of large, structured, homogeneous data, and does not sit at the right abstraction layer for data analysis tasks.
Arrow is a data layer aimed directly at the needs of data analysis, providing a comprehensive collection of data types required for analytics, built-in support for “null” values (representing missing data), and an expanding toolbox of I/O and computing facilities.
The Arrow file format does use Flatbuffers under the hood to serialize schemas and other metadata needed to implement the Arrow binary IPC protocol, but the Arrow data format uses its own representation for optimal access and computation.
Zero-copy serialisation is offered by Flatbuffers - and one additionally has the advantage of an IDL as a contract between different technology platforms.
At my company, we query from A and store the data in PyArrow Parquet files on S3. Using Athena, this data can then be queried directly. This is a cheap way to store data that doesn't need the performance of a database server.
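A rough sketch of that pattern, assuming S3 access is already configured (the bucket, paths, and partition column are placeholders; older pyarrow versions may need an explicit filesystem object such as s3fs):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"event_date": ["2020-07-01", "2020-07-02"], "amount": [10.5, 3.2]})
table = pa.Table.from_pandas(df)

# Partitioned layout that Athena/Glue can point at and query in place.
pq.write_to_dataset(
    table,
    root_path="s3://my-bucket/events/",   # placeholder bucket/prefix
    partition_cols=["event_date"],
)
```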
Arrays of structs fit pretty naturally into many languages, but structs of arrays (and keeping corresponding values in order) are unusual enough to need plumbing that probably isn't there.