Parquet/Arrow is a great format for pandas - fast, and nicely compact in terms of file size for those of us who don't have the luxury of directly attached NVMe SSDs and for whom I/O bottlenecks are a consideration.
Even after issues with its inability to map datetime64 values properly, I was reasonably happy with my design choice.
I became less happy on discovering that it's very weak as an interchange format for cross-language work.
In my case I wanted to use some existing JVM-based tooling. This caused huge pain. The JVM/Java library/API is a complete mess, sorry, and if people are complaining about the Python/C++ documentation, there's basically nothing for the Java library.
It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
And the API is barely above exposing the file format. Nothing like "load this parquet file" into some object which you can then query for its contents - you're dealing with blocks and sections and other file-format-level entities.
The other issue is caused by its flexibility - for example, pandas dataframes are written with what's effectively a bunch of "extension metadata", which means it works great for reading and writing pandas from Python, but don't expect anything to be able to work with the files out-of-the-box in other languages.
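For the curious, that extension metadata is easy to see from the Python side; a minimal sketch (the file name is just an example, and it assumes pandas is writing through the pyarrow engine):

```python
# Inspect the pandas-specific metadata that pandas/pyarrow embed in a Parquet file.
import json
import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"ts": pd.date_range("2020-01-01", periods=3), "x": [1, 2, 3]})
df.to_parquet("example.parquet")  # uses the pyarrow engine when it is installed

schema = pq.read_schema("example.parquet")
pandas_meta = json.loads(schema.metadata[b"pandas"])  # the "extension metadata"
print(pandas_meta["index_columns"], pandas_meta["pandas_version"])
```

Readers in other languages only see this as opaque key/value metadata, which is exactly the interop gap described above.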
In the end, the only way I could get reliable reading and writing from the JVM was to store only numeric and string data from the Python side. Even then it feels flaky - with a bunch of Hadoop warnings and deprecation warnings. I know the JVM has little appreciation in the data science world, which is maybe a reason for the sorry state of the Java library.
Edit: to be specific, I am talking about my experiences with Arrow/Parquet.
What you've written sounds like a criticism of the JVM data analytics ecosystem (the Java Parquet library in particular) and not Apache Arrow itself. Parquet for Java is an independent open source project and developer community. For example, you said
> It's barely usable and the dependencies are horrific - the whole thing is mingled with Hadoop dependencies - even the API itself.
For C++ / Python / R many of the developers for both Apache Arrow and Apache Parquet are the same and we currently develop the Parquet codebase out of the Arrow source tree.
So, I'm not sure what to tell you, we Arrow developers cannot take it upon ourselves to fix up the whole JVM data ecosystem.
I'm not expecting anything really and I do appreciate your work and effort. And it's a specific use case for arrow, I guess.
But on your landing page it's claimed that "Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs" and that "Libraries are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust." This certainly gave me the impression that more than just Python, C++ and R would be well supported.
The JVM isn't completely irrelevant in data science given the position of Spark/Scala. This also raised my expectations of arrow/parquet, because Parquet seems to be the de-facto standard for table storage on that JVM platform. And I experienced no issues on that platform.
To be clear, I'm not blaming you for my design decision (I'm a software engineer, not a data scientist, btw), and I still think parquet/arrow rocks for Python, but in my experience it doesn't really deliver a usable "cross-language" file format at the moment.
Again, I have to object to your use of “arrow/parquet”. These are not the same open source projects and while people use them together it isn’t fair to the developers of each project for you to discuss them like a single project.
FWIW, while the JVM isn't completely irrelevant in data, I will say, even as a big user of Spark via Scala, that JVM languages are quickly becoming irrelevant in data. Spark's Scala API is simultaneously the core of the platform, and also very much a second-class citizen that lacks a lot of important features that the Python API has. Easy interop with a good math library, for example.
Similarly, the reference implementation of Parquet may be in Java, but consuming it from a Java language, outside of a Spark cluster, is still a royal pain. Whereas doing it from Python isn't too bad.
Long story short, I think that expecting a project that's just trying to implement a columnar memory format to also muck out the world's filthiest elephant pen is perhaps asking too much. Though perhaps a project like Arrow could serve as the cornerstone of an effort to douse it all with kerosene and make a fresh start.
I spent a couple of years doing consultancy for life sciences research labs, most people were just using Excel and Tableau, plugged into OLAP, SQL servers, alongside Java and .NET based stores.
Stuff like Arrow doesn't even come onto the radar of IT.
You do raise a very important point. At my organisation, Apache Avro was selected by the Java devs due to the "cross-platform" marketing. However, they found out too late that the C/C++ implementations were too buggy/incomplete to effectively interoperate with the Java versions.
Keep in mind that Arrow Java<->C++/Python interop has been in production use in Apache Spark and elsewhere for multiple years now. We have avoided some of the mistakes of past projects by really emphasizing protocol integration tests across the implementations.
That could be improved without fixing the whole JVM data ecosystem, but that's mostly up to JVM developers. It's unfortunate if the Spark developers using Arrow aren't contributing in this area (especially since many of them are being paid), but it's all open source and undoubtedly pull requests are welcome.
Congratulations on the 1.0 release, it's only going to keep getting better! Really exciting to be able to share data in-memory across languages.
I can't speak to the relative merits of Julia, but I am honestly interested in anything that seeks to produce a less memory-hungry alternative to Spark.
Rust, to me, seems like a natural enough choice. It is easy to mate it to other languages, including all the major data science ones, so it would theoretically work well as the basis for a distributed compute engine that has good support for all of them as client languages. Would the same work for Julia? IIRC, it's a bytecode compiled language, which I imagine would make it difficult to link Julia libraries from other technology stacks.
I experienced all of these issues with JVM<->Python interop with Parquet. It does not surprise me that they extend to Arrow as well. It’s incredibly frustrating to say the least.
> And the API is barely above exposing the file format. Nothing like "load this parquet file" into some object which you can then query for its contents - you're dealing with blocks and sections and other file-format-level entities.
Yes, it's just a building block. The easiest way to use Parquet on Java is to use Spark's integration with it, because it provides the query engine for you. But, if I'm not mistaken, it's much bigger than Pandas.
I also strongly dislike the Hadoop coupling (hey, it's called parquet-mr for a reason...), but it's more or less an invisible annoyance if you use Spark in an environment like Amazon EMR.
I think you'll get a lot of confused responses from people if you want to try to build an application that reads Parquet directly. It's the disk format for a bunch of distributed database engines. They'll wonder how you plan on querying it.
I found parquet files to be slower than just serializing and compressing the pandas data frames. Haven't looked back since. Of course this was an older version of pandas so the DF to parquet functionality may be much improved.
A serialization format for Pandas isn't really the core use case of Parquet. Pandas will slurp the whole thing into memory, which doesn't take advantage of the columnar format or pushdown filtering features. It gets more interesting if you use it with a tool that uses them to avoid some large fraction of disk I/O when performing a selective query.
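For example, a selective read with pyarrow touches only the requested columns and can skip whole row groups via the filters argument - a rough sketch (file and column names are invented):

```python
import pyarrow.parquet as pq

table = pq.read_table(
    "events.parquet",
    columns=["user_id", "amount"],                 # column pruning: only these are read
    filters=[("event_date", ">=", "2020-01-01")],  # row-group skipping via column statistics
)                                                  # assumes event_date is a sortable string/date column
df = table.to_pandas()
```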
Considering the number of segfaults I've witnessed that appear to originate in relatively recent versions of the Parquet library, I don't think your instincts are misplaced. I genuinely worry about data corruption.
I've been meaning to take a closer look at ORC. As a Spark user, I just sort of defaulted into Parquet. ORC is very similar, though, and seemingly gives every indication of being the more mature product.
On balance I find writing your own tools is useful when your use cases are narrow but becoming a wizard in other tools is useful otherwise. I'm lucky that I get to define my own use cases.
Hi, Wes (Apache Arrow co-creator and Python pandas creator) here! If you're wondering what this project is all about, my JupyterCon keynote (18 min long) from 3 years ago is a good summary, and the vision/scope for what we've been doing since 2016 has been pretty consistent.
I'm a big fan of Pandas, and didn't know about Arrow. I've been considering doing a talk advocating for a consistent data-frame API across languages since, IMHO, it's the next fundamental data structure that should have baked-in support everywhere. So it appears you've at least somewhat beaten me to the punch.
Since Arrow is more than an API to tabular data structures, what would you think about a Promises/A+-like specification for dataframes?
How much of the Arrow API do you think end users will wind up using, as opposed to being a lower-level framework that projects like pandas and dplyr wind up using behind the scenes?
Finally, do you think that Arrow has the potential to be the logical successor to pandas? If not, what is your long term strategy to address the shortcomings that you see in pandas?
I use pandas every day, thank you for that. Just watched this keynote and I really like the vision; I currently work with a bunch of guys who prefer dplyr.
Is Arrow just for in-memory analytics, or are there plans to support in-database analytics too?
Thanks Wes, pandas and Arrow are great projects. Is Feather now ready for long-term storage with the V2 release? And now that it's just a renaming of the Arrow IPC format, what's its future?
You can store them long-term if you want (and you'll still be able to read them 5 years from now) but we aren't optimizing the Arrow IPC format for the _needs_ of long-term storage.
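For anyone who hasn't tried it, the Feather V2 path from Python looks roughly like this (a minimal sketch; the file name and compression choice are illustrative):

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
feather.write_feather(df, "data.feather", compression="zstd")  # V2 (Arrow IPC) format
df2 = feather.read_feather("data.feather")
```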
I’m also very confused about the relationship of Arrow’s stability guarantees and that of the on-disk feather. Can we safely switch from parquet to feather for long-term data storage?
Same question here - why Feather, and why is it named differently?
Also, are Parquet and Arrow the same? df.to_parquet('df.parquet.gzip', compression='gzip') will not use Arrow, I presume? I have to use a separate library to save to Parquet using Arrow. A bit confused.
Graphistry uses parquet as a more stable and thus persistent storage format when folks save data, and arrow for ephemeral internal data where we're ok (and somewhat enjoy) version changes, as that just means code version upgrades. There are performance differences in practice such as parquet having more per-column compression modes built in, making it attractive for colder storage, and arrow for in-memory/rpc/streaming/etc for similar reasons.
RE: Feather -- Arrow itself isn't necessarily a full file format -- you can imagine memory buffers being all over the heap with giant gaps in between b/c different cols were generated at different times -- but in practice folks will indeed serialize consolidated buffers to disk (pa.Table -> write stream -> file). If we couldn't do that, RPC wouldn't work ;-) My understanding of Feather is that it standardizes ideas around this consolidation, but we are able to save to disk (within versions) without it. We found it more predictable to stick to ~Parquet for storage and Arrow buffer passing for streaming, but now that Feather networking APIs for accelerated bulk transfers may be stabilizing, there may be speed advantages to using it over manual buffer streaming (and still sticking w/ Parquet for persistent files).
Arrow<>Parquet conversion is super fast b/c of the co-design around similar concepts: both using record batches of dense binary column buffers means implementations can pointer-copy, memory map, use bulk copy primitives, etc. for zero-copy or at least highly accelerated interop. Python RAPIDS GPU kernels can therefore selectively stream in a few parquet columns across many parquet files through a single 900GB/s GPU, compute over them, and write back out to arrow or a new parquet.
Congrats! And a big thank you to Wes for being such a devoted community leader!
We've been on quite the journey here: https://www.graphistry.com/blog/graphistry-2-29-5-upload-100... . Think json -> protobuf -> arrow, and paralleled in our work on parallel js -> opencl+js -> python rapids/cuda for accelerated native compute over it. The blogpost demos the ~100X bigger & faster dataset result of supporting Arrow for our uploader & RAPIDS for our parser when you can't and want us to convert for you.
Something folks miss with Arrow, IMO, is that it is like Google protobufs for everyone else. Arrow is not just a nice binary format, but is also ready out-of-the-box for streaming, rich datatypes, and, for larger/longer-term projects, standardized & automatically self-describing schemas. If you've had to manually decipher, maintain, and update generated protobuf schemas (because you aren't Google with all the internal ~protobuf tooling / integrations / infra / etc.), that should sound pretty good ;-) A lot more to do, but already way ahead of most other things, esp. in aggregate.
given the sheer number and scope of Apache projects, it would be nice to have the title mention what it is (eg, "Apache Arrow - in memory analytics - 1.0.0") esp when the linked page doesn't directly say.
Normally I would agree but Apache Arrow already has significant name-recognition to people who are likely to use it -- data engineers/scientists etc. It's a library that provides an in-memory columnar format, and supplies a data engine currently used in Apache Spark, Pandas, and libraries like turbodbc, which helps these tools achieve high performance on operations on tabular data.
Having a single high-performance in-memory format means different programs can read/write from the same source without serializing/copying/deserializing. For instance, if you wanted to pass a huge table of data from R to Java to Python (because your tools span different languages), normally you'd have to copy and serde (Protobuf? JSON?) to pass the data, which equals huge overheads. With Arrow, each of those languages can directly interact with the same copy of in-memory data, in-process -- with the highest possible performance.
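A minimal sketch of that idea from Python, writing the Arrow IPC file format and then memory-mapping it so the reader's buffers reference the file rather than a copy (file name is illustrative):

```python
import pyarrow as pa

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# Write the table as an Arrow IPC file.
with pa.OSFile("shared.arrow", "wb") as sink:
    writer = pa.ipc.new_file(sink, table.schema)
    writer.write_table(table)
    writer.close()

# Memory-map it back: the resulting buffers point into the mapped file,
# so another process (or another language's Arrow library) can do the same.
with pa.memory_map("shared.arrow", "r") as source:
    shared = pa.ipc.open_file(source).read_all()
```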
You also get the performance of columnar databases without implementing your own columnar data structure.
But of course, no harm adding a short description to the title to broaden its audience. Arrow is truly something amazing and the more people know about it the better.[1] Folks who program against traditional databases might not know about it, and I think they should, especially if they need to generate analytics (i.e. fast filtering/aggregation for dashboarding or for data pipeline tasks).
I spend 8+ hours a day working with/ingesting data. Had no clue. Checked with my data-using colleagues just to see if any of them had heard of Arrow - about 50% had. These are people who spend their entire day working with/ingesting data in various pipelines. So, while I won't comment on whether the subtitle would be useful, I'm pretty confident that the (perhaps not large) majority of people who would/could use Apache Arrow had not heard of it.
A lot of people just stay in their lane. Having a solid description, like the one you provided, would be super useful. Perhaps a tool-tip feature of HN, even.
This is just a mismatch with how HN posts pages and how projects expect their release pages to be read. Release pages are for users to know what changed. But HN likes to link to original sources and not articles about the source. In cases like this, we end up pretending the release page is meant to be read by a wider audience than the writers ever intended it for - not Apache's fault! Apache Arrow's homepage does have that one-sentence description on it ('A cross-language development platform for in-memory analytics').
A blog post announcing version 1.0 is a good opportunity for a one-liner to explain what the heck your project is :) Just in case your audience happens to be expanding instead of contracting...
Apache Arrow? First thing off the top of my head (given that I had to work with this these past few days): the Python Snowflake connector depends on it if you intend to fetch the results from Snowflake as a pandas or Dask dataframe. In other words - lots of places.
(contributor here) At my company our proprietary database layout is based on Arrow. Our workload is analytical and we are using it for read mostly workloads.
Just to be clear, Arrow is not an "in memory analytics" solution but rather an "in memory columnar data store" that might be useful when doing analytics.
I am eternally indebted to you for Pandas. Many thanks for that.
Are you talking about there being support for multiple language libraries like PyArrow or about there being multiple Apache projects that utilize Arrow like Parquet and Spark?
If not, I'm not following what sub-projects you are speaking about. As far as I know, Arrow is principally the Arrow Columnar Format and Arrow Flight with some other potentially interesting interfaces for compute kernels and CUDA devices.
The Arrow project contains implementations in multiple languages. Some of these languages contain code that can evaluate expressions against Arrow data, or even execute full queries. The C++ and Rust implementations contain query capabilities, and the Java implementation contains the Gandiva library that can delegate to C++ via JNI to evaluate expressions, for example.
Is there some documentation for this on the Arrow website somewhere? I've been looking for info on the "compute engine" that's mentioned in this 1.0 announcement but haven't found much.
In general, where's the best place to learn more about Arrow? I've approached it several times, and can find a lot about how to integrate it into other products, but none of the tools like the query engines that I would find very useful.
I checked recently, and the Rust implementation of arrow — parquet as well — is sadly still not usable on projects using a stable version of the compiler because it relies on specialization.
There are some Jira issues on this, but there doesn't seem to be a consensus on the way forward. Does someone have more information, is the general idea to wait for specialization to stabilise, or is there a plan, or even an intention, to stop relying on it?
Last I checked, when I tried the library, the blocker on stable was packed_simd, which provides vector instructions in Rust. I can imagine the arrow/datafusion guy isn't too keen on dropping vectorized instructions, as that would be giving up a huge advantage, and I imagine it's used liberally throughout the code.
As for stabilizing packed_simd, it's completely unclear to me when that will land in stable Rust. I recently had a project where I just ended up calling out to C code to handle vectorization.
> I recently had a project where I just ended up calling out to C code to handle vectorization.
packed_simd provides a convenient platform independent API to some subset of common SIMD operations. Rust's standard library does have pretty much everything up through AVX2 on x86 stabilized though: https://doc.rust-lang.org/core/arch/index.html --- So if you need vectorization on x86, Rust should hopefully have you covered.
If you need other platforms or AVX-512 though, then yeah, using either unstable Rust, C or Assembly is required.
The main blocker is specialization, as it's being used in several places, including the parquet and arrow crates. As this feature is unlikely to be stabilized, we'll need to replace it with something else, but so far that has been challenging.
Thanks for clarifying that the plan is to stop relying on it :) Is there a specific place/issue/ticket to discuss this? If time allows, I would be interested in helping out.
We've been using Arrow for a few years in our startup (link in profile) - it's been great as a common format for passing tabular data between Python and Javascript.
Haven't used the in-memory/zero-copy features as much, but as a binary, high-performance, typed format for the data analytics world it can't be beat. And now that the Feather format is basically the Arrow format, I expect to see it really take off as a common interchange and even storage format for medium- to even long-term projects.
(Also nice to see the reduced Python wheel size - a little bonus for the 1.0 release :) )
We extensively use Apache Arrow to store data files as Parquet files on S3. This is a cheap way to store data that doesn't require the query speed of a relational (or non-relational) database. The main advantage of Arrow is that it is a columnar format, and it loses no information in transit, unlike the nightmare of CSV files.
I just did some benchmarks and it's pretty similar for small files. The difference would only be noticeable if you're serializing a ton of small files.
Wes McKinney (Pandas creator and Arrow co-creator) will be giving an Arrow talk live this Thursday for those who want to learn more -> https://subsurfaceconf.com/summer2020
Apache Arrow is a nice in-memory data structure finding use in a wide variety of projects, especially in data science, thanks to its Feather format, which can be used to save a data frame from memory to disk.
This can be particularly useful for low-memory systems like ARM SBCs when conducting long-duration research. If you want to build Apache Arrow from source for ARM, I've written a How-To here[1].
Glad to see the 1.0.0 release. I'm hoping that it will help reduce resistance when I nudge people toward it. It's also good to see progress in the documentation. For a long time it's been very hard to dig into anything but the C++ or Python libraries.
As I try to move to Arrow/Flight over using JSON or binary formats like Protobuf, one thing I see missing is tooling or schema -> code generation. It would be nice to see some progress or roadmap in that direction as well.
Polyglot, zero-copy serialization protocols (Arrow, FlatBuffers, Cap'n Proto) are a trend I really appreciate.
I like to imagine a future where data is freed from the languages/tools used to operate on them. In-memory objects would then be views into data stored in shared memory (or disk), with the data easily read/manipulated from multiple languages.
It's basically a terrible joke at this point. There's no single Apache page helping you decide which one you want, and they all seem to have such large overlap. Most of them seem to have bad documentation, and give the appearance of not really being maintained.
This puts me off even trying to use them. If there's this much scope creep/NIH/reinventing the wheel happening across the board, I can't imagine how bad each product is individually.
When I first saw Arrow I thought the biggest benefit was a good file format. (Feather, I think, is the format.) However, it seems to be designed for passing tables between languages in the same process.
If I never use multiple languages in the same process can I safely ignore it or are there other benefits?
In Python, we use Arrow a lot for going between diff frameworks: file -> ( pandas <> cudf <> cugraph ). When we work with DB vendors, we now push them to provide Arrow instead of whatever proprietary in-house thing they've been pushing for the same, e.g., slow JSON / ODBC, untyped CSV, or some ad-hoc wire protocol. Also, we use it for going between our services within a server + across servers.
On the nodejs services + frontend JS side, there was no real equiv tool for interop -- traditional soln is slowly round-tripping through SQL/ORM or manually doing protobuf through some sort of pubsub -- so Arrow is part of how we have been taming that mess too. It'll be a longer journey for Arrow or something like it to get adopted in JS land, but I can see folks doing TypeScript and serverless wanting it for a cleaner solution to typed data, faster serialization & streaming, etc. (There is no true JS equiv of pandas nor a typed variant.) We were a bit early here b/c we wanted streaming through WebGL + OpenCL/CUDA, and while we are fans of typed data, found protobuf tooling to be too unintegrated and manual in practice.
How successful have you been in pushing database vendors to use arrow on the wire? We've started using turbodbc (to connect to SQL Server), but as I understand it the data on the wire is still in odbc's row-oriented wire format, and turbodbc is responsible for packing into arrow column oriented buffers.
As you mention in your second paragraph, arrow is perfect for getting typed query results from a database server directly to a web frontend with minimal overhead. With the Transferable interface, it's even possible to zero-copy transfer the arrow data from the network buffer to a webworker.
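For context, the turbodbc path mentioned above looks roughly like this from Python (the DSN and query are placeholders; as noted, the bytes still cross the wire in ODBC's row format before turbodbc packs them into Arrow buffers):

```python
from turbodbc import connect

connection = connect(dsn="my_sql_server")   # placeholder DSN
cursor = connection.cursor()
cursor.execute("SELECT user_id, amount FROM payments")
table = cursor.fetchallarrow()              # result set as a pyarrow.Table
df = table.to_pandas()
```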
HDF I've not used in any production workflow. Parquet I've really only used with Impala. If I'm working in that system I'm in Spark already, typically.
I wish Excel / Google Sheets supported Parquet export. I've worked with CSV files and Parquet files and found the latter a vast improvement. Chunking data is trivial, you have fixed-size data types, and existing types are well-supported across a multitude of frameworks.
In comparison, CSVs are pretty much unstructured text files with added suggestions.
More pipelines should natively support Parquet files, or something like it, and have a thin CSV to Parquet conversion layer on top.
The docs are dire. Lots and lots about internals and how it works, and after a few minutes of searching, I've yet to see what it actually does or what I'd want to use it for. The "use cases" are even overly technical things that are building blocks for ... something. What is that something?
If you are already doing Big Data / ML / deep learning work in Spark or Dask, Arrow is (with caveats) more or less a drop-in performance optimization (because it's optimized for in-memory vectors) and complexity reducer (because it does all the interop between JVM, Python, TF, etc)
So by itself it doesn't necessarily introduce new use cases that weren't there before, except where performance was a bottleneck.
But we're talking like 500x performance increase on workloads vs ODBC, so night and day for a lot of data scientist pipelines.
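As a rough illustration of the Spark side (the config key shown is the Spark 3.x name; older releases use spark.sql.execution.arrow.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

sdf = spark.range(1_000_000)
pdf = sdf.toPandas()   # columnar transfer via Arrow instead of row-by-row pickling
```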
Judging from the faq (https://arrow.apache.org/faq/), it seems like it's a storage format for large structured datasets, designed specifically for use in-memory and for serialization across the network. It provides a spec for the data format, as well as implementations for several programming languages.
That seems pretty straightforward, what part was confusing for you?
It boils down to a standard columnar format usable from disk or memory that is useful for data interchange between languages or frameworks, as well as any use case where a columnar format alone has benefits.
judging from the comments here, it's an in-memory (in-process?) columnar database format without a DBMS attached. sounds useful when sqlite is too slow for your aggregations or if you do processing in X (C, Java) and visualization in Y (R, Python) - just guessing, haven't used it.
Can you talk about how you use it? At a cursory glance, it seems like it's being used to great effect in databases, or systems designed to query data. Is that what you're using it for?
From my experience, Parquet files are a better alternative to CSV or SQLite files for data at rest. I've observed that Parquet files tend to be smaller than Feather files (which are essentially the Arrow format in a file) for the same data. Arrow is great for in-memory data, but even with compression, the on-disk size of Feather files tends to be quite a bit bigger than Parquet files.
"for interchange" in data science might well mean data on-the-move rather than data at-rest: i.e. a pipeline in two stages, step A in python and step B in R. The data is exchanged (relatively) quickly via disk, and speed beats compression.
Even with fast SSDs, memcpy() is roughly an order of magnitude faster than writing to disk. Using a fast compression algorithm like lz4, zstd, etc can help reduce that gap. And this is invariant of whether the serialization format is Arrow or Parquet. It’s not either, or. If you’re writing to disk, with compression you can have both speed and reduction in the disk footprint.
Big fan of Arrow (and Parquet) and we use it at UrbanLogiq extensively. It's a wonderful toolset and great for data interop across multiple languages and environments!
I used Apache Arrow to store and process a relatively large amount of data, 2-3B records with ~10 fields each (still fits on one machine, but it's large enough that I have to think about it a little bit).
I was originally using a simple binary format for fast decoding, and switched to Arrow to be able to select only a few columns at a time. I was impressed by the speed gains, and the size benefits of storing data in column order.
The one thing I wish Arrow had was a way to attach some metadata to its files (if there is, I haven't found it). I originally tried to write a small header before starting to write the Arrow data, but that made it impossible to read back the data: as far as I can tell Arrow looks at the whole file and stores its column definition at the end of the file and computes data offsets based on the start of the file, meaning that there's nowhere left for me to store anything.
It's still a very promising library and I'll definitely be checking out the 1.0 release.
pyarrow's metadata in the header is for stuffing in uninterpreted bytes, so we put semantic stringified JSON data there for version numbers etc. that are beyond the data-rep typing. Super useful when maturing from localized notebook code to software services standardized on Arrow interop.
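A minimal sketch of that pattern with pyarrow (the keys and values are made up; schema metadata is plain bytes, so stringified JSON works fine):

```python
import json
import pyarrow as pa
import pyarrow.feather as feather

table = pa.table({"x": [1.0, 2.0, 3.0]})
custom = {
    b"app_version": b"1.4.2",                            # made-up keys/values
    b"semantics": json.dumps({"x": "metres"}).encode(),
}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **custom})

feather.write_feather(table, "data.feather")
print(feather.read_table("data.feather").schema.metadata)  # custom keys come back
```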
HDF5 has no support for dataframes/tables; projects like PyTables have been built on top of it, but these are non-standard and don't work across languages.
So, maybe I'm stupid and never faced enough big data, but what's the advantage of this versus a custom script that queries A and inserts into B?
I do that kind of stuff all the time with Go and it's pretty fast with 20~40 million records, averaging 100 KB each. Are those tools oriented to billions, instead of millions? What are the benefits?
* Standardizes binary interop and "serialization" of large structured data, removing all conversions / serialization at ingest and export boundaries. This alone can mean > 2-100x performance improvement in an application that processes a lot of data
* The Arrow in-memory format is an ideal data structure to code analytical algorithms against.
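As a small illustration of the second point, newer pyarrow releases ship compute kernels that run directly against the columnar buffers, no pandas round-trip needed (column names here are invented):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"amount": [10.0, 25.0, 5.0], "region": ["eu", "us", "eu"]})
mask = pc.equal(table.column("region"), "eu")
eu_total = pc.sum(pc.filter(table.column("amount"), mask))
print(eu_total.as_py())  # 15.0
```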
Check out the post from the author of pandas on pandas, pandas2 and Arrow [1].
This might answer your question about the significance of Arrow, given that today pandas is kind of a basic ingredient in scientific computing, AI and ML.
I see Arrow as a "data frame server" and data frames have a higher level of abstraction for use in analysis. But you're right that raw processing can be done with most languages natively much faster. But Arrow can let a Python statistician and an R statistician compare apples to apples easily.
That's precisely it -- you don't have to use protobuf etc. because you don't have to serialize/deserialize at all.
Languages that use Arrow can interact with the source data directly in-memory, in-process. No need to move any data around. The fastest serde operation is one that you don't have to do at all.
Flatbuffers is a low-level building block for binary data serialization. It is not adapted to the representation of large, structured, homogeneous data, and does not sit at the right abstraction layer for data analysis tasks.
Arrow is a data layer aimed directly at the needs of data analysis, providing a comprehensive collection of data types required for analytics, built-in support for “null” values (representing missing data), and an expanding toolbox of I/O and computing facilities.
The Arrow file format does use Flatbuffers under the hood to serialize schemas and other metadata needed to implement the Arrow binary IPC protocol, but the Arrow data format uses its own representation for optimal access and computation.
Zero-copy serialisation is offered by Flatbuffers - and one additionally has the advantage of an IDL as a contract between different technology platforms.
At my company, we query from A and store the data in PyArrow Parquet files on S3. Using Athena, this data can then be queried directly. This is a cheap way to store data that doesn't need the performance of a database server.
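A rough sketch of that pattern, assuming S3 access is already configured (the bucket, paths, and partition column are placeholders; older pyarrow versions may need an explicit filesystem object such as s3fs):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"event_date": ["2020-07-01", "2020-07-02"], "amount": [10.5, 3.2]})
table = pa.Table.from_pandas(df)

# Partitioned layout that Athena/Glue can point at and query in place.
pq.write_to_dataset(
    table,
    root_path="s3://my-bucket/events/",   # placeholder bucket/prefix
    partition_cols=["event_date"],
)
```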
Arrays of structs fit pretty naturally into many languages, but structs of arrays (and keeping corresponding values in order) are unusual enough to need plumbing that probably isn't there.