-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Discussion] Add Avro library #17
Comments
I'm surprised that Avro C++ depends so much on Boost. Our experience in Arrow C++ has been that a Boost dependency is a headache for a library (such as Arrow or Avro), and since then we have completely abandoned Boost apart from unit tests.
|
For PyIceberg we made a bold decision to re-implement the Avro reader/writer which was pretty fun to implement. We also made the reader rely on an Iceberg schema, rather than an Avro schema. This gave some nice advantages, such as native field-ID-based projections and initial defaults that makes it easier to project Iceberg V1 metadata into V2. |
cc @wgtmac I know that he was also working on removing boost from the C++ library. |
In Python or in C++? :) |
In Cython :) The performance of the zig-zag (de/en)coding of ints was pretty terrible in Python. |
We only care about the binary encoding of Avro, right? |
Yes, that's correct 👍 |
I was thinking about picking some files we need from Avro and implement our own. Since PyIceberg goes this way, I will try to dig more in this direction, thanks for the inputs :) |
I have created https://issues.apache.org/jira/browse/AVRO-4095 to work on removing the boost dependency from non-testing code path (at least binary-encoding). I'm inclined to use the C++ library with boost dependency removed. I happened to see that DuckDB has leveraged the Avro C library for its Avro extension: https://duckdb.org/2024/12/09/duckdb-avro-extension.html. So I don't think we need to reinvent a Avro C++ library at the moment.
I don't think it is a good approach. We can define a good interface for Avro reader/writer in the iceberg-core library and provide a default implementation via iceberg-avro library with help of the open source Avro C++ library. |
Yeah, removing the boost dependency is better, will keep an eye on it, thanks |
I studied the avro c++ code base a little bit, it seems to me that the boost/iostreams dependency in DataFile.cc is the most difficult part to get rid of, do you have any solution in mind? |
Yes, I also share the same feeling. Boost dependencies listed below are tightly used for gzip/zlib codec. I think it needs non-trivial work to replace them with the standard zlib library.
I think we can use macro to enable/disable compiling the zlib codec in the beginning so that we can get rid of boost/iostreams without zlib codec in iceberg-cpp. Later we can add zlib support (like the snappy implementation) to the Avro C++ library. |
Update: I have removed most of the Boost dependency from Apache Avro C++ library: https://issues.apache.org/jira/browse/AVRO-4095. Now the one last thing is to replace the Boost zlib/gzip implementation with the one from https://zlib.net/ |
After apache/avro#3293 is merged, I think we are ready to use it. @zhjwpku |
Great, will keep an eye on it, do you want me to import avro to iceberg-cpp or you do it yourself? |
I believe I have a lot of TODOs on my end, especially the internal project from my employer :) It would be great if you have time to do it. |
Ok, I will prepare a PR later ;) |
This closes apache#17 The avro_unittest is added for testing if libavrocpp works, once we supported Manifest, this should be removed. Signed-off-by: Junwang Zhao <[email protected]>
This closes apache#17 The avro_unittest is added for testing if libavrocpp works, once we supported Manifest, this should be removed. Signed-off-by: Junwang Zhao <[email protected]>
This closes apache#17 The avro_unittest is added for testing if libavrocpp works, once we supported Manifest, this should be removed. Signed-off-by: Junwang Zhao <[email protected]>
This closes apache#17 The avro_unittest is added for testing if libavrocpp works, once we supported Manifest, this should be removed. Signed-off-by: Junwang Zhao <[email protected]>
This closes apache#17 The avro_unittest is added for testing if libavrocpp works, once we supported Manifest, this should be removed. Signed-off-by: Junwang Zhao <[email protected]>
This closes apache#17 The avro_unittest is added for testing if libavrocpp works, once we supported Manifest, this should be removed. Signed-off-by: Junwang Zhao <[email protected]>
Iceberg's manifests files are stored using Apache Avro, I just give a quick look at the avro's code base, there is a C++ implementation and a C implementation, as the document stated,
C implementation has a dependence lib jansson while C++ implementation need Boost
I'd like to know your opinions about what's the best way to integrate Avro into iceberg-cpp.
[0] https://avro.apache.org/docs/1.12.0/api/c/
[1] https://avro.apache.org/docs/1.12.0/api/cpp/html/#Installing
The text was updated successfully, but these errors were encountered: