Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Add Avro library #17

Closed
zhjwpku opened this issue Dec 18, 2024 · 17 comments · Fixed by #34
Closed

[Discussion] Add Avro library #17

zhjwpku opened this issue Dec 18, 2024 · 17 comments · Fixed by #34

Comments

@zhjwpku
Copy link
Collaborator

zhjwpku commented Dec 18, 2024

Iceberg's manifests files are stored using Apache Avro, I just give a quick look at the avro's code base, there is a C++ implementation and a C implementation, as the document stated,

  • C implementation has been tested on MacOSX and Linux, not mentioning Windows
  • C++ implementation seems support all three OS platforms we are targeting.

C implementation has a dependence lib jansson while C++ implementation need Boost

I'd like to know your opinions about what's the best way to integrate Avro into iceberg-cpp.

[0] https://avro.apache.org/docs/1.12.0/api/c/
[1] https://avro.apache.org/docs/1.12.0/api/cpp/html/#Installing

@pitrou
Copy link
Member

pitrou commented Dec 18, 2024

I'm surprised that Avro C++ depends so much on Boost. Our experience in Arrow C++ has been that a Boost dependency is a headache for a library (such as Arrow or Avro), and since then we have completely abandoned Boost apart from unit tests.

Apart from the header-only libraries of Boost, Avro C++ requires filesystem, iostreams, system and program_options libraries.

@Fokko
Copy link
Contributor

Fokko commented Dec 18, 2024

For PyIceberg we made a bold decision to re-implement the Avro reader/writer which was pretty fun to implement. We also made the reader rely on an Iceberg schema, rather than an Avro schema. This gave some nice advantages, such as native field-ID-based projections and initial defaults that makes it easier to project Iceberg V1 metadata into V2.

@Fokko
Copy link
Contributor

Fokko commented Dec 18, 2024

cc @wgtmac I know that he was also working on removing boost from the C++ library.

@pitrou
Copy link
Member

pitrou commented Dec 18, 2024

For PyIceberg we made a bold decision to re-implement the Avro reader/writer which was pretty fun to implement.

In Python or in C++? :)

@Fokko
Copy link
Contributor

Fokko commented Dec 18, 2024

In Cython :) The performance of the zig-zag (de/en)coding of ints was pretty terrible in Python.

@pitrou
Copy link
Member

pitrou commented Dec 18, 2024

We only care about the binary encoding of Avro, right?

@Fokko
Copy link
Contributor

Fokko commented Dec 18, 2024

Yes, that's correct 👍

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Dec 18, 2024

I was thinking about picking some files we need from Avro and implement our own. Since PyIceberg goes this way, I will try to dig more in this direction, thanks for the inputs :)

@wgtmac
Copy link
Member

wgtmac commented Dec 18, 2024

I have created https://issues.apache.org/jira/browse/AVRO-4095 to work on removing the boost dependency from non-testing code path (at least binary-encoding). I'm inclined to use the C++ library with boost dependency removed. I happened to see that DuckDB has leveraged the Avro C library for its Avro extension: https://duckdb.org/2024/12/09/duckdb-avro-extension.html. So I don't think we need to reinvent a Avro C++ library at the moment.

I was thinking about picking some files we need from Avro and implement our own.

I don't think it is a good approach. We can define a good interface for Avro reader/writer in the iceberg-core library and provide a default implementation via iceberg-avro library with help of the open source Avro C++ library.

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Dec 19, 2024

I have created https://issues.apache.org/jira/browse/AVRO-4095 to work on removing the boost dependency from non-testing code path (at least binary-encoding). I'm inclined to use the C++ library with boost dependency removed.

Yeah, removing the boost dependency is better, will keep an eye on it, thanks

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Dec 28, 2024

I have created https://issues.apache.org/jira/browse/AVRO-4095 to work on removing the boost dependency from non-testing code path (at least binary-encoding). I'm inclined to use the C++ library with boost dependency removed.

I studied the avro c++ code base a little bit, it seems to me that the boost/iostreams dependency in DataFile.cc is the most difficult part to get rid of, do you have any solution in mind?

@wgtmac
Copy link
Member

wgtmac commented Dec 28, 2024

Yes, I also share the same feeling. Boost dependencies listed below are tightly used for gzip/zlib codec. I think it needs non-trivial work to replace them with the standard zlib library.

boost/crc.hpp
boost/iostreams/device/file.hpp
boost/iostreams/filter/gzip.hpp
boost/iostreams/filter/zlib.hpp
boost/iostreams/filtering_stream.hpp

I think we can use macro to enable/disable compiling the zlib codec in the beginning so that we can get rid of boost/iostreams without zlib codec in iceberg-cpp. Later we can add zlib support (like the snappy implementation) to the Avro C++ library.

@wgtmac
Copy link
Member

wgtmac commented Jan 14, 2025

Update: I have removed most of the Boost dependency from Apache Avro C++ library: https://issues.apache.org/jira/browse/AVRO-4095. Now the one last thing is to replace the Boost zlib/gzip implementation with the one from https://zlib.net/

@wgtmac
Copy link
Member

wgtmac commented Jan 17, 2025

After apache/avro#3293 is merged, I think we are ready to use it. @zhjwpku

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Jan 17, 2025

After apache/avro#3293 is merged, I think we are ready to use it. @zhjwpku

Great, will keep an eye on it, do you want me to import avro to iceberg-cpp or you do it yourself?

@wgtmac
Copy link
Member

wgtmac commented Jan 17, 2025

I believe I have a lot of TODOs on my end, especially the internal project from my employer :)

It would be great if you have time to do it.

@zhjwpku
Copy link
Collaborator Author

zhjwpku commented Jan 17, 2025

I believe I have a lot of TODOs on my end, especially the internal project from my employer :)

It would be great if you have time to do it.

Ok, I will prepare a PR later ;)

zhjwpku added a commit to zhjwpku/iceberg-cpp that referenced this issue Jan 18, 2025
This closes apache#17

The avro_unittest is added for testing if libavrocpp works, once
we supported Manifest, this should be removed.

Signed-off-by: Junwang Zhao <[email protected]>
zhjwpku added a commit to zhjwpku/iceberg-cpp that referenced this issue Jan 18, 2025
This closes apache#17

The avro_unittest is added for testing if libavrocpp works, once
we supported Manifest, this should be removed.

Signed-off-by: Junwang Zhao <[email protected]>
zhjwpku added a commit to zhjwpku/iceberg-cpp that referenced this issue Jan 18, 2025
This closes apache#17

The avro_unittest is added for testing if libavrocpp works, once
we supported Manifest, this should be removed.

Signed-off-by: Junwang Zhao <[email protected]>
zhjwpku added a commit to zhjwpku/iceberg-cpp that referenced this issue Jan 18, 2025
This closes apache#17

The avro_unittest is added for testing if libavrocpp works, once
we supported Manifest, this should be removed.

Signed-off-by: Junwang Zhao <[email protected]>
zhjwpku added a commit to zhjwpku/iceberg-cpp that referenced this issue Jan 18, 2025
This closes apache#17

The avro_unittest is added for testing if libavrocpp works, once
we supported Manifest, this should be removed.

Signed-off-by: Junwang Zhao <[email protected]>
zhjwpku added a commit to zhjwpku/iceberg-cpp that referenced this issue Jan 18, 2025
This closes apache#17

The avro_unittest is added for testing if libavrocpp works, once
we supported Manifest, this should be removed.

Signed-off-by: Junwang Zhao <[email protected]>
@Fokko Fokko closed this as completed in #34 Jan 30, 2025
@Fokko Fokko closed this as completed in c0bf8ce Jan 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants