separate packages
Han Wang authored Mar 25, 2022
1 parent c175e47 commit 9ba8081
Showing 340 changed files with 115 additions and 259 deletions.
64 changes: 55 additions & 9 deletions .github/workflows/build.yml
@@ -33,15 +33,15 @@ jobs:
- name: Install
run: |
pip install -e .
make installlocal
- name: Test
run: |
pytest
make test
#-------------------------------------------------------------------------------
build_wheels:
name: Build wheels on ${{ matrix.os }} ${{ matrix.cpv }}
build_cpp_wheels:
name: Build CPP wheels on ${{ matrix.os }} ${{ matrix.cpv }}
runs-on: ${{ matrix.os }}
if: github.event_name == 'release' && github.event.action == 'published' && !github.event.release.prerelease

@@ -76,15 +76,17 @@ jobs:
CIBW_BUILD: ${{ matrix.cpv }}
# Don't build osx 10.6 (no C++11 support)
CIBW_SKIP: "*macosx_10_6*"
FUGUE_SQL_BUILD_CPP: 1
run: |
python -m cibuildwheel --output-dir wheelhouse
- uses: actions/upload-artifact@v2
with:
name: cpp
path: ./wheelhouse/*.whl

#-------------------------------------------------------------------------------
build_sdist:
build_cpp_sdist:
name: Build source distribution
runs-on: ubuntu-latest
steps:
@@ -96,18 +96,62 @@
python-version: 3.8

- name: Build sdist
env:
FUGUE_SQL_BUILD_CPP: 1
run: python setup.py sdist

- uses: actions/upload-artifact@v2
with:
name: cpp
path: dist/*.tar.gz

build_py_sdist:
name: Build source distribution
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2

- uses: actions/setup-python@v2
name: Install Python
with:
python-version: 3.8

- name: Build sdist
env:
FUGUE_SQL_BUILD_CPP: 0
run: python setup.py sdist

- uses: actions/upload-artifact@v2
with:
name: py
path: dist/*.tar.gz

#-------------------------------------------------------------------------------
deploy:
deploy_cpp:
needs:
- test
- build_cpp_wheels
- build_cpp_sdist

runs-on: ubuntu-latest

# Only publish when a GitHub Release is created.
if: github.event_name == 'release' && github.event.action == 'published'
steps:
- uses: actions/download-artifact@v2
with:
name: cpp
path: dist

- uses: pypa/gh-action-pypi-publish@master
with:
user: __token__
password: ${{ secrets.PYPI_TOKEN }}

deploy_py:
needs:
- test
- build_wheels
- build_sdist
- build_py_sdist

runs-on: ubuntu-latest

@@ -116,7 +162,7 @@ jobs:
steps:
- uses: actions/download-artifact@v2
with:
name: artifact
name: py
path: dist

- uses: pypa/gh-action-pypi-publish@master
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -6,7 +6,8 @@ exclude: |
^tests/|
^docs/|
^scripts/|
^fugue_sql_antlr/_parser/
^fugue_sql_antlr/_parser/|
^fugue_sql_antlr_cpp/
)
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
2 changes: 0 additions & 2 deletions MANIFEST.in

This file was deleted.

2 changes: 2 additions & 0 deletions MANIFEST_.in
@@ -0,0 +1,2 @@
recursive-include fugue_sql_antlr_cpp *
recursive-exclude tests *
28 changes: 5 additions & 23 deletions Makefile
@@ -42,29 +42,11 @@ jupyter:
test:
python3 -bb -m pytest --reruns 2 --only-rerun 'Overflow in cast' --only-rerun 'Table or view not found' tests/

testcore:
python3 -bb -m pytest tests/fugue

testspark:
python3 -bb -m pytest --reruns 2 --only-rerun 'Table or view not found' tests/fugue_spark

testdask:
python3 -bb -m pytest tests/fugue_dask

testduck:
python3 -bb -m pytest --reruns 2 --only-rerun 'Overflow in cast' tests/fugue_duckdb

testsql:
python3 -bb -m pytest tests/fugue_sql

testibis:
python3 -bb -m pytest tests/fugue_ibis

testnotebook:
pip install .
jupyter nbextension install --user --py fugue_notebook
jupyter nbextension enable fugue_notebook --py
jupyter nbconvert --execute --clear-output tests/fugue_notebook/test_notebook.ipynb
installlocal:
# Each recipe line runs in its own shell, so "export" on its own line
# would not affect the following pip command; set the variable inline.
FUGUE_SQL_BUILD_CPP=0 pip install -e .
FUGUE_SQL_BUILD_CPP=1 pip install -e .

sql:
scripts/generate_code.sh
196 changes: 2 additions & 194 deletions README.md
@@ -10,198 +10,6 @@
| --- | --- | --- |
| [![Doc](https://readthedocs.org/projects/fugue/badge)](https://fugue.readthedocs.org) | [![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://fugue-tutorials.readthedocs.io/) | [![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://join.slack.com/t/fugue-project/shared_invite/zt-jl0pcahu-KdlSOgi~fP50TZWmNxdWYQ) |

# Fugue SQL Antlr Parser

**Fugue is a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites**. It is meant for:

* Data scientists/analysts who want to **focus on defining logic rather than worrying about execution**
* SQL-lovers wanting to use **SQL to define end-to-end workflows** in pandas, Spark, and Dask
* Data scientists using pandas wanting to take advantage of **Spark or Dask** with minimal effort
* Big data practitioners finding **testing code** to be costly and slow
* Data teams with big data projects that **struggle maintaining code**

## Select Features

* **Cross-framework code**: Write code once in native Python, SQL, or pandas, then execute it on Dask or Spark with no rewrites. Logic and execution are decoupled through Fugue, enabling users to leverage the Spark and Dask engines without learning the specific framework syntax.
* **Rapid iterations for big data projects**: Test code on smaller data, then reliably scale to Dask or Spark when ready. This accelerates project iteration time and reduces expensive mistakes.
* **Friendlier interface for Spark**: Users can get Python/pandas code running on Spark with significantly less effort compared to PySpark. FugueSQL extends SparkSQL to be a more complete programming language.
* **Highly testable code**: Fugue makes logic more testable because all code is written in native Python. Unit tests scale seamlessly from local workflows to distributed computing workflows.

## Fugue Transform

The simplest way to use Fugue is the [`transform()` function](https://fugue-tutorials.readthedocs.io/tutorials/beginner/introduction.html#fugue-transform). This lets users parallelize the execution of a single function by bringing it to Spark or Dask. In the example below, the `map_letter_to_food()` function takes in a mapping and applies it to a column. This is just pandas and Python so far (without Fugue).

```python
import pandas as pd
from typing import Dict

input_df = pd.DataFrame({"id": [0, 1, 2], "value": ["A", "B", "C"]})
map_dict = {"A": "Apple", "B": "Banana", "C": "Carrot"}

def map_letter_to_food(df: pd.DataFrame, mapping: Dict[str, str]) -> pd.DataFrame:
df["value"] = df["value"].map(mapping)
return df
```

Now, the `map_letter_to_food()` function is brought to the Spark execution engine by invoking Fugue's `transform()` function. The output `schema`, `params`, and `engine` are passed to the `transform()` call. The `schema` is needed because it is a requirement for Spark. A schema of `"*"` below means all input columns are in the output.

```python
from fugue import transform
from fugue_spark import SparkExecutionEngine

df = transform(input_df,
map_letter_to_food,
schema="*",
params=dict(mapping=map_dict),
engine=SparkExecutionEngine
)
df.show()
```
```rst
+---+------+
| id| value|
+---+------+
| 0| Apple|
| 1|Banana|
| 2|Carrot|
+---+------+
```

<details>
<summary>PySpark equivalent of Fugue transform</summary>

```python
from typing import Iterator, Union
from pyspark.sql.types import StructType
from pyspark.sql import DataFrame, SparkSession

spark_session = SparkSession.builder.getOrCreate()

def mapping_wrapper(dfs: Iterator[pd.DataFrame], mapping):
for df in dfs:
yield map_letter_to_food(df, mapping)

def run_map_letter_to_food(input_df: Union[DataFrame, pd.DataFrame], mapping):
# conversion
if isinstance(input_df, pd.DataFrame):
sdf = spark_session.createDataFrame(input_df.copy())
else:
sdf = input_df.copy()

schema = StructType(list(sdf.schema.fields))
return sdf.mapInPandas(lambda dfs: mapping_wrapper(dfs, mapping),
schema=schema)

result = run_map_letter_to_food(input_df, map_dict)
result.show()
```
</details>

This syntax is simpler, cleaner, and more maintainable than the PySpark equivalent. At the same time, no edits were made to the original pandas-based function to bring it to Spark. It is still usable on pandas DataFrames. Because the Spark execution engine was used, the returned `df` is now a Spark DataFrame. Fugue `transform()` also supports `DaskExecutionEngine` and the pandas-based `NativeExecutionEngine`.
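
As a minimal sketch of that engine flexibility (reusing `input_df`, `map_letter_to_food`, and `map_dict` from above; the `NativeExecutionEngine` import path is assumed here and may vary by Fugue version), the same call can run locally on pandas:

```python
from fugue import transform, NativeExecutionEngine

# Same call as the Spark example above, but executed on pandas;
# with a pandas input and the native engine, the result is a pandas DataFrame.
local_df = transform(input_df,
                     map_letter_to_food,
                     schema="*",
                     params=dict(mapping=map_dict),
                     engine=NativeExecutionEngine)
print(local_df)
```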

## [FugueSQL](https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html)

FugueSQL is a SQL-based language capable of expressing end-to-end workflows. The `map_letter_to_food()` function above is used in the SQL expression below. This is how to use a Python-defined transformer along with the standard SQL `SELECT` statement.

```python
from fugue_sql import fsql
import json

query = """
SELECT id, value FROM input_df
TRANSFORM USING map_letter_to_food(mapping={{mapping}}) SCHEMA *
PRINT
"""
map_dict_str = json.dumps(map_dict)

fsql(query, mapping=map_dict_str).run()
```

For FugueSQL, we can change the engine by passing it to the `run()` method: `fsql(query, mapping=map_dict_str).run("spark")`.

## Jupyter Notebook Extension

There is an accompanying notebook extension for FugueSQL that lets users use the `%%fsql` cell magic. The extension also provides syntax highlighting for FugueSQL cells. (Syntax highlighting is not yet available for JupyterLab.)

![FugueSQL gif](https://miro.medium.com/max/700/1*6091-RcrOPyifJTLjo0anA.gif)

The notebook environment can be set up by calling the `setup()` function in the first cell of a notebook:

```python
from fugue_notebook import setup
setup()
```

Note that you can automatically load the `fugue_notebook` IPython extension at startup;
read [this](https://ipython.readthedocs.io/en/stable/config/extensions/#using-extensions) to configure your Jupyter environment.
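
As a minimal sketch of that configuration, assuming the default IPython profile location (the path may differ in your environment):

```python
# ~/.ipython/profile_default/ipython_config.py
c = get_config()  # get_config() is provided by IPython inside config files

# Load the fugue_notebook extension automatically at startup
c.InteractiveShellApp.extensions = ["fugue_notebook"]
```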


## Installation

Fugue can be installed through pip by using:

```bash
pip install fugue
```

It also has the following extras:

* **sql**: to support [FugueSQL](https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html)
* **spark**: to support Spark as the [ExecutionEngine](https://fugue-tutorials.readthedocs.io/tutorials/advanced/execution_engine.html)
* **dask**: to support Dask as the [ExecutionEngine](https://fugue-tutorials.readthedocs.io/tutorials/advanced/execution_engine.html)
* **all**: install everything above

For example, a common use case is:

```bash
pip install fugue[sql,spark]
```

To install the notebook extension (after installing Fugue):

```bash
jupyter nbextension install --py fugue_notebook
jupyter nbextension enable fugue_notebook --py
```

## [Getting Started](https://fugue-tutorials.readthedocs.io/)

The best way to get started with Fugue is to work through the [tutorials](https://fugue-tutorials.readthedocs.io/).

The tutorials can also be run in an interactive notebook environment through binder or Docker:

### Using binder

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/fugue-project/tutorials/master)

**Note that it runs slowly on binder** because the binder machine is not powerful enough for a distributed framework such as Spark. Parallel executions can become sequential, so some of the performance comparison examples will not give correct numbers.

### Using Docker

Alternatively, you should get decent performance by running this Docker image on your own machine:

```bash
docker run -p 8888:8888 fugueproject/tutorials:latest
```

For the API docs, [click here](https://fugue.readthedocs.org).

## Further Resources

View some of our latest conference presentations and content. For a more complete list, check the [Resources](https://fugue-tutorials.readthedocs.io/en/latest/tutorials/resources.html) page in the tutorials.

### Blogs

* [Fugue: Reducing Spark Developer Friction (James Le)](https://jameskle.com/writes/fugue)
* [Introducing FugueSQL — SQL for Pandas, Spark, and Dask DataFrames (Towards Data Science by Khuyen Tran)](https://towardsdatascience.com/introducing-fuguesql-sql-for-pandas-spark-and-dask-dataframes-63d461a16b27)
* [Interoperable Python and SQL in Jupyter Notebooks (Towards Data Science)](https://towardsdatascience.com/interoperable-python-and-sql-in-jupyter-notebooks-86245e711352)
* [Using Pandera on Spark for Data Validation through Fugue (Towards Data Science)](https://towardsdatascience.com/using-pandera-on-spark-for-data-validation-through-fugue-72956f274793)

### Conferences

* [Large Scale Data Validation with Spark and Dask (PyCon US 2021)](https://www.youtube.com/watch?v=2AdvBgjO_3Q)
* [Dask SQL Query Engines (Dask Summit 2021)](https://www.youtube.com/watch?v=bQDN41Bc3bw)
* [Scaling Machine Learning Workflows to Big Data with Fugue (KubeCon 2021)](https://www.youtube.com/watch?v=fDIRMiwc0aA)

## Community and Contributing

Feel free to message us on [Slack](https://join.slack.com/t/fugue-project/shared_invite/zt-jl0pcahu-KdlSOgi~fP50TZWmNxdWYQ). We also have [contributing instructions](CONTRIBUTING.md).
This is the dedicated package for the Fugue SQL parser built on Antlr4.
2 changes: 1 addition & 1 deletion fugue_sql_antlr/_parser/sa_fugue_sql.py
@@ -89,7 +89,7 @@ def parse(stream:InputStream, entry_rule_name:str, sa_err_listener:SA_ErrorListe
#-------------------------------------------------------------------------------

try:
from . import sa_fugue_sql_cpp_parser
from fugue_sql_antlr_cpp import sa_fugue_sql_cpp_parser
except ImportError:
USE_CPP_IMPLEMENTATION = False

Empty file added fugue_sql_antlr_cpp/__init__.py
Empty file.
File renamed without changes.
File renamed without changes.
File renamed without changes.
