Commit 33a3e6f

Add Notebook for Loading Data to NestedPandas (#85)
* Add Notebook for Loading Data to NestedPandas
* Clear notebook output
* Run pre-commit hooks
* Address review comments
1 parent 3dea29f commit 33a3e6f

1 file changed

+289
-0
lines changed

@@ -0,0 +1,289 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Loading Data into Nested-Pandas"
8+
]
9+
},
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"With a valid Python environment, nested-pandas and it's dependencies are easy to install using the `pip` package manager. The following command can be used to install it:"
15+
]
16+
},
17+
{
18+
"cell_type": "code",
19+
"execution_count": null,
20+
"metadata": {},
21+
"outputs": [],
22+
"source": [
23+
"# % pip install nested-pandas"
24+
]
25+
},
26+
{
27+
"cell_type": "code",
28+
"execution_count": null,
29+
"metadata": {},
30+
"outputs": [],
31+
"source": [
32+
"from nested_pandas.datasets import generate_parquet_file\n",
33+
"from nested_pandas import NestedFrame\n",
34+
"from nested_pandas import read_parquet\n",
35+
"\n",
36+
"import os\n",
37+
"import pandas as pd\n",
38+
"import tempfile"
39+
]
40+
},
41+
{
42+
"cell_type": "markdown",
43+
"metadata": {},
44+
"source": [
45+
"# Loading Data from Dictionaries\n",
46+
"Nested-Pandas is tailored towards efficient analysis of nested datasets, and supports loading data from multiple sources.\n",
47+
"\n",
48+
"We can use the `NestedFrame` constructor to create our base frame from a dictionary of our columns.\n",
49+
"\n",
50+
"We can then create an addtional pandas dataframes and pack them into our `NestedFrame` with `NestedFrame.add_nested`"
51+
]
52+
},
53+
{
54+
"cell_type": "code",
55+
"execution_count": null,
56+
"metadata": {},
57+
"outputs": [],
58+
"source": [
59+
"nf = NestedFrame(data={\"a\": [1, 2, 3], \"b\": [2, 4, 6]}, index=[0, 1, 2])\n",
60+
"\n",
61+
"nested = pd.DataFrame(\n",
62+
" data={\"c\": [0, 2, 4, 1, 4, 3, 1, 4, 1], \"d\": [5, 4, 7, 5, 3, 1, 9, 3, 4]},\n",
63+
" index=[0, 0, 0, 1, 1, 1, 2, 2, 2],\n",
64+
")\n",
65+
"\n",
66+
"nf = nf.add_nested(nested, \"nested\")\n",
67+
"nf"
68+
]
69+
},
70+
{
71+
"cell_type": "markdown",
72+
"metadata": {},
73+
"source": [
74+
"# Loading Data from Parquet Files"
75+
]
76+
},
77+
{
78+
"cell_type": "markdown",
79+
"metadata": {},
80+
"source": [
81+
"For larger datasets, we support loading data from parquet files.\n",
82+
"\n",
83+
"In the following cell, we generate a series of temporary parquet files with random data, and ingest them with the `read_parquet` method.\n",
84+
"\n",
85+
"First we load each file individually as its own data frame to be inspected. Then we use `read_parquet` to create the `NestedFrame` `nf`."
86+
]
87+
},
88+
{
89+
"cell_type": "code",
90+
"execution_count": null,
91+
"metadata": {},
92+
"outputs": [],
93+
"source": [
94+
"base_df, nested1, nested2 = None, None, None\n",
95+
"nf = None\n",
96+
"\n",
97+
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
98+
"# You can of course remove this and use your own directory and real files on your system.\n",
99+
"with tempfile.TemporaryDirectory() as temp_path:\n",
100+
" # Generates parquet files with random data within our temporary directorye.\n",
101+
" generate_parquet_file(10, {\"nested1\": 100, \"nested2\": 10}, temp_path, file_per_layer=True)\n",
102+
"\n",
103+
" # Read each individual parquet file into its own dataframe.\n",
104+
" base_df = read_parquet(os.path.join(temp_path, \"base.parquet\"))\n",
105+
" nested1 = read_parquet(os.path.join(temp_path, \"nested1.parquet\"))\n",
106+
" nested2 = read_parquet(os.path.join(temp_path, \"nested2.parquet\"))\n",
107+
"\n",
108+
" # Create a single NestedFrame packing multiple parquet files.\n",
109+
" nf = read_parquet(\n",
110+
" data=os.path.join(temp_path, \"base.parquet\"),\n",
111+
" to_pack={\n",
112+
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n",
113+
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n",
114+
" },\n",
115+
" )"
116+
]
117+
},
118+
{
119+
"cell_type": "markdown",
120+
"metadata": {},
121+
"source": [
122+
"When examining the individual tables for each of our parquet files we can see that:\n",
123+
"\n",
124+
"a) they all have different dimensions\n",
125+
"b) they have shared indices"
126+
]
127+
},
128+
{
129+
"cell_type": "code",
130+
"execution_count": null,
131+
"metadata": {},
132+
"outputs": [],
133+
"source": [
134+
"# Print the dimensions of all of our underlying tables\n",
135+
"print(\"Our base table 'base.parquet' has shape:\", base_df.shape)\n",
136+
"print(\"Our first nested table table 'nested1.parquet' has shape:\", nested1.shape)\n",
137+
"print(\"Our second nested table table 'nested2.parquet' has shape:\", nested2.shape)\n",
138+
"\n",
139+
"# Print the unique indices in each table:\n",
140+
"print(\"The unique indices in our base table are:\", base_df.index.values)\n",
141+
"print(\"The unique indices in our first nested table are:\", nested1.index.unique())\n",
142+
"print(\"The unique indices in our second nested table are:\", nested2.index.unique())"
143+
]
144+
},
145+
{
146+
"cell_type": "markdown",
147+
"metadata": {},
148+
"source": [
149+
"So inspect `nf`, a `NestedFrame` we created from our call to `read_parquet` with the `to_pack` argument, we're able to pack nested parquet files according to the shared index values with the index in `base.parquet`.\n",
150+
"\n",
151+
"The resulting `NestedFrame` having the same number of rows as `base.parquet` and with `nested1.parquet` and `nested2.parquet` packed into the 'nested1' and 'nested2' columns respectively."
152+
]
153+
},
154+
{
155+
"cell_type": "code",
156+
"execution_count": null,
157+
"metadata": {},
158+
"outputs": [],
159+
"source": [
160+
"nf"
161+
]
162+
},
163+
{
164+
"cell_type": "markdown",
165+
"metadata": {},
166+
"source": [
167+
"Since we loaded each individual parquet file into its own dataframe, we can also verify that using `read_parquet` with the `to_pack` argument is equivalent to the following method of packing the dataframes directly with `NestedFrame.add_nested`"
168+
]
169+
},
170+
{
171+
"cell_type": "markdown",
172+
"metadata": {},
173+
"source": [
174+
"# Packing Together Existing Dataframes Into a NestedFrame"
175+
]
176+
},
177+
{
178+
"cell_type": "code",
179+
"execution_count": null,
180+
"metadata": {},
181+
"outputs": [],
182+
"source": [
183+
"NestedFrame(base_df).add_nested(nested1, \"nested1\").add_nested(nested2, \"nested2\")"
184+
]
185+
},
186+
{
187+
"cell_type": "markdown",
188+
"metadata": {},
189+
"source": [
190+
"# Saving NestedFrames to Parquet Files\n",
191+
"\n",
192+
"Additionally we can save an existing `NestedFrame` as one of more parquet files using `NestedFrame.to_parquet``\n",
193+
"\n",
194+
"When `by_layer=True` we save each individual layer of the NestedFrame into its own parquet file in a specified output directory.\n",
195+
"\n",
196+
"The base layer will be outputted to \"base.parquet\", and each nested layer will be written to a file based on its column name. So the nested layer in column `nested1` will be written to \"nested1.parquet\"."
197+
]
198+
},
199+
{
200+
"cell_type": "code",
201+
"execution_count": null,
202+
"metadata": {},
203+
"outputs": [],
204+
"source": [
205+
"restored_nf = None\n",
206+
"\n",
207+
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
208+
"# You can of course remove this and use your own directory and real files on your system.\n",
209+
"with tempfile.TemporaryDirectory() as temp_path:\n",
210+
" nf.to_parquet(\n",
211+
" temp_path, # The directory to save our output parquet files.\n",
212+
" by_layer=True, # Save each layer of the NestedFrame to its own parquet file.\n",
213+
" )\n",
214+
"\n",
215+
" # List the files in temp_path to ensure they were saved correctly.\n",
216+
" print(\"The NestedFrame was saved to the following parquet files :\", os.listdir(temp_path))\n",
217+
"\n",
218+
" # Read the NestedFrame back in from our saved parquet files.\n",
219+
" restored_nf = read_parquet(\n",
220+
" data=os.path.join(temp_path, \"base.parquet\"),\n",
221+
" to_pack={\n",
222+
" \"nested1\": os.path.join(temp_path, \"nested1.parquet\"),\n",
223+
" \"nested2\": os.path.join(temp_path, \"nested2.parquet\"),\n",
224+
" },\n",
225+
" )\n",
226+
"\n",
227+
"restored_nf # our dataframe is restored from our saved parquet files"
228+
]
229+
},
230+
{
231+
"cell_type": "markdown",
232+
"metadata": {},
233+
"source": [
234+
"We also support saving a `NestedFrame` as a single parquet file where the packed layers are still packed in their respective columns.\n",
235+
"\n",
236+
"Here we provide `NestedFrame.to_parquet` with the desired path of the *single* output file (rather than the path of a directory to store *multiple* output files) and use `per_layer=False'\n",
237+
"\n",
238+
"Our `read_parquet` function can load a `NestedFrame` saved in this single file parquet without requiring any additional arguments. "
239+
]
240+
},
241+
{
242+
"cell_type": "code",
243+
"execution_count": null,
244+
"metadata": {},
245+
"outputs": [],
246+
"source": [
247+
"restored_nf_single_file = None\n",
248+
"\n",
249+
"# Note: that we use the `tempfile` module to create and then cleanup a temporary directory.\n",
250+
"# You can of course remove this and use your own directory and real files on your system.\n",
251+
"with tempfile.TemporaryDirectory() as temp_path:\n",
252+
" output_path = os.path.join(temp_path, \"output.parquet\")\n",
253+
" nf.to_parquet(\n",
254+
" output_path, # The filename to save our NestedFrame to.\n",
255+
" by_layer=False, # Save the entire NestedFrame to a single parquet file.\n",
256+
" )\n",
257+
"\n",
258+
" # List the files within our temp_path to ensure that we only saved a single parquet file.\n",
259+
" print(\"The NestedFrame was saved to the following parquet files :\", os.listdir(temp_path))\n",
260+
"\n",
261+
" # Read the NestedFrame back in from our saved single parquet file.\n",
262+
" restored_nf_single_file = read_parquet(output_path)\n",
263+
"\n",
264+
"restored_nf_single_file # our dataframe is restored from a single saved parquet file"
265+
]
266+
}
267+
],
268+
"metadata": {
269+
"kernelspec": {
270+
"display_name": "Python 3",
271+
"language": "python",
272+
"name": "python3"
273+
},
274+
"language_info": {
275+
"codemirror_mode": {
276+
"name": "ipython",
277+
"version": 3
278+
},
279+
"file_extension": ".py",
280+
"mimetype": "text/x-python",
281+
"name": "python",
282+
"nbconvert_exporter": "python",
283+
"pygments_lexer": "ipython3",
284+
"version": "3.11.9"
285+
}
286+
},
287+
"nbformat": 4,
288+
"nbformat_minor": 2
289+
}
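
The notebook describes packing nested tables into a base table "according to the shared index values". As a rough illustration of that idea only, the sketch below groups flat rows under the base row that shares their index, using plain Python dictionaries and the same toy data as the notebook's first example. The helper name `pack_by_index` is hypothetical and is not part of nested-pandas; this is not how the library implements `add_nested` or `read_parquet`.

```python
from collections import defaultdict


def pack_by_index(base_rows, nested_rows):
    """Hypothetical helper: attach flat nested rows to the base row sharing their index."""
    grouped = defaultdict(list)
    for idx, row in nested_rows:
        grouped[idx].append(row)
    # One output row per base row; a base row with no matches gets an empty list.
    return {
        idx: {**cols, "nested": grouped.get(idx, [])}
        for idx, cols in base_rows.items()
    }


# Same toy data as the notebook's dictionary example.
base = {0: {"a": 1, "b": 2}, 1: {"a": 2, "b": 4}, 2: {"a": 3, "b": 6}}
nested_index = [0, 0, 0, 1, 1, 1, 2, 2, 2]
c = [0, 2, 4, 1, 4, 3, 1, 4, 1]
d = [5, 4, 7, 5, 3, 1, 9, 3, 4]
nested = [(i, {"c": ci, "d": di}) for i, ci, di in zip(nested_index, c, d)]

packed = pack_by_index(base, nested)
for idx, row in packed.items():
    print(idx, row)
```

The result has one entry per base row (three here), while the nine flat rows are distributed among them, which mirrors the observation in the notebook that the packed frame has the same number of rows as the base table.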
