# Atio 🛡️

Python library for safe atomic file writing and database writing

🚀 `pip install atio`

[![Python](https://img.shields.io/badge/Python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)
[![PyPI](https://img.shields.io/badge/PyPI-2.0.0-orange.svg)](https://pypi.org/project/atio/)
---

## 📖 Overview

Atio is a lightweight Python library that prevents data loss and ensures safe file writing. Writes are staged to a temporary file and swapped in with a single atomic replace (`os.replace` on POSIX, `MoveFileEx` on Windows), so a file is either fully written or not written at all, even when errors occur mid-write. Atio supports a range of data formats and database connections.

### ✨ Key Features

- 🔒 **Atomic File Writing**: Safe writes staged through temporary files
- 📊 **Multiple Format Support**: CSV, Parquet, Excel, JSON, and more
- 🗄️ **Database Support**: Direct SQL and database writing
- 📈 **Progress Display**: Progress monitoring for large data processing
- 🔄 **Rollback**: Automatic recovery when errors occur
- 🎯 **Simple API**: Intuitive, easy-to-use interface
- 📋 **Version Management**: Snapshot-based data versioning
- 🧹 **Auto Cleanup**: Automatic deletion of old data
- 🧩 **Plugin Architecture**: Easy to extend with new formats
- 🔍 **Performance Diagnostics**: Per-stage timing logs for bottleneck analysis

## 🚀 Installation

```bash
pip install atio
```
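The temp-file-then-swap pattern described in the Overview can be sketched in plain Python. This is an illustration of the general technique only, not Atio's actual implementation; the function name and file paths are made up for the example:

```python
import os
import tempfile

def atomic_write_text(path: str, data: str) -> None:
    """Write `data` to `path` so readers never observe a partial file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Stage into a temporary file in the SAME directory, so the final
    # os.replace() stays on one filesystem and is atomic on POSIX.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk first
        os.replace(tmp_path, path)  # atomic swap: old file or new file, never half
    except BaseException:
        os.unlink(tmp_path)  # on failure the original is untouched; clean up the temp
        raise

atomic_write_text("example.txt", "complete contents")
```

If the process dies anywhere before `os.replace()`, the destination file is never touched; after it, the new contents are fully in place.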
## 📚 Usage

### `atio.write()` - Basic File/Database Writing

**Purpose**: Save data to a single file or database.

**Key Parameters**:
- `obj`: Data to save (`pandas.DataFrame`, `polars.DataFrame`, `numpy.ndarray`)
- `target_path`: File save path (required for file writing)
- `format`: Save format (`'csv'`, `'parquet'`, `'excel'`, `'json'`, `'sql'`, `'database'`)
- `show_progress`: Whether to display progress
- `verbose`: Whether to print detailed performance information

#### Basic File Writing

```python
import atio
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["Seoul", "Busan", "Incheon"]
})

# Save in various formats
atio.write(df, "users.parquet", format="parquet")
atio.write(df, "users.csv", format="csv", index=False)
atio.write(df, "users.xlsx", format="excel", sheet_name="Users")
```

#### Database Writing
๋ฐ์ดํ„ฐ ์ž„์‹œ ํŒŒ์ผ์— ์ €์žฅ ์™„๋ฃŒ: /tmp/tmp_xxx/output.parquet -[INFO] ์›์ž์  ๊ต์ฒด ์™„๋ฃŒ: /tmp/tmp_xxx/output.parquet -> output.parquet -[INFO] _SUCCESS ํ”Œ๋ž˜๊ทธ ํŒŒ์ผ ์ƒ์„ฑ: output.parquet._SUCCESS -[DEBUG] Atomic write step timings (SUCCESS): setup=0.0012s, write_call=0.2345s, replace=0.0001s, success_flag=0.0001s, total=0.2359s +df = pd.DataFrame({ + "product_id": [101, 102, 103], + "product_name": ["Laptop", "Mouse", "Keyboard"], + "price": [1200, 25, 75] +}) + +# Save to SQL database +engine = create_engine('postgresql://user:password@localhost/dbname') +atio.write(df, format="sql", name="products", con=engine, if_exists="replace") ``` -### ์˜ค๋ฅ˜ ๋ฐœ์ƒ ์‹œ (๊ธฐ๋ณธ ์‚ฌ์šฉ๋ฒ•): +#### Advanced Features (Progress, Performance Monitoring) + +```python +# Save with progress display +atio.write(large_df, "big_data.parquet", format="parquet", show_progress=True) + +# Output detailed performance information +atio.write(df, "data.parquet", format="parquet", verbose=True) + +# Use Polars DataFrame +import polars as pl +polars_df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) +atio.write(polars_df, "data.parquet", format="parquet") ``` -[INFO] ์ž„์‹œ ๋””๋ ‰ํ† ๋ฆฌ ์ƒ์„ฑ: /tmp/tmp_xxx -[INFO] ์ž„์‹œ ํŒŒ์ผ ๊ฒฝ๋กœ: /tmp/tmp_xxx/output.parquet -[INFO] ์‚ฌ์šฉํ•  writer: to_parquet (format: parquet) -[ERROR] ์ž„์‹œ ํŒŒ์ผ ์ €์žฅ ์ค‘ ์˜ˆ์™ธ ๋ฐœ์ƒ: [Errno 28] No space left on device -[INFO] Atomic write failed during write stage (took 0.1246s, error: OSError) + +### `atio.write_snapshot()` - Version-Managed Table Storage + +**Purpose**: Save data in table format with version management + +**Key Parameters**: +- `obj`: Data to save +- `table_path`: Table save path +- `mode`: Save mode ('overwrite', 'append') +- `format`: Save format + +#### Version Management Usage + +```python +# Save with version management in table format +atio.write_snapshot(df, "my_table", mode="overwrite", format="parquet") + +# Add to existing data (append mode) +new_data = 
new_data = pd.DataFrame({"name": ["David"], "age": [40], "city": ["Daejeon"]})
atio.write_snapshot(new_data, "my_table", mode="append", format="parquet")
```

### `atio.read_table()` - Table Data Reading

**Purpose**: Read data from a table.

**Key Parameters**:
- `table_path`: Table path
- `version`: Version number to read (`None` for the latest)
- `output_as`: Output format (`'pandas'`, `'polars'`)

#### Table Reading Usage

```python
# Read the latest data
latest_data = atio.read_table("my_table", output_as="pandas")

# Read a specific version
version_1_data = atio.read_table("my_table", version=1, output_as="pandas")

# Read as a Polars DataFrame
polars_data = atio.read_table("my_table", output_as="polars")
```

### `atio.expire_snapshots()` - Old Data Cleanup

**Purpose**: Clean up old snapshots and orphaned files.

**Key Parameters**:
- `table_path`: Table path
- `keep_for`: Retention period
- `dry_run`: If `True`, preview only; if `False`, actually delete

#### Data Cleanup Usage

```python
from datetime import timedelta

# Preview the old data that would be cleaned up
atio.expire_snapshots("my_table", keep_for=timedelta(days=7), dry_run=True)

# Perform the actual deletion
atio.expire_snapshots("my_table", keep_for=timedelta(days=7), dry_run=False)
```
-**์ง€์›ํ•˜๋Š” ์˜ค๋ฅ˜ ์ƒํ™ฉ:** -- โœ… **KeyboardInterrupt**: ์ธํ„ฐ๋ŸฝํŠธ ๋ฐœ์ƒ ์‹œ์ ๊ณผ ์†Œ์š” ์‹œ๊ฐ„ ํ‘œ์‹œ -- โœ… **๊ถŒํ•œ ์˜ค๋ฅ˜**: ํŒŒ์ผ ์‹œ์Šคํ…œ ๊ถŒํ•œ ๋ฌธ์ œ ์ง„๋‹จ -- โœ… **๋””์Šคํฌ ๊ณต๊ฐ„ ๋ถ€์กฑ**: ์ €์žฅ ๊ณต๊ฐ„ ๋ถ€์กฑ ์ƒํ™ฉ ์ง„๋‹จ -- โœ… **๋ฉ”๋ชจ๋ฆฌ ๋ถ€์กฑ**: ๋ฉ”๋ชจ๋ฆฌ ์••๋ฐ• ์ƒํ™ฉ ์ง„๋‹จ -- โœ… **๋„คํŠธ์›Œํฌ ์˜ค๋ฅ˜**: ๋„คํŠธ์›Œํฌ ๋“œ๋ผ์ด๋ธŒ ์ ‘๊ทผ ๋ฌธ์ œ ์ง„๋‹จ -- โœ… **์ง€์›ํ•˜์ง€ ์•Š๋Š” ํ˜•์‹**: ์ž˜๋ชป๋œ ํŒŒ์ผ ํ˜•์‹ ์ง€์ • ์‹œ ์ง„๋‹จ -- โœ… **๋™์‹œ ์ ‘๊ทผ ์˜ค๋ฅ˜**: ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋”ฉ ํ™˜๊ฒฝ์—์„œ์˜ ์ถฉ๋Œ ์ง„๋‹จ - -**์žฅ์ :** -- ๐ŸŽฏ **์ •ํ™•ํ•œ ๋ณ‘๋ชฉ์  ํŒŒ์•…**: Atio ์˜ค๋ฒ„ํ—ค๋“œ vs ์‹ค์ œ ์“ฐ๊ธฐ ์ž‘์—… ์‹œ๊ฐ„ ๊ตฌ๋ถ„ -- ๐Ÿ”ง **์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ฐ€์ด๋“œ**: ์–ด๋А ๋‹จ๊ณ„์—์„œ ์‹œ๊ฐ„์ด ๋งŽ์ด ์†Œ์š”๋˜๋Š”์ง€ ๋ช…ํ™•ํžˆ ํ‘œ์‹œ -- ๐Ÿ› **๋””๋ฒ„๊น… ์‹œ๊ฐ„ ๋‹จ์ถ•**: ๋ฌธ์ œ์˜ ์›์ธ์„ ๋น ๋ฅด๊ฒŒ ํŒŒ์•… ๊ฐ€๋Šฅ -- ๐Ÿ“Š **์„ฑ๋Šฅ ๋ชจ๋‹ˆํ„ฐ๋ง**: ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์‹œ ์„ฑ๋Šฅ ์ถ”์  -- ๐Ÿšจ **์˜ค๋ฅ˜ ์ง„๋‹จ**: ์‹คํŒจ ์ƒํ™ฉ์—์„œ๋„ ์ •ํ™•ํ•œ ์›์ธ๊ณผ ๋ฐœ์ƒ ์‹œ์  ํŒŒ์•… +## ๐Ÿ“Š Supported Formats ---- +| Format | Description | Required Parameters | Example | +|--------|-------------|-------------------|---------| +| `csv` | CSV file | `target_path` | `atio.write(df, "data.csv", format="csv")` | +| `parquet` | Parquet file | `target_path` | `atio.write(df, "data.parquet", format="parquet")` | +| `excel` | Excel file | `target_path` | `atio.write(df, "data.xlsx", format="excel")` | +| `json` | JSON file | `target_path` | `atio.write(df, "data.json", format="json")` | +| `sql` | SQL database | `name`, `con` | `atio.write(df, format="sql", name="table", con=engine)` | +| `database` | Database (Polars) | `table_name`, `connection_uri` | `atio.write(df, format="database", table_name="table", connection_uri="...")` | -## ๐Ÿง  ์™œ ์ด ๋„๊ตฌ๊ฐ€ ์ •๋ง ์ค‘์š”ํ•œ๊ฐ€์š”? 
## 🎯 Real-World Usage Scenarios

### Scenario 1: Large CSV File Writing Interrupted

**Problem**: A user was saving large analysis results to a .csv file with Pandas when an unexpected power outage or forced kernel termination occurred. The result file was corrupted, with only 3MB of 50MB written, and could not be read afterward.

**Atio Solution**: `atio.write()` first writes to a temporary file and only replaces the original after the entire write succeeds. Even if the process is interrupted, the existing file is preserved and the partial temporary file is cleaned up automatically.

### Scenario 2: File Conflicts in a Multiprocessing Environment

**Problem**: In a Python multiprocessing-based data collection pipeline, multiple processes saved to the same file simultaneously, causing conflicts. Log files were overwritten and lost, and some JSON files were left corrupted and unparseable.

**Atio Solution**: Because `atio.write()` replaces files atomically, only one process at a time can move its result to the final path.
This guarantees saves that are free of race conditions.

### Scenario 3: Data Pipeline Validation Issues

**Problem**: In an ETL job, the automated system could not tell whether a .parquet write had completed, so corrupted or incomplete data was consumed by the next stage. Missing values ended up in model training data, degrading quality.

**Atio Solution**: `atio.write_snapshot()` creates a `_SUCCESS` flag file only when a save completes successfully. Downstream stages can safely gate the pipeline on the presence or absence of `_SUCCESS`.

### Scenario 4: Lack of Data Version Management

**Problem**: As datasets for model training were updated repeatedly, it became impossible to track which version of the data trained which model. Experiments became hard to reproduce and model performance hard to compare.

**Atio Solution**: `atio.write_snapshot()` and `atio.read_table()` manage data versions automatically. A snapshot is created for each version, so you can return to the data from any point in time and reproduce experiments.
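The `_SUCCESS` gating from Scenario 3 amounts to a small guard in the downstream stage. A minimal sketch, assuming the flag file sits next to the output with a `._SUCCESS` suffix (e.g. `data.parquet._SUCCESS`); the helper name is ours, not part of Atio's API:

```python
from pathlib import Path

def is_write_complete(output_path: str) -> bool:
    # Consume data only if its companion _SUCCESS flag exists,
    # i.e. the write that produced it finished completely.
    out = Path(output_path)
    flag = Path(str(out) + "._SUCCESS")
    return out.exists() and flag.exists()

# A downstream ETL stage can then gate on the flag:
if is_write_complete("data.parquet"):
    print("safe to load")
else:
    print("skip: write not (yet) complete")
```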
### Scenario 5: System Interruption from Disk Space Shortage

**Problem**: During large data processing, the system halted because the disk filled up. Incomplete intermediate files were left behind, continuing to occupy space and requiring manual cleanup.

**Atio Solution**: `atio.expire_snapshots()` automatically cleans up snapshots and orphaned files older than the configured retention period. You can preview what would be deleted with `dry_run=True`, then run the cleanup safely.

### Scenario 6: Network Error During a Database Write

**Problem**: While saving analysis results to PostgreSQL, the network connection dropped and the write stopped. A partially written table remained in the database, breaking data integrity.

**Atio Solution**: Atio's database writes use transactions, so all data is either saved successfully or not saved at all. On a network error, an automatic rollback preserves integrity.
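The all-or-nothing behavior described in Scenario 6 is standard database transaction semantics. A plain `sqlite3` illustration of the guarantee (not Atio code; the error is simulated):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, product_name TEXT)")

try:
    with conn:  # opens a transaction: commit on success, rollback on error
        conn.execute("INSERT INTO products VALUES (101, 'Laptop')")
        raise ConnectionError("simulated network drop mid-write")
except ConnectionError:
    pass

# The interrupted batch was rolled back as a unit; no partial rows remain.
rows = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(rows)  # 0
```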
### Scenario 7: Complexity in Experimental Data Management

**Problem**: A research team ran multiple experiments at once, and experimental data got mixed together, making it hard to track which data belonged to which experiment. Result reliability dropped and reproduction became impossible.

**Atio Solution**: `atio.write_snapshot()` creates an independent table per experiment, and `atio.read_table()` reads back exactly the data for a given experiment. Automated versioning and per-experiment metadata tracking keep research reproducible and trustworthy.

### Scenario 8: Data Loss During Cloud Streaming

**Problem**: While processing real-time data collected from IoT sensors, a system restart or network error occurred. In-flight data was lost, breaking the continuity of important sensor streams.

**Atio Solution**: `atio.write_snapshot()` can buffer real-time data and persist it atomically at regular intervals. After a restart, collection resumes from the last saved point, preserving continuity.

### Scenario 9: Memory Shortage During Large Data Processing

**Problem**: While processing DataFrames larger than 10GB, the process was killed by memory exhaustion. All in-progress intermediate results were lost, forcing a restart from scratch.
**Atio Solution**: Combining `atio.write()`'s `show_progress=True` option with chunk-based processing keeps memory usage under control. Each chunk is processed only after the previous one has been saved successfully, so even a mid-run failure preserves the data already written.

### Scenario 10: Conflicts with Backup Systems

**Problem**: A large file was being written while the automatic backup system ran. The backup software tried to back up the half-written file, corrupting it; the backup copy was also incomplete.
-์ด๋กœ์จ ๊ฒฝ์Ÿ ์กฐ๊ฑด ์—†์ด ์ถฉ๋Œ ์—†์ด ์ €์žฅ์ด ๋ณด์žฅ๋ฉ๋‹ˆ๋‹ค. +## ๐Ÿ” Performance Monitoring + +```python +# Output detailed performance information +atio.write(df, "data.parquet", format="parquet", verbose=True) +``` + +Output example: +``` +[INFO] Temporary directory created: /tmp/tmp12345 +[INFO] Temporary file path: /tmp/tmp12345/data.parquet +[INFO] Writer to use: to_parquet (format: parquet) +[INFO] โœ… File writing completed (total time: 0.1234s) +``` -๐Ÿ“˜ ์‹œ๋‚˜๋ฆฌ์˜ค 3: ๋ฐ์ดํ„ฐ ํŒŒ์ดํ”„๋ผ์ธ ๊ฒ€์ฆ ๋ถˆ๊ฐ€ -๋ฌธ์ œ ์ƒํ™ฉ: -ETL ์ž‘์—…์—์„œ .parquet ์ €์žฅ์ด ์™„๋ฃŒ๋๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ์ž๋™ ์‹œ์Šคํ…œ์ด ํŒ๋‹จํ•  ์ˆ˜ ์—†์–ด, ์†์ƒ๋˜๊ฑฐ๋‚˜ ๋ฏธ์™„์„ฑ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์Œ ๋‹จ๊ณ„์—์„œ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. -๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ชจ๋ธ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๊ฒฐ์ธก๊ฐ’์ด ํฌํ•จ๋˜์–ด ํ’ˆ์งˆ ์ €ํ•˜๊ฐ€ ๋ฐœ์ƒํ–ˆ์Šต๋‹ˆ๋‹ค. +## ๐Ÿ“ฆ Dependencies + +### Required Dependencies +- Python 3.7+ +- pandas +- numpy + +### Optional Dependencies +- `pyarrow` or `fastparquet`: Parquet format support +- `openpyxl` or `xlsxwriter`: Excel format support +- `sqlalchemy`: SQL database support +- `polars`: Polars DataFrame support + +## ๐Ÿ“„ License + +This project is distributed under the Apache 2.0 License. See the [LICENSE](LICENSE) file for details. + +## ๐Ÿ› Bug Reports + +Found a bug? Please report it on the [Issues](https://github.com/seojaeohcode/atio/issues) page. + +--- -AtomicWriter๋กœ ํ•ด๊ฒฐ: -์ €์žฅ์ด ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒ๋œ ๊ฒฝ์šฐ์—๋งŒ _SUCCESS ํ”Œ๋ž˜๊ทธ ํŒŒ์ผ์„ ํ•จ๊ป˜ ์ƒ์„ฑํ•˜๋„๋ก ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. -ํ›„์† ๋‹จ๊ณ„๋Š” _SUCCESS ์œ ๋ฌด๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์•ˆ์ „ํ•˜๊ฒŒ ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ๋™ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. +

**Atio** - Safe and Fast Data Writing Library 🚀