+{"cells":[{"source":"\n\nAs a data engineer, you often face unexpected challenges in workflows. In this scenario, the `load_and_check()` function, in charge of managing sales data, encounters issues after the latest update. Unfortunately, your colleague who usually handles this code is currently on holiday, leaving you to troubleshoot.\n\nYour task is to identify and address the issues in the sales data pipeline **without getting into every line of code.** The `load_and_check()` function loads the `sales.csv` dataset and performs several checks. Initially, it verifies the dataset's shape, ensuring it matches expectations. Subsequently, integrity checks are conducted to maintain data consistency and flag any anomalies.\n\nThe `sales.csv` dataset has various columns, focusing on critical fields such as `Total`, `Quantity`, `Unit price`, `Tax`, and `Date`. It's essential that the `Tax` column accurately represents 5% of the subtotal, calculated from the `Unit Price` multiplied by `Quantity`.\n\n**Your goal is to sort out the pipeline issues, aiming for the code to return 2 success messages upon completion.** While at it, try to keep the original structure as much as possible. Only change existing columns if necessary, and make sure the data remains accurate. Be mindful of updating any relevant if statements in the checks as needed.","metadata":{"collapsed":false,"jupyter":{"outputs_hidden":false,"source_hidden":false}},"id":"740d0831-d810-4985-86d2-a1efabbf669e","cell_type":"markdown"},{"source":"import pandas as pd\n\ndef load_and_check():\n # Step 1: Load the data and check if it has the expected shape\n data = pd.read_csv('sales.csv') \n #print(data.head())\n #print(data.shape)\n #print(data.shape[1])\n #print(f\"Data shape: {data.shape}\")\n if data.shape[1] != 17:\n print(\"Please check that the data was loaded properly!\")\n #print(\"Columns:\", data.columns)\n else:\n print(\"Data loaded successfully.\")\n\n # Step 2: Calculate statistical values and merge with the original data\n grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])\n #print(grouped_data.head())\n grouped_data['threshold'] = 3 * grouped_data['std']\n #print(grouped_data.head())\n grouped_data['max'] = grouped_data['mean'] + grouped_data.threshold\n #print(grouped_data.head())\n grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)\n #print(grouped_data.head())\n data = pd.merge(data, grouped_data, on='Date', how='left')\n \n #print(data[['Date', 'Total', 'min','max']].head())\n \n data['Tax'] = (data['Quantity'] * data['Unit price']).astype(float) * 0.05\n\n # Condition_1 checks if 'Total' is within the acceptable range (min to max) for each date\n data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])\n data['Condition_1'].fillna(False, inplace=True) \n #print(data[['Date','Condition_1', 'Total', 'min','max']].head())\n \n # Condition_2 checks if the 'Tax' column is properly calculated as 5% of (Quantity * Unit price)\n \n #data['Condition_2'] = round(data['Quantity'] * data['Unit price'] * 0.05, 1) == round(data['Tax'], 1)\n #print(data[['Date','Condition_2', 'Quantity', 'Unit price','Tax']].head()) # ici on remarque que la Tax est null dans certains cas \n \n data['Tax'] = (data['Quantity'] * data['Unit price']).astype(float) * 0.05\n #data.loc[data['Tax'] == 0.00, 'Tax'] = data['Quantity'] * data['Unit price'] * 0.05\n data['Condition_2'] = round(data['Quantity'] * data['Unit price'] * 0.05, 1) == 
```python
import pandas as pd

def load_and_check():
    # Step 1: Load the data and check if it has the expected shape
    data = pd.read_csv('sales.csv')
    if data.shape[1] != 17:
        print("Please check that the data was loaded properly!")
    else:
        print("Data loaded successfully.")

    # Step 2: Calculate per-date statistics and merge them with the original data
    grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
    grouped_data['threshold'] = 3 * grouped_data['std']
    grouped_data['max'] = grouped_data['mean'] + grouped_data['threshold']
    grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(
        lambda row: max(0, row['mean'] - row['threshold']), axis=1)
    data = pd.merge(data, grouped_data, on='Date', how='left')

    # Debugging showed that Tax was null in some rows after the update,
    # so recompute it as 5% of (Quantity * Unit price) before running the checks
    data['Tax'] = (data['Quantity'] * data['Unit price']).astype(float) * 0.05

    # Condition_1 checks if 'Total' is within the acceptable range (min to max) for each date
    data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
    data['Condition_1'] = data['Condition_1'].fillna(False)

    # Condition_2 checks if the 'Tax' column is properly calculated as 5% of (Quantity * Unit price)
    data['Condition_2'] = round(data['Quantity'] * data['Unit price'] * 0.05, 1) == round(data['Tax'], 1)

    # Step 3: Check if all rows pass both Condition_1 and Condition_2.
    # Success indicates data integrity; failure suggests potential issues.
    if (data['Condition_1'].sum() == data.shape[0]) and (data['Condition_2'].sum() == data.shape[0]):
        print("Data integrity check was successful! All rows pass the integrity conditions.")
    else:
        print("Something fishy is going on with the data! Integrity check failed for some rows!")

    return data

processed_data = load_and_check()
```
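The row-wise `apply` that floors the lower bound at zero works, but pandas can express the same thing vectorized with `Series.clip`. This is an equivalent sketch on a synthetic stand-in for `grouped_data`, not a required change:

```python
import pandas as pd

# Synthetic stand-in for grouped_data, just to demonstrate the clip-based bound
grouped_data = pd.DataFrame({'mean': [100.0, 20.0], 'std': [10.0, 15.0]})
grouped_data['threshold'] = 3 * grouped_data['std']

# Vectorized equivalent of apply(lambda row: max(0, mean - threshold), axis=1)
grouped_data['min'] = (grouped_data['mean'] - grouped_data['threshold']).clip(lower=0)
print(grouped_data)   # second row's min is clipped from -25 to 0
```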
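Rounding both sides to one decimal is enough for this dataset, but if you ever hit floating-point edge cases, a tolerance-based comparison with `numpy.isclose` is a common alternative. A minimal sketch on made-up rows (not from `sales.csv`):

```python
import numpy as np
import pandas as pd

# Made-up rows to illustrate the tolerance-based comparison
df = pd.DataFrame({'Quantity': [7, 3],
                   'Unit price': [74.69, 46.33],
                   'Tax': [26.1415, 6.9495]})
expected = df['Quantity'] * df['Unit price'] * 0.05
print(np.isclose(expected, df['Tax']))   # [ True  True ]
```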