Merge branch 'release/v1.0.2'
RobertoChiosa committed Sep 17, 2024
2 parents 42854d7 + 70386b7 commit 848f03b
Showing 14 changed files with 719 additions and 88 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/release.yml
@@ -1,4 +1,4 @@
name: Release
name: Release package

on:
# should work
5 changes: 4 additions & 1 deletion .gitignore
@@ -111,4 +111,7 @@ GitHub.sublime-settings
*.png
.DS_Store
!data.csv
!example.png
!example.png
!cmp.html
!time_window_corrected.csv
!group_cluster.csv
10 changes: 7 additions & 3 deletions Dockerfile
@@ -4,7 +4,7 @@
# If you need more help, visit the Dockerfile reference guide at
# https://docs.docker.com/engine/reference/builder/

ARG PYTHON_VERSION=3.12.3
ARG PYTHON_VERSION=3.11
FROM python:${PYTHON_VERSION}-slim as base

# Prevents Python from writing pyc files.
@@ -36,8 +36,12 @@ RUN adduser \
# Leverage bind mounts to pyproject.toml and poetry.lock to avoid having to
# copy them into this layer.
RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements.txt,target=requirements.txt \
python -m pip install -r requirements.txt
--mount=type=bind,source=pyproject.toml,target=pyproject.toml \
--mount=type=bind,source=poetry.lock,target=poetry.lock \
python -m pip install poetry && \
poetry config virtualenvs.create false && \
poetry install --only main --no-interaction --no-ansi && \
rm -rf $POETRY_CACHE_DIR

# Switch to the non-privileged user to run the application.
USER appuser
2 changes: 2 additions & 0 deletions Makefile
@@ -36,13 +36,15 @@ docker-run:


.PHONY: rm-git-cache
rm-git-cache: ## Remove git cached files
rm-git-cache:
@echo "Removing git cached files"
git add .
git rm -r --cached .
git add .

.PHONY: setup
setup: ## Setup the project
setup:
@if [ ! -d "${VENV}" ]; then \
echo "Creating venv"; \
87 changes: 40 additions & 47 deletions README.md
@@ -1,9 +1,13 @@
# Contextual Matrix Profile Calculation Tool

The Matrix Profile has the potential to revolutionize time series data mining because of its generality, versatility,
simplicity and scalability. In particular it has implications for time series motif discovery, time series joins,
shapelet discovery (classification), density estimation, semantic segmentation, visualization, rule discovery,
clustering etc.
Matrix Profile is an algorithm capable of discovering motifs and discords in time series data. By computing the
(z-normalized) Euclidean distance between each subsequence of a time series and its nearest neighbor, it provides
insights into potential anomalies and repetitive patterns. In the field of building energy management it can be
employed to detect anomalies in electrical load time series.

This tool is a Python implementation of the Matrix Profile algorithm that uses contextual information (such as
external air temperature) to identify abnormal patterns in electrical load subsequences that start within predefined
sub-daily time windows, as shown in the following figure.

![](./docs/example.png)
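
For orientation, the following is a minimal, illustrative sketch of a plain (non-contextual) matrix profile computed
with the Stumpy package listed in the References below. It assumes the bundled sample `data.csv` layout (15-minute
observations, a load column named `column_1`) and is not this tool's own implementation:

```python
# Illustrative only (not this tool's code): plain matrix profile with stumpy.
# Assumes the sample data.csv layout: 15-minute observations, a "timestamp"
# column and an electrical load column named "column_1".
import numpy as np
import pandas as pd
import stumpy

df = pd.read_csv("src/cmp/data/data.csv", parse_dates=["timestamp"], index_col="timestamp")
m = 96  # subsequence length: one day of 15-minute observations

# Column 0 of stumpy.stump holds each subsequence's distance to its nearest
# neighbour; the largest value flags the most anomalous (discord) subsequence.
mp = stumpy.stump(df["column_1"].to_numpy(dtype=float), m)
discord = int(np.argmax(mp[:, 0].astype(float)))
print(f"Most anomalous daily subsequence starts at {df.index[discord]}")
```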

@@ -20,7 +24,7 @@ clustering etc.

## Usage

The tool comes with a cli that helps you to execute the script with the desired commands
The tool comes with a CLI that helps you execute the script with the desired commands:

```console
$ python -m src.cmp.main -h
@@ -39,20 +43,41 @@ options:
The arguments to pass to the script are the following:

* `input_file`: The input dataset via an HTTP URL. The tool should then download the dataset from that URL; since it's a
presigned URL, the tool would not need to deal with authentication—it can just download the dataset directly.
pre-signed URL, the tool would not need to deal with authentication—it can just download the dataset directly.
* `variable_name`: The variable name to be used for the analysis (i.e., the column of the csv that contains the
electrical load under analysis).
* `output_file`: The local path to the output HTML report. The platform would then get that HTML report and upload it to
  the object storage service for the user to review later.

You can run the main script through the console using either local files or data downloaded from an external URL. This
repository comes with a sample dataset (data.csv) that you can use to generate a report and you can pass the local path
repository comes with a sample dataset ([data.csv](./src/cmp/data/data.csv)) that you can use to generate a report and
you can pass the local path
as the `input_file` argument as follows:

### Data format

todo
The tool requires the user to provide a `csv` file containing the electrical power time series of a specific
building, meter, or energy system (e.g., the whole-building electrical power time series). The `csv` uses a wide table
format as follows:

```csv
timestamp,column_1,temp
2019-01-01 00:00:00,116.4,-0.6
2019-01-01 00:15:00,125.6,-0.9
2019-01-01 00:30:00,119.2,-1.2
```

The csv must have the following columns:

- `timestamp` [case sensitive]: The timestamp of the observation in the format `YYYY-MM-DD HH:MM:SS`. This column is
  expected to be a UTC timezone string; it is internally transformed by the tool into the index of the dataframe.
- `temp` [case sensitive]: The external air temperature in Celsius degrees. This column is required to perform
  thermal-sensitive analysis on the electrical load.
- `column_1`: The dataframe may then have `N` arbitrary columns that refer to electrical load time series. The user has
  to specify the name of the column containing the electrical load of interest in the `variable_name` argument.
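
As a minimal sketch (assuming pandas; this is not the tool's own loading code), a file in this format can be loaded
and checked like this:

```python
# Minimal sketch (not the tool's own loader): read a CSV in the format above,
# turning the UTC "timestamp" column into the dataframe index.
import pandas as pd

df = pd.read_csv("data.csv", parse_dates=["timestamp"], index_col="timestamp")

# "column_1" stands in for whatever name is passed as the variable_name argument.
missing = {"temp", "column_1"} - set(df.columns)
if missing:
    raise ValueError(f"Missing required column(s): {missing}")
```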

### Run locally

@@ -62,6 +87,7 @@ Create virtual environment and activate it and install dependencies:
```bash
make setup
```

- Linux:
```bash
python3 -m venv .venv
@@ -134,44 +160,6 @@ Run the docker image with the same arguments as before
At the end of the execution you can find the results in the [`results`](src/cmp/results) folder inside the Docker
container.

## Additional Information

```python
# 2) User Defined Context
# We want to find all the subsequences that start from 00:00 to 02:00 (2 hours)
# and cover the whole day. In order to avoid overlapping we define the window
# length as the whole day of observation minus the context length.
from distancematrix.consumer.contextmanager import GeneralStaticManager  # seriesdistancematrix package (see References); adjust the import path to your installed version

# `data` is the electrical load series at 15-minute resolution, so:
obs_per_hour = 4   # [observations/hour]
obs_per_day = 96   # [observations/day]

# - Beginning of the context 00:00 AM [hours]
context_start = 0
# - End of the context 02:00 AM [hours]
context_end = 2
# - Context time window length 2 [hours]
m_context = context_end - context_start  # 2
# - Time window length [observations]
#   m = 96 [observations] - 4 [observations/hour] * 2 [hours] = 88 [observations] = 22 [hours]
m = obs_per_day - obs_per_hour * m_context  # 88

# Context Definition:
# example FROM 00:00 to 02:00
# - m_context = 2 [hours]
# - obs_per_hour = 4 [observations/hour]
# - context_start = 0 [hours]
# - context_end = context_start + m_context = 0 [hours] + 2 [hours] = 2 [hours]
contexts = GeneralStaticManager([
    range(
        # FROM [observations] = x * 96 [observations] + 0 [hours] * 4 [observations/hour]
        (x * obs_per_day) + context_start * obs_per_hour,
        # TO [observations] = x * 96 [observations] + (0 [hours] + 2 [hours]) * 4 [observations/hour]
        (x * obs_per_day) + (context_start + m_context) * obs_per_hour)
    for x in range(len(data) // obs_per_day)
])
# `contexts` is later consumed by the ContextualMatrixProfile of the seriesdistancematrix package.
```

## Cite

You can cite this work either through [this Bibtex file](./docs/ref.bib) or the
@@ -183,7 +171,12 @@ following plain text citation
## Contributors

- [Roberto Chiosa](https://github.com/RobertoChiosa)
- Author: [Roberto Chiosa](https://github.com/RobertoChiosa)

## References

- Series Distance Matrix repository (https://github.com/predict-idlab/seriesdistancematrix)
- Stumpy Package (https://stumpy.readthedocs.io/en/latest/)

## License

4 changes: 3 additions & 1 deletion RELEASE.md
@@ -1,3 +1,5 @@
### Features:

- Added command line interface for the tool
- Fixed csv path issues
- Documentation updates
- Docker run support
