This is a project to analyze data from ESA's Solar Orbiter over a period of two years, from 1 January 2022 to 1 January 2024.
All the data needed has already been provided in the `Data` folder as `Solar_Orbiter.csv`.
There are two primary tasks which I want to perform:
- Build a Dashboard of all the instruments' behavior over time.
- Detect anomalies in the data to understand on which dates the spacecraft was doing something interesting.
To get started, ensure you have Python 3 installed on your system. You can download the recommended stable version 3.11.8 (the version this code was written in) from the [Python Download](https://www.python.org/downloads/) page.
- Build a virtual environment (optional). This will help keep your project isolated from other Python modules on your device.
  - Make a virtual environment by typing this in your console:
    ```
    python3 -m venv my_project
    ```
  - Activate it:
    ```
    source my_project/bin/activate
    ```
  - Change `my_project` to whatever name you want to give it.
- Install all the dependencies of the project by typing this in your console:
  ```
  pip install -r Setup_File/Requirements.txt
  ```
To quickly run everything and see the main results, run this in a terminal:

```
cd Python_Scripts
python3 Run_Ml_Models.py
python3 Dashboard.py
```

Both scripts must be run from the `Python_Scripts` directory (confirm it using `pwd`).
Open the `Data` folder. It has `Solar_Orbiter.csv`. This file contains, per day, mean values of the following (a quick loading snippet follows the list):
- Radial Distance from Sun (AU)
- Electronic Box Temperature (DegC)
- Out Board Sensor Temperature (DegC)
- In Board Sensor Temperature (DegC)
- Search Coil Magnetometers Temperature (DegC)
- Solar Array Angle (Deg)
- High Gain Antenna Azimuth (Deg)
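
For orientation, here is a minimal way to load and inspect the file with pandas (run from the repository root; adjust the path if running from elsewhere):

```python
import pandas as pd

# Each row holds one day of mean values for the quantities listed above.
df = pd.read_csv("Data/Solar_Orbiter.csv")

print(df.head())      # first few days of data
print(df.describe())  # summary statistics per column
```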
**To keep this brief(er), I have provided more detailed explanations in code comments.**
- Open the `Python_Scripts` folder.
- Here, you will find a file called `Run_Ml_Models.py`.
- Running this in the terminal from the `Python_Scripts` directory will detect all the anomalies within the dataset using the Isolation Forest model.
- The output will be stored as `Solar_Orbiter_With_Anomalies.csv` in the `Data` folder.
- The Isolation Forest algorithm is an unsupervised learning method for anomaly detection (a minimal sketch follows this list) that works by:
- Randomly selecting a feature and a split value between the maximum and minimum values of that feature.
- Repeating this process recursively to create a tree-like structure.
- Anomalies are isolated in the tree with a shorter path length, i.e., fewer splits.
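
As a rough illustration of the idea (hyperparameters here are illustrative defaults, not the exact configuration used in `Run_Ml_Models.py`), scikit-learn's implementation can be used like this:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("Data/Solar_Orbiter.csv")
features = df.select_dtypes("number")  # keep numeric columns only

model = IsolationForest(n_estimators=100, random_state=42)
model.fit(features)

# decision_function: the lower the score, the more anomalous the day.
df["anomaly_score"] = model.decision_function(features)
df["anomaly"] = model.predict(features)  # -1 = anomaly, 1 = normal
df.to_csv("Data/Solar_Orbiter_With_Anomalies.csv", index=False)
```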
- SHAP values are used to explain the decisions of the Isolation Forest model.
- SHAP (SHapley Additive exPlanations) values derive from game theory and provide insights into the contribution of each feature to a specific prediction made by the model.
- Concretely, a feature's Shapley value is its marginal contribution to the prediction, averaged over all possible subsets of the other features, with each subset weighted by the number of orderings in which it can occur:

  $$\phi_j = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!} \bigl(\mathrm{val}(S \cup \{j\}) - \mathrm{val}(S)\bigr)$$

- These will be calculated upon running the `Run_Ml_Models.py` file from the `Python_Scripts` directory (a minimal sketch follows this list).
- The functions that support these calculations are kept in a separate file called `Helpers.py`.
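
A minimal sketch of how the SHAP step might look (the actual implementation lives in `Run_Ml_Models.py` and `Helpers.py`):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import IsolationForest

df = pd.read_csv("Data/Solar_Orbiter.csv")
features = df.select_dtypes("number")
model = IsolationForest(random_state=42).fit(features)

# Isolation Forest is a tree ensemble, so the fast TreeExplainer applies.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(features)

# Global feature importance: mean absolute SHAP value per feature.
importance = pd.Series(np.abs(shap_values).mean(axis=0), index=features.columns)
print(importance.sort_values(ascending=False))
```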
- Visualization:
  - A visualization of the mean absolute SHAP values, used as feature importance, is created and stored in the `Python_Scripts/Explainability` folder, based on Section 9.6.5 of the Interpretable ML Book (a plotting sketch follows this list).
  - It shows that the Out Board Sensor Temperature causes the largest change in the model's output when predicting anomalies.
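
Continuing from the SHAP sketch above (reusing its `importance` Series), such a plot can be produced and saved with Plotly roughly like this:

```python
import plotly.express as px

imp = importance.sort_values()  # `importance` comes from the SHAP sketch above
fig = px.bar(
    x=imp.values, y=imp.index, orientation="h",
    labels={"x": "Mean |SHAP value|", "y": "Feature"},
    title="SHAP Feature Importance",
)
fig.write_html("Python_Scripts/Explainability/Shap_Values_Plot.html")
```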
References:
- Interpretable ML Book - SHAP (Section 9.6.5 SHAP Feature Importance, Section 9.6 SHAP)
- PyData Conference on SHAP - Tel Aviv
- SHAP Documentation
- Output:
  - `Solar_Orbiter_With_Anomalies.csv` will be saved within the `Data` folder (this contains the original dataset plus the anomaly scores explained in the code).
  - `Shap_Values_Plot.html` will be saved in `Python_Scripts/Explainability`, containing the visualisation of feature importance.
- Local run - The dashboard can be run on a local server on your own system by running the `Dashboard.py` file within the `Python_Scripts` folder.
- The Dashboard consists of 4 key visualisations (a stripped-down sketch follows this list):
  - Time Series Chart: shows how each feature varies over time and gives insights into how the data looks overall.
  - Correlation Heatmap: calculates the correlation coefficient between features and displays it as a heatmap. Interestingly, the Solar Array Angle is highly correlated with the Radial Distance from the Sun. This is because the solar arrays change their angle to keep pointing at the Sun.
  - Anomaly Score Chart: used to find the anomalous dates for the spacecraft; the lower the score, the more anomalous the date. Interestingly, 4 May to 11 May are identified as anomalous dates by the model. This is consistent with the spacecraft experiencing a high-noise period around that time (https://www.cosmos.esa.int/web/soar/support-data).
  - Feature Importance Plot: simply embedded from the Explainability folder using the Dash HTML component `Iframe`.
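
A stripped-down sketch of that structure (component choices, column assumptions, and file paths are illustrative, not the actual `Dashboard.py`; here the first CSV column is assumed to be the date):

```python
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

df = pd.read_csv("../Data/Solar_Orbiter_With_Anomalies.csv")
numeric = df.select_dtypes("number")

app = Dash(__name__)
app.layout = html.Div([
    # Time series of the features over time.
    dcc.Graph(figure=px.line(df, x=df.columns[0], y=numeric.columns,
                             title="Time Series")),
    # Correlation heatmap across the numeric features.
    dcc.Graph(figure=px.imshow(numeric.corr(), title="Correlation Heatmap")),
    # Feature-importance plot embedded as an Iframe; Dash serves static
    # files placed in an `assets/` directory.
    html.Iframe(src="assets/Shap_Values_Plot.html",
                style={"width": "100%", "height": "500px"}),
])

if __name__ == "__main__":
    app.run(debug=True)  # use app.run_server(debug=True) on older Dash versions
```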
- Deploying using Render - I have deployed the dashboard on the web using Render, largely by following a tutorial:
  - Please follow this tutorial to do the same: https://www.youtube.com/watch?v=XWJBJoV5yww&t=0s
  - You will find the entire dashboard (named `app`) and everything needed inside the `src` folder of the `Deploy_With_Render` folder.
  - Copy the `Deploy_With_Render` directory and open it as a separate project to avoid nested git repositories.
  - Ensure you have dash-tools installed; it is in `requirements.txt` (so I am assuming it is installed; otherwise run `pip install dash-tools`).
  - Type `dashtools gui` in a terminal.
  - Go to the Deploy section on the newly opened page.
  - Open your file there by putting the path of your folder in the text box.
  - Follow the instructions further in the tutorial and you will be able to deploy it, just like this: https://my-render-jh3k.onrender.com/
- Memory Profiling:
  - We use the `memory_profiler` library to do memory profiling.
  - The results are stored in `Scalability/Memory_Profiling`.
  - From the results, it can be seen that within the dashboard, every line involves about 120 MiB of memory, while the callback requires 120 MiB recurrently.
  - Also, within `Run_Ml_Models.py`, calculating SHAP values and fitting the model are the most memory-intensive tasks.
  - Interestingly, as seen in `Scalability/Plot_ML_Model`, memory usage grows and then declines for `Run_Ml_Models.py`, but never declines for the dashboard.
  - You can reproduce these results by reading the comments in the `Dashboard.py` file; you will simply need to uncomment 2 lines (the general pattern is sketched below).
  - Reference: https://pypi.org/project/memory-profiler/
  - Reference: https://github.com/pythonprofilers/memory_profiler
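
For reference, the usual `memory_profiler` pattern looks like this (the decorated function here is hypothetical; the two lines to uncomment in `Dashboard.py` follow the same idea):

```python
from memory_profiler import profile

@profile  # prints line-by-line memory usage when the function executes
def load_and_fit():
    import pandas as pd
    from sklearn.ensemble import IsolationForest
    df = pd.read_csv("../Data/Solar_Orbiter.csv")
    IsolationForest().fit(df.select_dtypes("number"))

if __name__ == "__main__":
    load_and_fit()  # run normally, or via: python3 -m memory_profiler <script>.py
```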
- Time Profiling:
  - We use the `cProfile` package for time profiling.
  - The results are stored in `Scalability/Time_Profiling`.
  - To reproduce, simply follow the instructions at the bottom of the code in `Dashboard.py` and `Run_Ml_Models.py` (the general pattern is sketched below).
  - You can interpret the results using SnakeViz, as mentioned there.
  - The results show the time required to load the dashboard completely, along with a breakdown of the time required by different components.
  - They also show the time required to run the model and obtain the Shapley values and visualisation, with a breakdown.
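
The typical `cProfile` pattern looks like this (the profiled callable and output filename are illustrative; the exact lines are at the bottom of the two scripts):

```python
import cProfile
import pstats

def run_models():
    ...  # placeholder for the actual work, e.g. the body of Run_Ml_Models.py

# Write profiling stats to a file...
cProfile.run("run_models()", "Time_Profiling.prof")

# ...then print the 10 most expensive calls, or explore them interactively with:
#   snakeviz Time_Profiling.prof
pstats.Stats("Time_Profiling.prof").sort_stats("cumulative").print_stats(10)
```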
Access the deployed dashboard at this link: [Dashboard](https://my-render-jh3k.onrender.com/) (the server is on a free tier and hence needs to restart; give it some time to reload. I will buy a paid server for better deployment in the next version.)
Security and License
Please read the License to ethically and safely reproduce the repository. Please read the Security policy to report any security issues. Please report any issues in the Issues section and I will try to fix them soon.