
Conversation

@edoardolegnaro (Contributor) commented Aug 28, 2024

training code for cutouts

@edoardolegnaro changed the title from "Training Code" to "Cutouts Training Code" on Aug 28, 2024
@PaulJWright (Member) commented Aug 28, 2024

A few things.

  • Please ensure CI passes. You should run pre-commit on these files.
  • The requirements.txt needs to be moved and incorporated with the project requirements. Potentially we might want this under something like training in setup.cfg, so people can pip install arccnet[training] or arccnet[all]
    • NB, you only want to include what you actually installed; the current requirements.txt seems to be the result of pip freeze, which includes the whole dependency tree
  • Is there currently an implementation of inference? E.g. pulling the best model from Comet, instantiating it, and then allowing someone to use it? If not, we need an entrypoint that allows someone to download the latest model and use it
  • Please look into the CLI implementation and suggest how you think this could be best handled for training (a rough sketch follows this list).
    • something like arccnet train ar_cutouts, with flags for certain parameters that override what's currently in config.py
    • and arccnet inference ar_cutouts, which retrieves the model from the cloud, instantiates it, and performs inference
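
As a rough sketch of what that CLI shape could look like (a minimal argparse layout; the subcommand names, flags, and Comet registry names below are illustrative, not a final interface):

import argparse

def main():
    parser = argparse.ArgumentParser(prog="arccnet")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # arccnet train ar_cutouts --epochs 20 --lr 1e-4
    train = subparsers.add_parser("train", help="train a model")
    train.add_argument("task", choices=["ar_cutouts"])
    train.add_argument("--epochs", type=int, default=None, help="override the config.py value")
    train.add_argument("--lr", type=float, default=None, help="override the config.py value")

    # arccnet inference ar_cutouts --input path/to/cutout.fits
    inference = subparsers.add_parser("inference", help="run inference with the latest model")
    inference.add_argument("task", choices=["ar_cutouts"])
    inference.add_argument("--input", required=True, help="path to the input cutout")

    args = parser.parse_args()
    # For inference, the trained model could be pulled from Comet's model
    # registry, e.g. (workspace/registry names are placeholders):
    #   from comet_ml import API
    #   API().download_registry_model("arcaff", "ar-cutouts", version="1.0.0", output_path="models/")

if __name__ == "__main__":
    main()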

I would recommend following the dev instructions here: https://github.com/ARCAFF/ARCCnet that outline forking, cloning, and installing.

Comment on lines +36 to +37
v2.RandomHorizontalFlip(),
v2.RandomVerticalFlip(),
Member
How valid is this without also flipping the magnetic field?
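
The concern: a mirror flip swaps the leading/trailing order of polarities in a line-of-sight magnetogram, so flipping the image alone can produce physically implausible samples. One hypothetical way to handle it (not part of this PR, and whether sign negation is the right correction is itself a physics question) is a transform that negates the field together with the mirror:

import torch

class RandomFlipWithPolarity(torch.nn.Module):
    """Randomly mirror a magnetogram cutout and negate its LOS field sign."""

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, img):
        # img: tensor of shape (C, H, W) holding LOS field values
        if torch.rand(1).item() < self.p:
            img = torch.flip(img, dims=[-1])  # horizontal mirror
            img = -img                        # swap polarity signs together with the mirror
        return img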

Comment on lines +63 to +110
def make_dataframe(
    data_folder="../../data/",
    dataset_folder="arccnet-cutout-dataset-v20240715",
    file_name="cutout-mcintosh-catalog-v20240715.parq",
):
    """
    Process the ARCCnet cutout dataset: load a parquet file, convert Julian dates
    to datetime objects, filter out problematic magnetograms, and categorize the
    regions based on their magnetic class or type.

    Parameters
    ----------
    data_folder : str
        Base directory where the dataset folder is located. Default is '../../data/'.
    dataset_folder : str
        Folder containing the dataset. Default is 'arccnet-cutout-dataset-v20240715'.
    file_name : str
        Name of the parquet file to read. Default is 'cutout-mcintosh-catalog-v20240715.parq'.

    Returns
    -------
    df : pd.DataFrame
        The processed DataFrame containing all regions with additional date and label columns.
    AR_df : pd.DataFrame
        A DataFrame filtered to include only active regions (AR) and intermediate regions (IA).
    """
    # Set the data folder using environment variable or default
    data_folder = os.getenv("ARCAFF_DATA_FOLDER", data_folder)

    # Read the parquet file
    df = pd.read_parquet(os.path.join(data_folder, dataset_folder, file_name))

    # Convert Julian dates to datetime objects
    df["time"] = df["target_time.jd1"] + df["target_time.jd2"]
    times = Time(df["time"], format="jd")
    dates = pd.to_datetime(times.iso)  # Convert to datetime objects
    df["dates"] = dates

    # Remove problematic magnetograms from the dataset
    problematic_quicklooks = ["20010116_000028_MDI.png", "20001130_000028_MDI.png", "19990420_235943_MDI.png"]

    filtered_df = []
    for ql in problematic_quicklooks:
        row = df["quicklook_path_mdi"] == "quicklook/" + ql
        filtered_df.append(df[row])
    filtered_df = pd.concat(filtered_df)
    df = df.drop(filtered_df.index).reset_index(drop=True)

    # Label the data
    df["label"] = np.where(df["magnetic_class"] == "", df["region_type"], df["magnetic_class"])
    df["date_only"] = df["dates"].dt.date

    # Filter AR and IA regions
    AR_df = pd.concat([df[df["region_type"] == "AR"], df[df["region_type"] == "IA"]])

    return df, AR_df
@PaulJWright (Member) commented Aug 28, 2024

Can a bunch of this not be replaced with QTable.read(file.parq).to_pandas()?
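
For reference, the suggested approach would look like this (a sketch, assuming the catalog was written with astropy so the serialized Time column round-trips; parquet reading needs pyarrow, and the file name is the one used in this PR):

from astropy.table import QTable

tab = QTable.read("cutout-mcintosh-catalog-v20240715.parq", format="parquet")
df = tab.to_pandas()  # astropy converts a round-tripped Time column to datetime64

That would replace the manual target_time.jd1 + target_time.jd2 handling above, if the file was indeed serialized by astropy.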



### NN Training ###
class FITSDataset(Dataset):
Member
We probably want Dataset/DataLoaders/Models in submodules?

Member
The thought here is we may want to end up using WebDataset
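
For context, a WebDataset pipeline streams samples from tar shards rather than individual FITS files; a minimal sketch (the shard names and per-sample keys are illustrative, not from this PR):

import webdataset as wds
from torch.utils.data import DataLoader

# Each sample in a shard is a group of files sharing a key, e.g.
# 000123.npy (the cutout array) and 000123.cls (the integer label).
dataset = (
    wds.WebDataset("shards/cutouts-{000000..000099}.tar")
    .decode()                  # default decoders handle .npy and .cls
    .to_tuple("npy", "cls")
)
loader = DataLoader(dataset, batch_size=64, num_workers=4)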

replace_activations(child, old_act, new_act, **kwargs)


def generate_run_id(config):
Member
Does Comet not do this automatically?
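
Comet does assign every Experiment an auto-generated unique key and a human-readable run name, so a custom run ID may only be needed if the project wants deterministic, config-derived names; a sketch (project name illustrative, API key assumed to be set in the environment):

from comet_ml import Experiment

exp = Experiment(project_name="arccnet-cutouts")
print(exp.get_key())   # unique key assigned by Comet
print(exp.get_name())  # auto-generated run name
exp.set_name("resnet18-lr1e-4-bs64")  # optional: set a config-derived name if reproducible IDs are wanted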

return (avg_test_loss, test_accuracy, test_precision, test_recall, test_f1, cm_test, report_df)


### IMAGES ###
Member
If it requires a header it probably deserves to be in a separate submodule

@samaloney (Contributor) left a comment

Lots of good stuff here

The main changes that I think are needed:

  1. Integration with the existing CLI
  2. Integration with the existing configuration handling
  3. Split up the utilities.py file

protobuf==3.20.1
scikit-learn==1.3.0
train =
astropy==6.0.1
Contributor
So we need to think about how we manage this, as astropy and scikit-learn are in the base install_requires section and could have different versions
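
One way to avoid conflicting pins (a sketch only; the exact packages and version bounds are to be decided) is to keep the shared dependencies loosely bounded in the base install and let the extra add only the training-specific packages:

# setup.cfg (illustrative)
[options]
install_requires =
    astropy>=6.0
    scikit-learn>=1.3

[options.extras_require]
train =
    torch
    comet-ml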

Comment on lines +55 to +58
script_dir = Path(__file__).parent.resolve()
output_dir = script_dir.parent / "trained_models"
output_dir.mkdir(parents=True, exist_ok=True)
model_path = output_dir / f"{args.model_name}-{args.model_version}.pth"
Contributor
I think this should be using the generic configuration stuff here
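
Hypothetically, with a configparser-style project config, the __file__-relative path would collapse to something like this (the section/key names are placeholders, not arccnet's actual config API):

from pathlib import Path

# config = load_config()  # however arccnet exposes its configuration (hypothetical)
output_dir = Path(config["paths"]["models_dir"])
output_dir.mkdir(parents=True, exist_ok=True)
model_path = output_dir / f"{args.model_name}-{args.model_version}.pth"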

data_folder = os.getenv("ARCAFF_DATA_FOLDER", data_folder)

# Read the parquet file
df = pd.read_parquet(os.path.join(data_folder, dataset_folder, file_name))
Contributor
I'd suggest using an astropy table, tab = Table.read(...), then tab.to_pandas() (as in the QTable sketch above)

Contributor
This is a very large file with a lot of functions doing very different things; I think it needs to be split up into a few files/modules.

@samaloney (Contributor)

Can this be closed in favour of #143, #144 and #145 @edoardolegnaro?

@edoardolegnaro (Contributor, Author)

sure

@edoardolegnaro edoardolegnaro deleted the training-code branch March 25, 2025 10:24