
Conversation

@edoardolegnaro (Contributor) commented Aug 28, 2024

training code for cutouts

@edoardolegnaro changed the title from "Training Code" to "Cutouts Training Code" on Aug 28, 2024
@PaulJWright (Member) commented Aug 28, 2024

A few things.

  • Please ensure CI passes. You should run pre-commit on these files.
  • The requirements.txt needs to be moved and incorporated with the project requirements. Potentially we might want this under something like training in setup.cfg, so people can pip install arccnet[training] or arccnet[all]
    • NB, you only want to include what you actually installed; the current requirements.txt seems to be the result of pip freeze, which includes the whole dependency tree
  • Is there currently an implementation of inference? E.g. pulling the best model from Comet, instantiating it, and then allowing someone to use it? If not, we need an entrypoint that allows someone to download the latest model and use it
  • Please look into the CLI implementation and suggest how you think this could be best handled for training (a rough sketch follows this list).
    • something like arccnet train ar_cutouts, with flags for certain parameters that override what's currently in config.py
    • and arccnet inference ar_cutouts, which retrieves the model from the cloud, instantiates it, and performs inference
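
As a rough sketch of what that CLI shape could look like (a minimal argparse layout; the subcommand names, flags, and Comet registry names below are illustrative, not a final interface):

import argparse

def main():
    parser = argparse.ArgumentParser(prog="arccnet")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # arccnet train ar_cutouts --epochs 20 --lr 1e-4
    train = subparsers.add_parser("train", help="train a model")
    train.add_argument("task", choices=["ar_cutouts"])
    train.add_argument("--epochs", type=int, default=None, help="override the config.py value")
    train.add_argument("--lr", type=float, default=None, help="override the config.py value")

    # arccnet inference ar_cutouts --input path/to/cutout.fits
    inference = subparsers.add_parser("inference", help="run inference with the latest model")
    inference.add_argument("task", choices=["ar_cutouts"])
    inference.add_argument("--input", required=True, help="path to the input cutout")

    args = parser.parse_args()
    # For inference, the trained model could be pulled from Comet's model
    # registry, e.g. (workspace/registry names are placeholders):
    #   from comet_ml import API
    #   API().download_registry_model("arcaff", "ar-cutouts", version="1.0.0", output_path="models/")

if __name__ == "__main__":
    main()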

I would recommend following the dev instructions here: https://github.com/ARCAFF/ARCCnet that outline forking, cloning, and installing.

Comment on lines +36 to +37
v2.RandomHorizontalFlip(),
v2.RandomVerticalFlip(),
Member
How valid is this without also flipping the magnetic field?
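
The concern: a mirror flip swaps the leading/trailing order of polarities in a line-of-sight magnetogram, so flipping the image alone can produce physically implausible samples. One hypothetical way to handle it (not part of this PR, and whether sign negation is the right correction is itself a physics question) is a transform that negates the field together with the mirror:

import torch

class RandomFlipWithPolarity(torch.nn.Module):
    """Randomly mirror a magnetogram cutout and negate its LOS field sign."""

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, img):
        # img: tensor of shape (C, H, W) holding LOS field values
        if torch.rand(1).item() < self.p:
            img = torch.flip(img, dims=[-1])  # horizontal mirror
            img = -img                        # swap polarity signs together with the mirror
        return img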

Comment on lines +63 to +110
def make_dataframe(
    data_folder="../../data/",
    dataset_folder="arccnet-cutout-dataset-v20240715",
    file_name="cutout-mcintosh-catalog-v20240715.parq",
):
    """
    Process the ARCCnet cutout dataset: load a parquet file, convert Julian dates
    to datetime objects, filter out problematic magnetograms, and categorize the
    regions based on their magnetic class or type.

    Parameters
    ----------
    data_folder : str
        Base directory where the dataset folder is located. Default is '../../data/'.
    dataset_folder : str
        Folder containing the dataset. Default is 'arccnet-cutout-dataset-v20240715'.
    file_name : str
        Name of the parquet file to read. Default is 'cutout-mcintosh-catalog-v20240715.parq'.

    Returns
    -------
    df : pd.DataFrame
        The processed DataFrame containing all regions with additional date and label columns.
    AR_df : pd.DataFrame
        A DataFrame filtered to include only active regions (AR) and intermediate regions (IA).
    """
    # Set the data folder using environment variable or default
    data_folder = os.getenv("ARCAFF_DATA_FOLDER", data_folder)

    # Read the parquet file
    df = pd.read_parquet(os.path.join(data_folder, dataset_folder, file_name))

    # Convert Julian dates to datetime objects
    df["time"] = df["target_time.jd1"] + df["target_time.jd2"]
    times = Time(df["time"], format="jd")
    dates = pd.to_datetime(times.iso)  # Convert to datetime objects
    df["dates"] = dates

    # Remove problematic magnetograms from the dataset
    problematic_quicklooks = ["20010116_000028_MDI.png", "20001130_000028_MDI.png", "19990420_235943_MDI.png"]

    filtered_df = []
    for ql in problematic_quicklooks:
        row = df["quicklook_path_mdi"] == "quicklook/" + ql
        filtered_df.append(df[row])
    filtered_df = pd.concat(filtered_df)
    df = df.drop(filtered_df.index).reset_index(drop=True)

    # Label the data
    df["label"] = np.where(df["magnetic_class"] == "", df["region_type"], df["magnetic_class"])
    df["date_only"] = df["dates"].dt.date

    # Filter AR and IA regions
    AR_df = pd.concat([df[df["region_type"] == "AR"], df[df["region_type"] == "IA"]])

    return df, AR_df
@PaulJWright (Member) commented Aug 28, 2024

Can a bunch of this not be replaced with QTable.read(file.parq).to_pandas()?
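
For reference, the suggested approach would look like this (a sketch, assuming the catalog was written with astropy so the serialized Time column round-trips; parquet reading needs pyarrow, and the file name is the one used in this PR):

from astropy.table import QTable

tab = QTable.read("cutout-mcintosh-catalog-v20240715.parq", format="parquet")
df = tab.to_pandas()  # astropy converts a round-tripped Time column to datetime64

That would replace the manual target_time.jd1 + target_time.jd2 handling above, if the file was indeed serialized by astropy.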



### NN Training ###
class FITSDataset(Dataset):
Member
We probably want Dataset/DataLoaders/Models in submodules?

Member
The thought here is we may want to end up using WebDataset
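
For context, a WebDataset pipeline streams samples from tar shards rather than individual FITS files; a minimal sketch (the shard names and per-sample keys are illustrative, not from this PR):

import webdataset as wds
from torch.utils.data import DataLoader

# Each sample in a shard is a group of files sharing a key, e.g.
# 000123.npy (the cutout array) and 000123.cls (the integer label).
dataset = (
    wds.WebDataset("shards/cutouts-{000000..000099}.tar")
    .decode()                  # default decoders handle .npy and .cls
    .to_tuple("npy", "cls")
)
loader = DataLoader(dataset, batch_size=64, num_workers=4)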

replace_activations(child, old_act, new_act, **kwargs)


def generate_run_id(config):
Member
Does Comet not do this automatically?
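
Comet does assign every Experiment an auto-generated unique key and a human-readable run name, so a custom run ID may only be needed if the project wants deterministic, config-derived names; a sketch (project name illustrative, API key assumed to be set in the environment):

from comet_ml import Experiment

exp = Experiment(project_name="arccnet-cutouts")
print(exp.get_key())   # unique key assigned by Comet
print(exp.get_name())  # auto-generated run name
exp.set_name("resnet18-lr1e-4-bs64")  # optional: set a config-derived name if reproducible IDs are wanted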

return (avg_test_loss, test_accuracy, test_precision, test_recall, test_f1, cm_test, report_df)


### IMAGES ###
Member
If it requires a header it probably deserves to be in a separate submodule

@samaloney (Contributor) left a comment

Lots of good stuff here

The main changes that I think are needed:

  1. Integration with the existing CLI
  2. Integration with the existing configuration handling
  3. Split up the utilities.py file

protobuf==3.20.1
scikit-learn==1.3.0
train =
astropy==6.0.1
Contributor
So we need to think about how we manage this, as astropy and scikit-learn are in the base install_requires section and could have different versions
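
One way to avoid conflicting pins (a sketch only; the exact packages and version bounds are to be decided) is to keep the shared dependencies loosely bounded in the base install and let the extra add only the training-specific packages:

# setup.cfg (illustrative)
[options]
install_requires =
    astropy>=6.0
    scikit-learn>=1.3

[options.extras_require]
train =
    torch
    comet-ml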

Comment on lines +55 to +58
script_dir = Path(__file__).parent.resolve()
output_dir = script_dir.parent / "trained_models"
output_dir.mkdir(parents=True, exist_ok=True)
model_path = output_dir / f"{args.model_name}-{args.model_version}.pth"
Contributor
I think this should be using the generic configuration stuff here
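
Hypothetically, with a configparser-style project config, the __file__-relative path would collapse to something like this (the section/key names are placeholders, not arccnet's actual config API):

from pathlib import Path

# config = load_config()  # however arccnet exposes its configuration (hypothetical)
output_dir = Path(config["paths"]["models_dir"])
output_dir.mkdir(parents=True, exist_ok=True)
model_path = output_dir / f"{args.model_name}-{args.model_version}.pth"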

data_folder = os.getenv("ARCAFF_DATA_FOLDER", data_folder)

# Read the parquet file
df = pd.read_parquet(os.path.join(data_folder, dataset_folder, file_name))
Contributor
I'd suggest using an astropy table, tab = Table.read(...), then tab.to_pandas() (as in the QTable sketch above)

Contributor
This is a very large file with a lot of functions doing very different things; I think it needs to be split up into a few files/modules.

@samaloney (Contributor)

Can this be closed in favour of #143, #144 and #145 @edoardolegnaro?

@edoardolegnaro (Contributor, Author)

sure

@edoardolegnaro edoardolegnaro deleted the training-code branch March 25, 2025 10:24