Cutouts Training Code #141
Conversation
A few things.
I would recommend following the dev instructions here: https://github.com/ARCAFF/ARCCnet, which outline forking, cloning, and installing.
```python
v2.RandomHorizontalFlip(),
v2.RandomVerticalFlip(),
```
How valid is this without also flipping the magnetic field?
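For illustration, a minimal sketch of what a polarity-aware flip could look like for line-of-sight magnetogram tensors; the transform name and behaviour are assumptions, not part of this PR:

```python
import torch


class RandomPolarityAwareFlip(torch.nn.Module):
    """Hypothetical transform: mirror the cutout and negate the field sign."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        if torch.rand(1).item() < self.p:
            img = torch.flip(img, dims=[-1])  # mirror east-west
            img = -img  # negate pixel values so the magnetic polarity flips too
        return img
```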
```python
def make_dataframe(
    data_folder="../../data/",
    dataset_folder="arccnet-cutout-dataset-v20240715",
    file_name="cutout-mcintosh-catalog-v20240715.parq",
):
    """
    Processes the ARCCNet cutout dataset by loading a parquet file, converting
    Julian dates to datetime objects, filtering out problematic magnetograms,
    and categorizing the regions based on their magnetic class or type.

    Parameters:
    - data_folder (str): The base directory where the dataset folder is located. Default is '../../data/'.
    - dataset_folder (str): The folder containing the dataset. Default is 'arccnet-cutout-dataset-v20240715'.
    - file_name (str): The name of the parquet file to read. Default is 'cutout-mcintosh-catalog-v20240715.parq'.

    Returns:
    - df (pd.DataFrame): The processed DataFrame containing all regions with additional date and label columns.
    - AR_df (pd.DataFrame): A DataFrame filtered to include only active regions (AR) and intermediate regions (IA).
    """
    # Set the data folder using environment variable or default
    data_folder = os.getenv("ARCAFF_DATA_FOLDER", data_folder)

    # Read the parquet file
    df = pd.read_parquet(os.path.join(data_folder, dataset_folder, file_name))

    # Convert Julian dates to datetime objects
    df["time"] = df["target_time.jd1"] + df["target_time.jd2"]
    times = Time(df["time"], format="jd")
    dates = pd.to_datetime(times.iso)  # Convert to datetime objects
    df["dates"] = dates

    # Remove problematic magnetograms from the dataset
    problematic_quicklooks = ["20010116_000028_MDI.png", "20001130_000028_MDI.png", "19990420_235943_MDI.png"]

    filtered_df = []
    for ql in problematic_quicklooks:
        row = df["quicklook_path_mdi"] == "quicklook/" + ql
        filtered_df.append(df[row])
    filtered_df = pd.concat(filtered_df)
    df = df.drop(filtered_df.index).reset_index(drop=True)

    # Label the data
    df["label"] = np.where(df["magnetic_class"] == "", df["region_type"], df["magnetic_class"])
    df["date_only"] = df["dates"].dt.date

    # Filter AR and IA regions
    AR_df = pd.concat([df[df["region_type"] == "AR"], df[df["region_type"] == "IA"]])

    return df, AR_df
```
Can a bunch of this not be replaced with `QTable.read(file.parq).to_pandas()`?
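A sketch of that suggestion, assuming the catalog parquet was written by astropy so that serialized `Time` columns round-trip on read (which would make the manual jd1/jd2 arithmetic unnecessary); the file name is the default from the function above:

```python
from astropy.table import QTable

# Requires pyarrow; if the file was written by astropy, mixin columns
# such as Time are reconstructed automatically on read.
tab = QTable.read("cutout-mcintosh-catalog-v20240715.parq", format="parquet")
df = tab.to_pandas()
```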
```python
### NN Training ###
class FITSDataset(Dataset):
```
We probably want Dataset/DataLoaders/Models in submodules?
The thought here is that we may end up wanting to use WebDataset.
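For reference, a minimal sketch of the kind of pipeline WebDataset enables; the shard pattern and sample keys below are hypothetical, not from this PR:

```python
import webdataset as wds

dataset = (
    wds.WebDataset("cutouts-{000000..000009}.tar")  # hypothetical tar shards
    .shuffle(1000)           # buffer-based shuffling while streaming
    .decode()                # default decoders, keyed on file extension
    .to_tuple("pth", "cls")  # (tensor, integer label) pairs
)
```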
```python
replace_activations(child, old_act, new_act, **kwargs)


def generate_run_id(config):
```
Does Comet not do this automatically?
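For context: Comet does assign each experiment a unique key when it is created, so a manual run id may be redundant; a minimal sketch (the project name is hypothetical):

```python
from comet_ml import Experiment

# Assumes COMET_API_KEY is configured in the environment.
experiment = Experiment(project_name="arccnet-cutouts")
run_id = experiment.get_key()  # unique key generated by Comet
```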
```python
return (avg_test_loss, test_accuracy, test_precision, test_recall, test_f1, cm_test, report_df)


### IMAGES ###
```
If it requires a header, it probably deserves to be in a separate submodule.
samaloney left a comment
Lots of good stuff here.
The main changes that I think are needed:
- Integration with the existing CLU
- Integration with the existing configuration handling
- Split up the utilities.py file
```ini
protobuf==3.20.1
scikit-learn==1.3.0
train =
    astropy==6.0.1
```
So we need to think about how we manage this, as astropy and scikit-learn are already in the base install_requires section and could end up pinned to different versions.
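One hedged option, purely illustrative and not a decision from this review: keep loose pins in the base install_requires and let the train extra add only the training-specific packages, so the two sections cannot pin conflicting versions:

```ini
[options]
install_requires =
    astropy>=6.0
    scikit-learn>=1.3

[options.extras_require]
train =
    protobuf==3.20.1
```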
```python
script_dir = Path(__file__).parent.resolve()
output_dir = script_dir.parent / "trained_models"
output_dir.mkdir(parents=True, exist_ok=True)
model_path = output_dir / f"{args.model_name}-{args.model_version}.pth"
```
I think this should be using the generic configuration stuff here.
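Purely as an illustration of the idea (the project's real configuration API isn't shown in this PR; the file name and section below are hypothetical):

```python
from configparser import ConfigParser
from pathlib import Path

config = ConfigParser()
config.read("arccnet.cfg")  # hypothetical config file
output_dir = Path(config.get("paths", "trained_models", fallback="trained_models"))
output_dir.mkdir(parents=True, exist_ok=True)
# model_path would then be built from args exactly as in the hunk above
```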
```python
data_folder = os.getenv("ARCAFF_DATA_FOLDER", data_folder)

# Read the parquet file
df = pd.read_parquet(os.path.join(data_folder, dataset_folder, file_name))
```
I'd suggest using an astropy table: `tab = Table.read(...)` then `tab.to_pandas()`.
This is a very large file with lots of functions doing very different things; I think it needs to be split up into a few files/modules.
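One possible split, purely illustrative (the module names and grouping are assumptions, not prescribed by the review):

```python
# utilities.py -> a package of focused modules, e.g.:
#
# cutouts/
#     data.py      # make_dataframe, FITSDataset, transforms
#     models.py    # model definitions, replace_activations
#     training.py  # train/eval loops, generate_run_id, metrics
#     plotting.py  # the "### IMAGES ###" helpers
```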
Can this be closed in favour of #143, #144 and #145 @edoardolegnaro?

Sure.
Training code for cutouts.