Added part of data from Alfalfa catalog to the test server #5

Open: wants to merge 2 commits into base: master
48 changes: 48 additions & 0 deletions alfalfa/get_column_info.py
@@ -0,0 +1,48 @@
import pandas as pd
import numpy as np


df = pd.read_csv("./alfalfa/tables/raw_info.csv", sep=" ", engine="python")
Member commented:

Let's make this a command-line parameter, so that if the path changes it can be updated easily instead of hunting for it in the code.

click is a good, simple library for this kind of thing:
https://click.palletsprojects.com/en/8.1.x/quickstart/#basic-concepts-creating-a-command
https://click.palletsprojects.com/en/8.1.x/quickstart/#adding-parameters
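
A minimal sketch of what the reviewer's suggestion could look like, assuming the two paths hard-coded in `get_column_info.py` become click options. The option names `--raw-info` and `--main-data` and the function name `main` are hypothetical, not part of the PR; the defaults are the paths currently in the diff. The command is exercised through click's own `CliRunner` test helper so the example does not touch the file system.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--raw-info",  # hypothetical option name
    default="./alfalfa/tables/raw_info.csv",
    show_default=True,
    help="Path to the raw column description file.",
)
@click.option(
    "--main-data",  # hypothetical option name
    default="./alfalfa/tables/main_data.csv",
    show_default=True,
    help="Path to the catalog data file.",
)
def main(raw_info: str, main_data: str) -> None:
    # A real script would call pd.read_csv(raw_info, ...) here;
    # this sketch only echoes the paths it received.
    click.echo(f"raw_info={raw_info}")
    click.echo(f"main_data={main_data}")


# Invoke the command in-process, overriding only one of the two paths.
runner = CliRunner()
result = runner.invoke(main, ["--raw-info", "other/raw_info.csv"])
print(result.output)
```

In an actual script the command would be called as `main()` under the usual `if __name__ == "__main__":` guard, so that a changed path only requires a different command-line argument.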

df.rename(columns={
    "Units": "unit",
    "Label": "name",
    "Explanations": "description",
}, inplace=True)
df = df[["name", "unit", "description"]]

def check_nans(row) -> str:
    if row["unit"] == "---":
        return np.nan
    return row["unit"]

df['unit'] = df.apply(check_nans, axis=1)

def escape_percent(value):
    if isinstance(value, str):
        return value.replace('%', '%%')
    return value

df['description'] = df['description'].apply(escape_percent)


# also replacing dots with NaN in catalog data
def check_nans_in_df(row) -> str:
    if row["Name"] == "........":
        return np.nan
    return row["Name"]

data = pd.read_csv("./alfalfa/tables/main_data.csv")
Member commented:

Let's make this one a command-line argument too.

data['Name'] = data.apply(check_nans_in_df, axis=1)
data.to_csv(f"./alfalfa/tables/main_data.csv", index=False)


table_columns = data.dtypes
table_columns = pd.DataFrame({'name': table_columns.index, 'data_type': table_columns.values})
table_columns = table_columns.replace({
    "int64": "int",
    "float64": "float",
    "object": "str",
})

table_columns = pd.merge(table_columns, df, on="name", how="left")
table_columns.to_csv(f"./alfalfa/tables/main_info.csv", index=False)
71 changes: 71 additions & 0 deletions alfalfa/load_data.py
@@ -0,0 +1,71 @@
import pandas as pd
import hyperleda
import os
import psycopg2



ALFALFA_BIBCODE = "2018ApJ...861...49H"  # bibcode of the 2018 ALFALFA article, taken from ADS and the official website

conn = psycopg2.connect(
    host=os.getenv("HYPERLEDA_DB_HOST"),
    database=os.getenv("HYPERLEDA_DB_DATABASE"),
    user=os.getenv("HYPERLEDA_DB_USER"),
    password=os.getenv("HYPERLEDA_DB_PASSWORD"),
    port=os.getenv("HYPERLEDA_DB_PORT"),
)

client = hyperleda.HyperLedaClient(endpoint=hyperleda.TEST_ENDPOINT)


def del_nans(row):
    return {k: v for k, v in row.items() if v == v}

def leda_dtyper(row) -> str:
    return hyperleda.DataType(row["data_type"])


# getting columns info
table_columns = pd.read_csv(f"./alfalfa/tables/main_info.csv")
table_columns["data_type"] = table_columns.apply(leda_dtyper, axis=1)
table_dict = table_columns.to_dict("records")

# table creation
table_name = f"alfalfa_hi_source_catalog"

table_id = client.create_table(
    hyperleda.CreateTableRequestSchema(
        table_name=table_name,
        columns=[
            hyperleda.ColumnDescription(**del_nans(column)) for column in table_dict
        ],
        bibcode=ALFALFA_BIBCODE,
    )
)

print(f"Created table '{table_name}' with ID: {table_id}")

# reading all data from alfalfa catalog
df = pd.read_csv("./alfalfa/tables/main_data.csv")

offset = 0
batch = 500
test_limit = 1000
Member commented:

This doesn't seem necessary here, does it? There aren't that many rows.

Contributor Author replied:

If I upload the whole table at once, I will still get rejected with "request entity too large".

Contributor Author replied:

I can load at most ~3k objects at a time.

Member replied:

I'm talking about test_limit here, not the batch size :)
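
The distinction the reviewer draws between the batch size and `test_limit` can be sketched as follows. This is an illustration under assumptions, not the PR's code: the helper name `batch_bounds` and its `limit` parameter are hypothetical. Batching stays in place (because of the "request entity too large" error), while the row limit becomes optional. Note also that the loop in the diff uses `offset <= test_limit`, so with `batch = 500` and `test_limit = 1000` it sends three batches (offsets 0, 500 and 1000), i.e. 1500 rows; the sketch stops exactly at the limit.

```python
from typing import Iterator, Optional, Tuple


def batch_bounds(
    total: int, batch: int, limit: Optional[int] = None
) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) index pairs covering at most `limit` of `total` rows."""
    if batch <= 0:
        raise ValueError("batch must be positive")
    # Without a limit, every row is covered.
    end = total if limit is None else min(total, limit)
    for start in range(0, end, batch):
        # The last batch may be shorter than the others.
        yield start, min(start + batch, end)


# 5000 is a made-up row count used only for illustration.
print(list(batch_bounds(total=5000, batch=500, limit=1000)))
# [(0, 500), (500, 1000)]
print(list(batch_bounds(total=1200, batch=500)))
# [(0, 500), (500, 1000), (1000, 1200)]
```

Each pair could then drive a call such as `client.add_data(table_id, df.iloc[start:stop])`, and the running total reported in the log would be `stop` rather than `offset + batch`.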


while offset <= test_limit:
    data = df.iloc[offset:offset+batch]

    if data.empty:
        break

    print(data)
    client.add_data(table_id, data)

    print(f"Added {data.shape[0]} rows to the table {table_name}. In total {offset + batch} rows")

    offset += batch

print(f"Added all data to the table '{table_name}'")

conn.close()
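
The `del_nans` helper in `load_data.py` relies on the fact that NaN is the only floating-point value that is not equal to itself, so `v == v` drops exactly the NaN entries that pandas produces for empty CSV cells. The sketch below places it next to an equivalent explicit check; `del_nans_explicit` is a hypothetical name introduced only for comparison, and the sample row is modeled on the `SNR` line of `main_info.csv`.

```python
import math


def del_nans(row: dict) -> dict:
    # NaN != NaN, so the comparison `v == v` is False only for NaN values.
    return {k: v for k, v in row.items() if v == v}


def del_nans_explicit(row: dict) -> dict:
    # Same filter with an explicit check; non-float values are always kept.
    return {
        k: v
        for k, v in row.items()
        if not (isinstance(v, float) and math.isnan(v))
    }


column = {
    "name": "SNR",
    "data_type": "float",
    "unit": float("nan"),  # empty cell in the CSV
    "description": "Ratio of peak flux to rms noise",
}
print(del_nans(column))
print(del_nans_explicit(column))
```

Both calls return the row without the `unit` key, which is why the column descriptions can be built by keyword expansion without passing NaN values along.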

31,503 changes: 31,503 additions & 0 deletions alfalfa/tables/main_data.csv

Large diffs are not rendered by default.

20 changes: 20 additions & 0 deletions alfalfa/tables/main_info.csv
@@ -0,0 +1,20 @@
name,data_type,unit,description
AGCNr,int,,Entry number in catalog
Name,str,,Common name
RAdeg_HI,float,,
DECdeg_HI,float,,
RAdeg_OC,float,,
DECdeg_OC,float,,
Vhelio,int,km/s,Heliocentric velocity of the HI profile midpoint
W50,int,km/s,Observed velocity width at 50%% of peak on either side
sigW,int,,
W20,int,km/s,Observed velocity width at 20%% of peak on either side
HIflux,float,Jy.km/s,HI line flux density
sigflux,float,Jy.km/s,Uncertainty in HIflux
SNR,float,,Ratio of peak flux to rms noise
RMS,float,mJy,The RMS noise in the extracted spectrum at 10 km/s resolution
Dist,float,Mpc,"Adopted distance, where applicable"
sigDist,float,,
logMH,float,,
siglogMH,float,,
HIcode,int,,HI source code (2)
18 changes: 18 additions & 0 deletions alfalfa/tables/raw_info.csv
@@ -0,0 +1,18 @@
Bytes Format Units Label Explanations
1-6 I6 --- AGCNr Entry number in catalog
8-15 A8 --- Name Common name
17-31 A15 --- PosHI Position (J2000) of HI centroid (1)
33-47 A15 --- PosOC Position (J2000) of optical counterpart, where applicable (1)
49-53 I5 km/s Vhelio Heliocentric velocity of the HI profile midpoint
55-57 I3 km/s W50 Observed velocity width at 50% of peak on either side
59-61 I3 km/s sigW50 Uncertainty in W50
63-65 I3 km/s W20 Observed velocity width at 20% of peak on either side
67-72 F7.2 Jy.km/s HIflux HI line flux density
74-77 F4.2 Jy.km/s sigflux Uncertainty in HIflux
79-83 F5.1 --- SNR Ratio of peak flux to rms noise
85-89 F5.2 mJy RMS The RMS noise in the extracted spectrum at 10 km/s resolution
91-95 F5.1 Mpc Dist Adopted distance, where applicable
97-100 F4.1 Mpc sigD Uncertainty in distance, where applicable
102-106 F5.2 [solMass] logMHI HI mass in logarithmic solar units, where distance has been adopted
108-111 F4.1 [solMass] sigMHI Uncertainty in logMHI
113 I1 --- HIcode HI source code (2)
4 changes: 1 addition & 3 deletions hyperleda/get_column_info.py
Member commented:

Let's turn the paths here into click options right away as well.

Member commented:

By the way, the database parameters can also be moved there via environment variables: https://click.palletsprojects.com/en/8.1.x/arguments/#environment-variables

Contributor Author replied:

Should I read them with click, or keep it as it was: host=os.getenv("HYPERLEDA_DB_HOST") and so on?

Member replied:

click supports environment variables, so it seems you can read them with click directly and simply store them in a variable.
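
A minimal sketch of the approach discussed in this thread, assuming the database settings become click options that fall back to the `HYPERLEDA_DB_*` environment variables already used in the PR. The option names and the default port of 5432 (the usual PostgreSQL port) are assumptions, not something the PR specifies. The command is run through click's `CliRunner` with a fake environment, so no database is contacted.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option("--db-host", envvar="HYPERLEDA_DB_HOST", required=True)
@click.option("--db-port", envvar="HYPERLEDA_DB_PORT", type=int, default=5432)  # assumed default
@click.option("--db-database", envvar="HYPERLEDA_DB_DATABASE", required=True)
@click.option("--db-user", envvar="HYPERLEDA_DB_USER", required=True)
@click.option("--db-password", envvar="HYPERLEDA_DB_PASSWORD", required=True)
def main(db_host: str, db_port: int, db_database: str, db_user: str, db_password: str) -> None:
    # A real script would pass these values to psycopg2.connect(...);
    # the sketch only echoes the non-secret ones.
    click.echo(f"host={db_host} port={db_port} database={db_database} user={db_user}")


# Values here are placeholders for illustration only.
fake_env = {
    "HYPERLEDA_DB_HOST": "localhost",
    "HYPERLEDA_DB_PORT": "6432",
    "HYPERLEDA_DB_DATABASE": "hyperleda",
    "HYPERLEDA_DB_USER": "tester",
    "HYPERLEDA_DB_PASSWORD": "secret",
}
runner = CliRunner()
result = runner.invoke(main, env=fake_env)
print(result.output)
```

With this layout the values arrive as ordinary function arguments, an explicit command-line flag takes precedence over the environment variable, and a missing required setting produces a usage error instead of a `None` being passed to the connection call.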

@@ -1,5 +1,3 @@
import hyperleda
import numpy as np
import pandas as pd
import psycopg2

@@ -78,6 +76,6 @@ def ucd_fix_stat_error(row):
table_columns = pd.merge(table_columns, df, on="column_name", how="left")
table_columns.rename(columns={"column_name": "name"}, inplace=True)

-table_columns.to_csv(f"./tables/{table_name}_info.csv", index=False)
+table_columns.to_csv(f"./hyperleda/tables/{table_name}_info.csv", index=False)

conn.close()
File renamed without changes.