Append loses data: by default it removes duplicated indices. #65

Open
eromoe opened this issue Dec 10, 2022 · 1 comment

Comments

eromoe commented Dec 10, 2022

I have a lot of CSV files that I need to import into the store,
but the dataset does not grow.

In my testing, after several appends the item only holds as many rows as the DataFrame with the largest index.

For example,

  • df1 has index np.arange(10) plus 4 other columns
  • df2 has index np.arange(12) plus 4 other columns
  • df3 has index np.arange(11) plus 4 other columns

There are no duplicate rows across df1, df2 and df3; only some index values repeat.

item.write(df1)
item.append(df2)
item.append(df3)

In the end, the item is the same size as df2.

After some digging, I found that pystore filters the incoming data with data = data[~data.index.isin(old_index)], so only rows with a new index value are inserted!
I think this is a bad assumption; users would not know about it unless they read the code.

def append(...):
    ...
    try:
        if epochdate or ("datetime" in str(data.index.dtype) and
                         any(data.index.nanosecond) > 0):
            data = utils.datetime_to_int64(data)
        old_index = dd.read_parquet(self._item_path(item, as_string=True),
                                    columns=[], engine=self.engine
                                    ).index.compute()
        # incoming rows whose index value is already stored are dropped here
        data = data[~data.index.isin(old_index)]
    except Exception:
        # any error makes append return without writing or reporting anything
        return
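
The effect of that filter can be reproduced with plain pandas, without a store. This is only a sketch of the scenario described above: make_frame and filtered_append are hypothetical helpers written for this illustration, and the column values are random.

```python
import numpy as np
import pandas as pd


def make_frame(n, seed):
    """Hypothetical helper: n rows, index 0..n-1, 4 random columns."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.random((n, 4)), index=np.arange(n), columns=list("abcd"))


def filtered_append(stored, data):
    """Hypothetical stand-in for append: applies the same index filter as the excerpt."""
    data = data[~data.index.isin(stored.index)]
    return pd.concat([stored, data])


df1, df2, df3 = make_frame(10, 1), make_frame(12, 2), make_frame(11, 3)

stored = df1
stored = filtered_append(stored, df2)  # only index values 10 and 11 are new
stored = filtered_append(stored, df3)  # every index value is already stored

print(len(stored))  # 12 rows instead of 10 + 12 + 11 = 33
```

Note that the result only matches df2 in size: rows 0 to 9 still come from df1, and nothing from df3 is kept.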

Append should never remove rows by default, only when the user asks for it; that is the plain meaning of append.
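
A minimal sketch of the behaviour requested here, written against plain pandas rather than pystore: append_rows and its drop_existing_index flag are hypothetical names, not part of any existing API. All rows are kept unless the caller opts in to the index filter.

```python
import pandas as pd


def append_rows(stored, data, drop_existing_index=False):
    """Hypothetical helper: keep every incoming row unless the caller opts out."""
    if drop_existing_index:
        # opt-in: skip incoming rows whose index value is already stored
        data = data[~data.index.isin(stored.index)]
    return pd.concat([stored, data])


stored = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
new = pd.DataFrame({"x": [10, 20]}, index=[2, 3])

kept_all = append_rows(stored, new)                            # index 2 appears twice
deduped = append_rows(stored, new, drop_existing_index=True)   # index 2 kept once

print(len(kept_all), len(deduped))  # 5 4
```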

eromoe changed the title from "Append lose data by default concern on index value is a problem." to "Append loses data: by default it removes duplicated indices." on Dec 10, 2022

gnzsnz (Contributor) commented Aug 16, 2023

Try using a time series: you need a DataFrame whose index is a date or datetime.

With a time-series index, append works just fine.

For example, today I download the SPY ticker price history and store it.

Tomorrow I download the SPY price history again, and append just adds the new day.
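
The time-series case can be sketched with plain pandas, assuming append drops incoming rows whose index value is already stored (the filter quoted earlier in this issue). filtered_append is a hypothetical stand-in for that step, and the dates and prices are made up.

```python
import pandas as pd


def filtered_append(stored, data):
    """Hypothetical stand-in for append: skip rows whose index is already stored."""
    data = data[~data.index.isin(stored.index)]
    return pd.concat([stored, data])


# Made-up daily closes; "tomorrow" re-downloads the same history plus one new day.
today = pd.DataFrame(
    {"close": [100.0, 101.0, 102.0]},
    index=pd.date_range("2023-08-14", periods=3, freq="D"),
)
tomorrow = pd.DataFrame(
    {"close": [100.0, 101.0, 102.0, 103.0]},
    index=pd.date_range("2023-08-14", periods=4, freq="D"),
)

stored = filtered_append(today, tomorrow)

print(len(stored))             # 4: only the new day was added
print(stored.index.is_unique)  # True
```

With a date index, a repeated index value means the same observation was downloaded again, so dropping it loses nothing. With a plain integer index such as np.arange(n), repeated index values can belong to unrelated rows, which is the case reported above.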
