
[BUG] Casting a Date to String results in a result different from Spark CPU #88

Open
razajafri opened this issue Jun 2, 2020 · 2 comments
Labels
bug Something isn't working P2 Not required for release SQL part of the SQL/Dataframe plugin

Comments

@razajafri
Collaborator

Describe the bug
Casting a Date to a String on the GPU doesn't always match the CPU result.

Steps/Code to reproduce bug

```scala
import java.sql.Date
import session.sqlContext.implicits._

val df = Seq(
  Date.valueOf("9999-12-31"),
  Date.valueOf("0211-1-1"),
  Date.valueOf("1900-2-2"),
  Date.valueOf("1989-3-3"),
  Date.valueOf("2010-4-4"),
  Date.valueOf("2020-5-5"),
  Date.valueOf("2050-10-30")
).toDF("dates")

df.selectExpr("string(date_add(dates, 1))").show(false)
```

The above code results in

[0000-01-01], [0211-01-02], [1900-02-03], [1989-03-04], [2010-04-05], [2020-05-06], [2050-10-31]

Expected behavior
It should result in the following

[+10000-01-01], [0211-01-02], [1900-02-03], [1989-03-04], [2010-04-05], [2020-05-06], [2050-10-31]
@razajafri razajafri added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jun 2, 2020
@revans2
Collaborator

revans2 commented Jun 2, 2020

So it looks like the issue is really only in printing a year with more than 4 digits. Do we have a cudf issue filed for this? Is Spark actually inserting a + in front of the 10000 for the year?

@revans2 revans2 added the SQL part of the SQL/Dataframe plugin label Jun 2, 2020
@revans2 revans2 removed the ? - Needs Triage Need team to review and classify label Jun 2, 2020
@razajafri
Collaborator Author

razajafri commented Jun 10, 2020

Yes, this is how it works in Spark, I think.

0001-01-01 00:00:00 is the first day of year 1.
9999-12-31 23:59:59 is the last second of year 9999 and of the first 10,000-year range.

After that range ends, Spark prepends a + to the year for every subsequent 10,000-year range:

+10000-01-01 00:00:00 is the first day of the next range
+19999-12-31 23:59:59 is its last second

This goes up until +294247-01-09 20:00:54, after which the Long (microseconds since the epoch) will overflow.

So to do this in Spark we will have to know the epoch boundaries of these ranges, check the column for any values greater than a boundary, and keep incrementing the prefix until we don't find any more values greater than the next boundary.
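The bucket arithmetic described above can be sketched outside of Spark. A minimal illustration (my own, not the plugin's code), assuming the epochs are in seconds, that the "+1" range starts at the 253402329600 boundary from the table below, and that every positive range spans exactly 10,000 Gregorian years (3,652,425 days); the negative rows of the table are not perfectly uniform, so this only handles the "+N" side:

```python
# Sketch of the boundary/bucket lookup described above (illustrative only,
# not the plugin's implementation).
SPAN_SECONDS = 3_652_425 * 86_400    # 10,000 Gregorian years in seconds
PLUS_ONE_START = 253_402_329_600     # start of the "+1" range (table below)

def plus_prefix(epoch_seconds: int) -> str:
    """Return the "+N" year prefix for an epoch at or after the "+1" boundary."""
    if epoch_seconds < PLUS_ONE_START:
        raise ValueError("only the positive ranges are uniform; see the table")
    return f"+{(epoch_seconds - PLUS_ONE_START) // SPAN_SECONDS + 1}"

# The overflow point: microseconds since the epoch stored in a signed 64-bit
# Long, converted to whole seconds. Dividing by an average Gregorian year
# (365.2425 days = 31,556,952 s) lands in roughly year 294247, matching the
# +294247 date quoted above.
MAX_EPOCH_SECONDS = (2**63 - 1) // 1_000_000   # 9223372036854
```

For example, `plus_prefix(253402329600)` returns `"+1"` and `plus_prefix(9089348889600)` returns `"+29"`, agreeing with the first and last positive rows of the table.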

Here are the epoch boundaries (in seconds) for each 10,000-year range, from -290308 up to +294247:

| start epoch (s) | end epoch (s) | year prefix |
| --- | --- | --- |
| -9223372036854 | -9213651648423 | -29 |
| -9213651648422 | -8898082128423 | -28 |
| -8898082128422 | -8582512608423 | -27 |
| -8582512608422 | -8266943088423 | -26 |
| -8266943088422 | -7951373568423 | -25 |
| -7951373568422 | -7635804048423 | -24 |
| -7635804048422 | -7320234528423 | -23 |
| -7320234528422 | -7004665008423 | -22 |
| -7004665008422 | -6689095488423 | -21 |
| -6689095488422 | -6373525968423 | -20 |
| -6373525968422 | -6057956448423 | -19 |
| -6057956448422 | -5742386928423 | -18 |
| -5742386928422 | -5426817408423 | -17 |
| -5426817408422 | -5111247888423 | -16 |
| -5111247888422 | -4795678368423 | -15 |
| -4795678368422 | -4480108848423 | -14 |
| -4480108848422 | -4164539328423 | -13 |
| -4164539328422 | -3848969808423 | -12 |
| -3848969808422 | -3533400288423 | -11 |
| -3533400288422 | -3217830768423 | -10 |
| -3217830768422 | -2902261248423 | -9 |
| -2902261248422 | -2586691728423 | -8 |
| -2586691728422 | -2271122208423 | -7 |
| -2271122208422 | -1955552688423 | -6 |
| -1955552688422 | -1639983168423 | -5 |
| -1639983168422 | -1324413648423 | -4 |
| -1324413648422 | -1008844128423 | -3 |
| -1008844128422 | -693274608423 | -2 |
| -693274608422 | -377705088423 | -1 |
| -377705088422 | -62167190823 | - |
| -62167190822 | 253402329599 | na |
| 253402329600 | 568971849599 | +1 |
| 568971849600 | 884541369599 | +2 |
| 884541369600 | 1200110889599 | +3 |
| 1200110889600 | 1515680409599 | +4 |
| 1515680409600 | 1831249929599 | +5 |
| 1831249929600 | 2146819449599 | +6 |
| 2146819449600 | 2462388969599 | +7 |
| 2462388969600 | 2777958489599 | +8 |
| 2777958489600 | 3093528009599 | +9 |
| 3093528009600 | 3409097529599 | +10 |
| 3409097529600 | 3724667049599 | +11 |
| 3724667049600 | 4040236569599 | +12 |
| 4040236569600 | 4355806089599 | +13 |
| 4355806089600 | 4671375609599 | +14 |
| 4671375609600 | 4986945129599 | +15 |
| 4986945129600 | 5302514649599 | +16 |
| 5302514649600 | 5618084169599 | +17 |
| 5618084169600 | 5933653689599 | +18 |
| 5933653689600 | 6249223209599 | +19 |
| 6249223209600 | 6564792729599 | +20 |
| 6564792729600 | 6880362249599 | +21 |
| 6880362249600 | 7195931769599 | +22 |
| 7195931769600 | 7511501289599 | +23 |
| 7511501289600 | 7827070809599 | +24 |
| 7827070809600 | 8142640329599 | +25 |
| 8142640329600 | 8458209849599 | +26 |
| 8458209849600 | 8773779369599 | +27 |
| 8773779369600 | 9089348889599 | +28 |
| 9089348889600 | 9223372036854 | +29 |
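Two sanity checks on the table (my own verification, not from the issue): the positive range starts are spaced exactly 10,000 Gregorian years apart, and the end of the "na" range sits 8 hours after the well-known UTC epoch of 9999-12-31 23:59:59, which suggests the table was generated in a UTC-8 local time zone rather than UTC:

```python
from datetime import datetime, timezone

# Consecutive positive range starts from the table are 10,000 years apart.
plus_starts = [253402329600, 568971849600, 884541369600]
span = plus_starts[1] - plus_starts[0]
assert span == 3_652_425 * 86_400            # 10,000 * 365.2425 days
assert plus_starts[2] - plus_starts[1] == span

# End of the "na" range vs. the UTC epoch of 9999-12-31 23:59:59.
utc_end = int(datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp())
offset = 253402329599 - utc_end              # 28800 s = 8 h -> likely UTC-8
```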

@sameerz sameerz added the P2 Not required for release label Aug 25, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Signed-off-by: spark-rapids automation <[email protected]>