
Conversation

@sangram11 commented Jul 22, 2025

Previously, users could not generate timestamp data with FakeDataSource. With this change, users can generate fake records that also include timestamp data.

I used datetime together with random.randint to generate records with timestamps going back up to one year.
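
For reference, a minimal sketch of that approach (the GenerateDateTime class and random_datetime method names come from later in this thread; the exact offset ranges are assumptions, not the PR's code):

```python
import random
from datetime import datetime, timedelta


class GenerateDateTime:
    @classmethod
    def random_datetime(cls) -> datetime:
        # Subtract a random offset of up to one year from the current UTC
        # time; the exact bounds here are assumptions.
        return datetime.utcnow() - timedelta(
            days=random.randint(0, 364),
            hours=random.randint(0, 23),
            minutes=random.randint(0, 59),
            seconds=random.randint(0, 59),
            microseconds=random.randint(0, 999_999),
        )
```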

Summary by CodeRabbit

  • New Features
    • Added support for generating random timestamp values in fake data sources, allowing schemas to include timestamp fields.
  • Tests
    • Introduced tests to validate the correct handling and generation of timestamp columns in the fake data source.

@coderabbitai bot commented Jul 22, 2025

Walkthrough

The fake data source now supports generating columns of type TimestampType in addition to StringType. A new helper class generates random UTC datetimes for these fields. The schema validation and data generation logic are updated accordingly. Tests are added to validate timestamp column support and correct schema handling.
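
A hedged usage sketch of what this enables (the schema string below is illustrative, though creationDate matches the new default field mentioned later in the review):

```python
from pyspark.sql import SparkSession
from pyspark_datasources.fake import FakeDataSource

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(FakeDataSource)  # Python data source API (PySpark 4.0+)

# String fields must match a Faker method name; timestamp fields are
# filled by the random-datetime generator introduced in this PR.
df = (
    spark.read.format("fake")
    .schema("name string, creationDate timestamp")
    .load()
)
df.show(truncate=False)
```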

Changes

  • Fake Data Source Timestamp Support (pyspark_datasources/fake.py): Added support for TimestampType fields in the fake data source; introduced the GenerateDateTime class for generating random datetimes; updated schema validation and data generation logic for both batch and streaming readers.
  • Tests for Timestamp Columns (tests/test_data_sources.py): Added imports for Spark SQL types; introduced test_fake_timestamp_column to validate timestamp column support; enhanced the existing streaming test with a column count assertion.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant FakeDataSourceReader
    participant GenerateDateTime

    User->>FakeDataSourceReader: read(partition)
    loop For each field in schema
        alt field is StringType
            FakeDataSourceReader->>FakeDataSourceReader: generate string value
        else field is TimestampType
            FakeDataSourceReader->>GenerateDateTime: random_datetime()
            GenerateDateTime-->>FakeDataSourceReader: random UTC datetime
        end
    end
    FakeDataSourceReader-->>User: Return generated row(s)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

In fields of data, strings once reigned,
Now timestamps join, their power unchained.
With random dates from UTC skies,
The fake source grows both smart and wise.
Tests confirm the schema’s might—
Hopping forward, code feels right! 🐇⏱️



@coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
pyspark_datasources/fake.py (1)

175-184: Update stream reader to handle TimestampType fields.

The stream reader doesn't support TimestampType fields, creating inconsistency with the batch reader. This will cause failures when using streaming with schemas containing TimestampType fields.

             for field in self.schema.fields:
-                value = getattr(fake, field.name)()
+                value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.randomDate()
                 row.append(value)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecaa405 and bc9faa1.

📒 Files selected for processing (1)
  • pyspark_datasources/fake.py (3 hunks)
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py

29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)


🔇 Additional comments (2)
pyspark_datasources/fake.py (2)

2-3: LGTM! Imports are necessary and properly used.

The added imports support the new timestamp functionality correctly.

Also applies to: 11-11


153-153: Simplify randomDate call and add timestamp support in stream reader

The code still has two issues:

  • In FakeDataSourceReader.read, calling
    getattr(GenerateDateTime, 'randomDate')() is needlessly indirect.
  • In FakeDataSourceStreamReader.read, every field always uses the Faker provider—even for TimestampType, so timestamp columns will never get a date.

Please apply these diffs in pyspark_datasources/fake.py:

--- a/pyspark_datasources/fake.py
+++ b/pyspark_datasources/fake.py
@@ class FakeDataSourceReader(DataSourceReader):
-        for field in self.schema.fields:
-            value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'randomDate')()
+        for field in self.schema.fields:
+            if field.dataType == StringType():
+                value = getattr(fake, field.name)()
+            else:  # TimestampType
+                value = GenerateDateTime.randomDate()
             row.append(value)
@@ class FakeDataSourceStreamReader(DataSourceStreamReader):
-        for _ in range(partition.value):
-            row = []
-            for field in self.schema.fields:
-                value = getattr(fake, field.name)()
-                row.append(value)
-            yield tuple(row)
+        for _ in range(partition.value):
+            row = []
+            for field in self.schema.fields:
+                if field.dataType == StringType():
+                    value = getattr(fake, field.name)()
+                else:  # TimestampType
+                    value = GenerateDateTime.randomDate()
+                row.append(value)
+            yield tuple(row)

These changes remove unnecessary getattr on a constant string and ensure your stream reader handles TimestampType the same way as the batch reader.

@coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
pyspark_datasources/fake.py (1)

175-184: Critical: Update streaming reader to support TimestampType.

The streaming reader's read method hasn't been updated to handle TimestampType fields, creating an inconsistency with the batch reader. This will cause failures when using TimestampType fields with streaming.

Apply this diff to add TimestampType support to the streaming reader:

             for field in self.schema.fields:
-                value = getattr(fake, field.name)()
+                value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
                 row.append(value)
♻️ Duplicate comments (1)
pyspark_datasources/fake.py (1)

39-43: Update method name as discussed in previous reviews.

The implementation correctly uses microseconds parameter, but the method name should be updated from randomDate to random_datetime as agreed in previous review discussions. You'll also need to update the method call in line 153.

-    def randomDate(cls):
+    def random_datetime(cls):
🧹 Nitpick comments (1)
pyspark_datasources/fake.py (1)

153-153: Simplify method call and update method name reference.

Replace the getattr call with direct method access for better readability, and update the method name to match the planned rename.

-                value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'randomDate')()
+                value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc9faa1 and 1908e27.

📒 Files selected for processing (1)
  • pyspark_datasources/fake.py (3 hunks)
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py

29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)


🔇 Additional comments (2)
pyspark_datasources/fake.py (2)

2-3: LGTM! Import statements are correct.

The new imports are properly added to support timestamp generation functionality.

Also applies to: 11-11


24-37: LGTM! Schema validation correctly handles both supported types.

The validation logic properly:

  • Validates faker methods for StringType fields
  • Skips validation for TimestampType fields (as intended)
  • Provides accurate error messages reflecting both supported types
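
For reference, the validation described above amounts to something like the following sketch (the schema, conditionals, and message text are assumptions pieced together from the review excerpts, not the PR's exact code):

```python
from faker import Faker
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

fake = Faker()
schema = StructType([
    StructField("name", StringType()),
    StructField("creationDate", TimestampType()),
])

for field in schema.fields:
    if field.dataType == TimestampType():
        continue  # timestamps come from the custom generator, not Faker
    if field.dataType != StringType():
        raise Exception(
            f"Unsupported type {field.dataType} for field `{field.name}`; "
            "only StringType and TimestampType are supported."
        )
    try:
        getattr(fake, field.name)  # every string field needs a Faker method
    except AttributeError:
        raise Exception(
            f"Unable to find a method called `{field.name}` in faker. "
            "Please check Faker's documentation to see supported methods."
        )
```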

@coderabbitai bot left a comment

Actionable comments posted: 0

🔭 Outside diff range comments (1)
pyspark_datasources/fake.py (1)

175-184: Critical: Streaming reader missing TimestampType support.

The streaming reader only handles StringType fields via faker methods but lacks support for TimestampType fields. This creates inconsistent behavior where batch reading works with timestamps but streaming fails.

Apply this diff to add TimestampType support to the streaming reader:

            for field in self.schema.fields:
-                value = getattr(fake, field.name)()
+                value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
                row.append(value)
🧹 Nitpick comments (1)
pyspark_datasources/fake.py (1)

153-153: Simplify the constant attribute access for timestamp generation.

The logic correctly handles both data types, but the getattr call with a constant attribute name can be simplified for better readability and performance.

-                value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+                value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1908e27 and f092649.

📒 Files selected for processing (1)
  • pyspark_datasources/fake.py (3 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
pyspark_datasources/fake.py (1)

Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.

🪛 Ruff (0.12.2)
pyspark_datasources/fake.py

29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)


🔇 Additional comments (3)
pyspark_datasources/fake.py (3)

2-3: LGTM! Import statements are correctly added.

The new imports for datetime, timedelta, random, and TimestampType are necessary and properly placed to support the timestamp functionality.

Also applies to: 11-11


24-37: LGTM! Schema validation correctly handles both supported types.

The validation logic properly:

  • Validates faker methods for StringType fields
  • Skips validation for TimestampType fields (as they use the custom generator)
  • Enforces that only StringType and TimestampType are supported
  • Provides accurate error messaging

39-43: LGTM! DateTime generation implementation addresses previous feedback.

The implementation correctly:

  • Uses datetime.utcnow() for Python 3.10 compatibility (as per your preference)
  • Uses microseconds with proper range instead of the invalid milliseconds parameter
  • Has a clear method name random_datetime that accurately describes its functionality
  • Generates random timestamps within the past year as intended

@allisonwang-db (Owner) commented

@sangram11 thanks for adding the feature. Could you please also add a test case in tests/test_data_sources?

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
pyspark_datasources/fake.py (1)

182-182: Simplify the getattr call for the constant method name.

Same issue as line 153 - the getattr call with a constant string should be replaced with direct method access for better readability.

-value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
🧹 Nitpick comments (3)
pyspark_datasources/fake.py (2)

39-43: Consider simplifying the getattr call for consistency.

The GenerateDateTime class implementation looks good overall. The method correctly uses datetime.utcnow() for Python 3.10 compatibility and microseconds instead of milliseconds as addressed in previous reviews.

However, the static analysis tool flags the getattr usage in lines 153 and 182 as unnecessary since you're calling a constant method name.

Consider updating the data generation calls in lines 153 and 182 to use direct method calls:

-value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()

153-153: Simplify the getattr call for the constant method name.

While the logic is correct, the static analysis tool correctly identifies that using getattr with a constant string is unnecessary and less readable than direct method access.

-value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
tests/test_data_sources.py (1)

5-5: Consider using explicit imports instead of star import.

While the star import works, it makes it harder to track which types are being used and can lead to namespace pollution. The static analysis tool correctly flags this as a potential issue.

Consider replacing the star import with explicit imports:

-from pyspark.sql.types import *
+from pyspark.sql.types import TimestampType

This makes the dependencies clearer and addresses the static analysis warnings about undefined names.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f092649 and 2f13cef.

📒 Files selected for processing (2)
  • pyspark_datasources/fake.py (5 hunks)
  • tests/test_data_sources.py (2 hunks)
🧠 Learnings (3)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
tests/test_data_sources.py (1)

Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.

pyspark_datasources/fake.py (1)

Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.

🪛 Ruff (0.12.2)
tests/test_data_sources.py

5-5: from pyspark.sql.types import * used; unable to detect undefined names

(F403)


36-36: TimestampType may be undefined, or defined from star imports

(F405)


40-40: FakeDataSource may be undefined, or defined from star imports

(F405)


46-46: TimestampType may be undefined, or defined from star imports

(F405)

pyspark_datasources/fake.py

29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)


182-182: Do not call getattr with a constant attribute value. It is not any safer than normal property access.

Replace getattr with attribute access

(B009)


🔇 Additional comments (7)
pyspark_datasources/fake.py (5)

2-3: LGTM! Appropriate imports for timestamp functionality.

The datetime and random imports are correctly added to support the new timestamp generation capability.


11-11: LGTM! TimestampType import added appropriately.

The TimestampType import is correctly added to support the new timestamp field validation and generation.


24-27: LGTM! Schema validation logic correctly handles TimestampType.

The validation logic appropriately:

  • Tests faker methods only for StringType fields
  • Skips validation for TimestampType fields since they use custom generation
  • Maintains existing functionality for string fields

33-37: LGTM! Error handling correctly updated for both supported types.

The validation and error message have been properly updated to reflect support for both StringType and TimestampType, as addressed in previous reviews.


127-127: LGTM! Default schema appropriately updated with timestamp field.

The default schema correctly includes the new creationDate timestamp field, expanding the functionality as intended.

tests/test_data_sources.py (2)

33-36: LGTM! Comprehensive test coverage for timestamp functionality in streaming.

The test correctly validates:

  • DataFrame has the expected 5 columns (including the new timestamp field)
  • TimestampType is present in the schema
  • Integration with streaming functionality works as expected

42-42: LGTM! Consistent test coverage for timestamp functionality in batch mode.

The test appropriately mirrors the streaming test by validating:

  • DataFrame schema includes TimestampType
  • Column count matches the updated schema (5 columns)
  • Batch reading functionality works correctly with timestamps

Also applies to: 45-46

@coderabbitai bot mentioned this pull request Jul 23, 2025
… getattr with directly calling the classmethod random_datetime. Addressed all comments by coderabbitai
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/workflows/ci.yml (2)

40-46: Consolidate the two poetry install invocations
Running poetry install --no-root (dependencies only) and then another full poetry install forces two dependency-resolution passes and ignores the cache hit you just checked for. One install is enough:

-    - name: Install dependencies
-      if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
-      run: poetry install --no-interaction --no-root --extras "all"
-
-    - name: Install project
-      run: poetry install --no-interaction --extras "all"
+    - name: Install project (with deps)
+      if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
+      run: poetry install --no-interaction --extras "all"

This keeps the cache logic intact and shaves ~20-30 s per job.


52-52: Add a trailing newline
The YAML-lint warning points out the missing end-of-file newline.

         poetry run pytest tests/ --cov=pyspark_datasources --cov-report=xml --cov-report=term-missing
+

While harmless, fixing it silences the linter and keeps formatting consistent.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5dc954 and 04e0763.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (1 hunks)
🧠 Learnings (1)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml

[error] 52-52: no new line character at the end of file

(new-line-at-end-of-file)


@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
.github/workflows/ci.yml (1)

47-52: Drop the duplicate test run—execute pytest once with coverage

The workflow still runs the full test suite twice, first without and then with coverage, doubling CI time for no gain. This was pointed out earlier and remains unresolved.

-    - name: Run tests
-      run: poetry run pytest tests/ -v
-
-    - name: Run tests with coverage
-      run: |
-        poetry run pytest tests/ --cov=pyspark_datasources --cov-report=xml --cov-report=term-missing
+    - name: Run tests with coverage
+      run: |
+        poetry run pytest tests/ -v \
+          --cov=pyspark_datasources \
+          --cov-report=xml \
+          --cov-report=term-missing
🧹 Nitpick comments (2)
.github/workflows/ci.yml (2)

53-53: Clean up trailing spaces & add terminal newline to satisfy YAML-lint
Minor formatting issues flagged by YAML-lint: remove the two trailing spaces on line 53 and ensure the file ends with a newline.

-        
+
+

33-39: Improve cache resiliency with restore-keys

If the exact hash misses (e.g., after updating poetry.lock), the job performs a full reinstall. Adding a broader restore-keys fallback keeps older caches useful and speeds up CI.

       uses: actions/cache@v4
       with:
         path: .venv
         key: venv-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/poetry.lock') }}
+        restore-keys: |
+          venv-${{ runner.os }}-${{ matrix.python-version }}-
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04e0763 and 121cdab.

📒 Files selected for processing (1)
  • .github/workflows/ci.yml (1 hunks)
🧠 Learnings (1)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml

[error] 53-53: no new line character at the end of file

(new-line-at-end-of-file)


[error] 53-53: trailing spaces

(trailing-spaces)


@sangram11 (Author) commented

> @sangram11 thanks for adding the feature. Could you please also add a test case in tests/test_data_sources?

Hi @allisonwang-db,
I have added my test cases for timestamp type. Please review and approve the change.
Thank you

@sangram11 closed this Jul 24, 2025
@sangram11 reopened this Jul 24, 2025
@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
pyspark_datasources/fake.py (1)

29-32: Consider improving exception chaining.

The static analysis tool correctly identifies that the exception should include the original error context for better debugging.

        except AttributeError:
            raise Exception(
                f"Unable to find a method called `{field.name}` in faker. "
                f"Please check Faker's documentation to see supported methods."
-            )
+            ) from None
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 121cdab and b86ca07.

📒 Files selected for processing (2)
  • pyspark_datasources/fake.py (4 hunks)
  • tests/test_data_sources.py (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.695Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.

Applied to files:

  • pyspark_datasources/fake.py
  • tests/test_data_sources.py
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py

29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

tests/test_data_sources.py

46-46: FakeDataSource may be undefined, or defined from star imports

(F405)

🔇 Additional comments (7)
pyspark_datasources/fake.py (4)

2-3: LGTM: Clean import additions.

The new imports are correctly added and necessary for the timestamp functionality. The datetime and timedelta imports support the GenerateDateTime class, while TimestampType is needed for schema validation.

Also applies to: 11-11


24-37: Well-structured schema validation update.

The validation logic correctly handles both StringType and TimestampType:

  • StringType fields require a corresponding faker method
  • TimestampType fields skip faker validation (line 27)
  • Clear error messages for unsupported types

The implementation properly maintains backward compatibility while extending functionality.


39-43: Solid timestamp generation implementation.

The GenerateDateTime class is well-implemented:

  • Appropriate use of classmethod
  • Correct return type annotation
  • Generates timestamps within past year as intended
  • Uses microseconds parameter correctly (addressing past feedback)
  • Maintains datetime.utcnow() for Python 3.10 compatibility per user preference

153-153: Correct conditional logic for data generation.

Both reader classes properly handle the dual data types:

  • StringType fields use faker methods
  • TimestampType fields use the new GenerateDateTime.random_datetime()
  • Consistent implementation across batch and streaming readers

The ternary operator usage is clean and appropriate for this simple conditional logic.

Also applies to: 182-182

tests/test_data_sources.py (3)

5-5: Appropriate test imports.

The additional imports are necessary for creating explicit schemas with TimestampType fields in the tests. The imports are specific and purposeful.


33-34: Good test enhancement.

Adding the column count assertion strengthens the existing test by validating the expected schema structure. This helps ensure the default schema remains consistent.


45-56: Comprehensive timestamp column test.

The new test thoroughly validates the timestamp functionality:

  • Creates explicit schema with TimestampType field
  • Validates column names and count
  • Confirms the timestamp field has correct data type
  • Checks row generation works properly

The test covers all critical aspects of the new timestamp support and provides good regression protection.
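
A hedged reconstruction of what such a test might look like (the spark fixture, schema string, and assertions are assumptions rather than the PR's actual code):

```python
from pyspark.sql.types import StringType, StructField, StructType, TimestampType
from pyspark_datasources.fake import FakeDataSource


def test_fake_timestamp_column(spark):  # assumes a pytest `spark` session fixture
    spark.dataSource.register(FakeDataSource)
    df = (
        spark.read.format("fake")
        .schema("name string, creationDate timestamp")
        .load()
    )
    # The declared schema round-trips, including the timestamp field.
    assert df.schema == StructType([
        StructField("name", StringType()),
        StructField("creationDate", TimestampType()),
    ])
    rows = df.collect()
    assert len(rows) > 0
    assert all(row["creationDate"] is not None for row in rows)
```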

@sangram11 (Author) commented

Hi @allisonwang-db, could you please review this pull request? It has been pending for a long time.
