adding timestamp datatype along with string datatypes #16
Conversation
Walkthrough

The fake data source now supports generating columns of type TimestampType in addition to StringType.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant FakeDataSourceReader
    participant GenerateDateTime
    User->>FakeDataSourceReader: read(partition)
    loop For each field in schema
        alt field is StringType
            FakeDataSourceReader->>FakeDataSourceReader: generate string value
        else field is TimestampType
            FakeDataSourceReader->>GenerateDateTime: random_datetime()
            GenerateDateTime-->>FakeDataSourceReader: random UTC datetime
        end
    end
    FakeDataSourceReader-->>User: Return generated row(s)
```
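In code, the flow above amounts to a per-field dispatch. The following is a minimal self-contained sketch of that dispatch; the helper class here is a stand-in, not the PR's literal implementation:

```python
import random
from datetime import datetime, timedelta

from faker import Faker
from pyspark.sql.types import StringType, StructType


class GenerateDateTime:  # stand-in for the PR's helper class
    @classmethod
    def random_datetime(cls) -> datetime:
        # Random timestamp within the past year.
        return datetime.utcnow() - timedelta(seconds=random.randint(0, 365 * 86_400))


def generate_row(fake: Faker, schema: StructType) -> tuple:
    """String fields come from Faker providers; other fields get a random timestamp."""
    row = []
    for field in schema.fields:
        if field.dataType == StringType():
            row.append(getattr(fake, field.name)())  # Faker provider chosen by field name
        else:  # TimestampType
            row.append(GenerateDateTime.random_datetime())
    return tuple(row)
```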
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
Actionable comments posted: 2
🔭 Outside diff range comments (1)
pyspark_datasources/fake.py (1)
175-184: Update stream reader to handle TimestampType fields.

The stream reader doesn't support TimestampType fields, creating inconsistency with the batch reader. This will cause failures when using streaming with schemas containing TimestampType fields.

```diff
 for field in self.schema.fields:
-    value = getattr(fake, field.name)()
+    value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.randomDate()
     row.append(value)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pyspark_datasources/fake.py (3 hunks)
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
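For reference, a minimal self-contained illustration of what B904 asks for (the names here are illustrative only):

```python
mapping = {}
try:
    value = mapping["key"]
except KeyError as err:
    # Explicit chaining (or `from None`) satisfies Ruff B904, so tracebacks
    # distinguish the original error from errors raised while handling it.
    raise ValueError("missing required key") from err
```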
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
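A small generic example of the B009 pattern (using the standard library for illustration):

```python
import math

# Both calls are equivalent; Ruff B009 flags the getattr form because
# the attribute name is a constant string literal.
x = getattr(math, "sqrt")(4.0)  # flagged by B009
y = math.sqrt(4.0)              # preferred: plain attribute access
```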
🧰 Additional context used
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
🔇 Additional comments (2)
pyspark_datasources/fake.py (2)
2-3: LGTM! Imports are necessary and properly used.

The added imports support the new timestamp functionality correctly.
Also applies to: 11-11
153-153: Simplify `randomDate` call and add timestamp support in stream reader.

The code still has two issues:
- In `FakeDataSourceReader.read`, calling `getattr(GenerateDateTime, 'randomDate')()` is needlessly indirect.
- In `FakeDataSourceStreamReader.read`, every field always uses the Faker provider, even for `TimestampType`, so timestamp columns will never get a date.

Please apply these diffs in `pyspark_datasources/fake.py`:

```diff
--- a/pyspark_datasources/fake.py
+++ b/pyspark_datasources/fake.py
@@ class FakeDataSourceReader(DataSourceReader):
-        for field in self.schema.fields:
-            value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'randomDate')()
+        for field in self.schema.fields:
+            if field.dataType == StringType():
+                value = getattr(fake, field.name)()
+            else:  # TimestampType
+                value = GenerateDateTime.randomDate()
             row.append(value)
@@ class FakeDataSourceStreamReader(DataSourceStreamReader):
-        for _ in range(partition.value):
-            row = []
-            for field in self.schema.fields:
-                value = getattr(fake, field.name)()
-                row.append(value)
-            yield tuple(row)
+        for _ in range(partition.value):
+            row = []
+            for field in self.schema.fields:
+                if field.dataType == StringType():
+                    value = getattr(fake, field.name)()
+                else:  # TimestampType
+                    value = GenerateDateTime.randomDate()
+                row.append(value)
+            yield tuple(row)
```

These changes remove the unnecessary `getattr` on a constant string and ensure your stream reader handles `TimestampType` the same way as the batch reader.
Actionable comments posted: 0
🔭 Outside diff range comments (1)
pyspark_datasources/fake.py (1)
175-184: Critical: Update streaming reader to support TimestampType.

The streaming reader's `read` method hasn't been updated to handle TimestampType fields, creating an inconsistency with the batch reader. This will cause failures when using TimestampType fields with streaming. Apply this diff to add TimestampType support to the streaming reader:

```diff
 for field in self.schema.fields:
-    value = getattr(fake, field.name)()
+    value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
     row.append(value)
```
♻️ Duplicate comments (1)
pyspark_datasources/fake.py (1)
39-43: Update method name as discussed in previous reviews.

The implementation correctly uses the `microseconds` parameter, but the method name should be updated from `randomDate` to `random_datetime` as agreed in previous review discussions. You'll also need to update the method call on line 153.

```diff
-    def randomDate(cls):
+    def random_datetime(cls):
```
🧹 Nitpick comments (1)
pyspark_datasources/fake.py (1)
153-153: Simplify method call and update method name reference.

Replace the `getattr` call with direct method access for better readability, and update the method name to match the planned rename.

```diff
-        value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'randomDate')()
+        value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pyspark_datasources/fake.py (3 hunks)
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
🧰 Additional context used
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
🔇 Additional comments (2)
pyspark_datasources/fake.py (2)
2-3: LGTM! Import statements are correct.

The new imports are properly added to support timestamp generation functionality.
Also applies to: 11-11
24-37: LGTM! Schema validation correctly handles both supported types.

The validation logic properly:
- Validates faker methods for StringType fields
- Skips validation for TimestampType fields (as intended)
- Provides accurate error messages reflecting both supported types
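A hypothetical reconstruction of validation logic along these lines follows; the function name and structure are assumptions, not the PR's literal code:

```python
from faker import Faker
from pyspark.sql.types import StringType, StructType, TimestampType


def validate_schema(schema: StructType) -> None:
    fake = Faker()
    for field in schema.fields:
        if isinstance(field.dataType, TimestampType):
            continue  # timestamp fields use the custom generator; skip Faker lookup
        if not isinstance(field.dataType, StringType):
            raise Exception("Only StringType and TimestampType fields are supported.")
        try:
            getattr(fake, field.name)  # field name must match a Faker provider
        except AttributeError:
            raise Exception(
                f"Unable to find a method called `{field.name}` in faker."
            ) from None
```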
Actionable comments posted: 0
🔭 Outside diff range comments (1)
pyspark_datasources/fake.py (1)
175-184: Critical: Streaming reader missing TimestampType support.

The streaming reader only handles `StringType` fields via faker methods but lacks support for `TimestampType` fields. This creates inconsistent behavior where batch reading works with timestamps but streaming fails. Apply this diff to add TimestampType support to the streaming reader:

```diff
 for field in self.schema.fields:
-    value = getattr(fake, field.name)()
+    value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
     row.append(value)
```
🧹 Nitpick comments (1)
pyspark_datasources/fake.py (1)
153-153: Simplify the constant attribute access for timestamp generation.

The logic correctly handles both data types, but the `getattr` call with a constant attribute name can be simplified for better readability and performance.

```diff
-        value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+        value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pyspark_datasources/fake.py (3 hunks)
🧠 Learnings (2)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
pyspark_datasources/fake.py (1)
Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
pyspark_datasources/fake.py (1)
Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
🔇 Additional comments (3)
pyspark_datasources/fake.py (3)
2-3: LGTM! Import statements are correctly added.

The new imports for `datetime`, `timedelta`, `random`, and `TimestampType` are necessary and properly placed to support the timestamp functionality.

Also applies to: 11-11
24-37: LGTM! Schema validation correctly handles both supported types.

The validation logic properly:
- Validates faker methods for `StringType` fields
- Skips validation for `TimestampType` fields (as they use the custom generator)
- Enforces that only `StringType` and `TimestampType` are supported
- Provides accurate error messaging
39-43: LGTM! DateTime generation implementation addresses previous feedback.

The implementation correctly:
- Uses `datetime.utcnow()` for Python 3.10 compatibility (as per your preference)
- Uses `microseconds` with a proper range instead of the invalid `milliseconds` parameter
- Has a clear method name, `random_datetime`, that accurately describes its functionality
- Generates random timestamps within the past year as intended
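Putting those points together, the generator plausibly looks something like this sketch (the exact offset ranges are assumptions, not the PR's literal code):

```python
import random
from datetime import datetime, timedelta


class GenerateDateTime:
    @classmethod
    def random_datetime(cls) -> datetime:
        # Random offset within the past year, down to microsecond granularity.
        offset = timedelta(
            days=random.randint(0, 364),
            seconds=random.randint(0, 86_399),
            microseconds=random.randint(0, 999_999),
        )
        # datetime.utcnow() retained for Python 3.10 compatibility,
        # per the author's preference recorded in this thread.
        return datetime.utcnow() - offset
```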
@sangram11 thanks for adding the feature. Could you please also add a test case in tests/test_data_sources.py?
…hema in FakeDataSource.
Actionable comments posted: 0
♻️ Duplicate comments (1)
pyspark_datasources/fake.py (1)
182-182: Simplify the getattr call for the constant method name.

Same issue as line 153: the `getattr` call with a constant string should be replaced with direct method access for better readability.

```diff
-        value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+        value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
```
🧹 Nitpick comments (3)
pyspark_datasources/fake.py (2)
39-43: Consider simplifying the getattr call for consistency.

The GenerateDateTime class implementation looks good overall. The method correctly uses `datetime.utcnow()` for Python 3.10 compatibility and `microseconds` instead of `milliseconds`, as addressed in previous reviews. However, the static analysis tool flags the `getattr` usage on lines 153 and 182 as unnecessary since you're calling a constant method name. Consider updating the data generation calls on lines 153 and 182 to use direct method calls:

```diff
-        value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+        value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
```
153-153: Simplify the getattr call for the constant method name.

While the logic is correct, the static analysis tool correctly identifies that using `getattr` with a constant string is unnecessary and less readable than direct method access.

```diff
-        value = getattr(fake, field.name)() if field.dataType == StringType() else getattr(GenerateDateTime, 'random_datetime')()
+        value = getattr(fake, field.name)() if field.dataType == StringType() else GenerateDateTime.random_datetime()
```

tests/test_data_sources.py (1)
5-5: Consider using explicit imports instead of star import.

While the star import works, it makes it harder to track which types are being used and can lead to namespace pollution. The static analysis tool correctly flags this as a potential issue. Consider replacing the star import with explicit imports:

```diff
-from pyspark.sql.types import *
+from pyspark.sql.types import TimestampType
```

This makes the dependencies clearer and addresses the static analysis warnings about undefined names.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyspark_datasources/fake.py (5 hunks)
tests/test_data_sources.py (2 hunks)
🧠 Learnings (3)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
tests/test_data_sources.py (1)
Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
pyspark_datasources/fake.py (1)
Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 Ruff (0.12.2)
tests/test_data_sources.py
5-5: from pyspark.sql.types import * used; unable to detect undefined names
(F403)
36-36: TimestampType may be undefined, or defined from star imports
(F405)
40-40: FakeDataSource may be undefined, or defined from star imports
(F405)
46-46: TimestampType may be undefined, or defined from star imports
(F405)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
182-182: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
tests/test_data_sources.py (1)
Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
pyspark_datasources/fake.py (1)
Learnt from: sangram11
PR: #16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 Ruff (0.12.2)
tests/test_data_sources.py
5-5: from pyspark.sql.types import * used; unable to detect undefined names
(F403)
36-36: TimestampType may be undefined, or defined from star imports
(F405)
40-40: FakeDataSource may be undefined, or defined from star imports
(F405)
46-46: TimestampType may be undefined, or defined from star imports
(F405)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
153-153: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
182-182: Do not call getattr with a constant attribute value. It is not any safer than normal property access.
Replace getattr with attribute access
(B009)
🔇 Additional comments (7)
pyspark_datasources/fake.py (5)
2-3: LGTM! Appropriate imports for timestamp functionality.

The datetime and random imports are correctly added to support the new timestamp generation capability.
11-11: LGTM! TimestampType import added appropriately.

The TimestampType import is correctly added to support the new timestamp field validation and generation.
24-27: LGTM! Schema validation logic correctly handles TimestampType.

The validation logic appropriately:
- Tests faker methods only for StringType fields
- Skips validation for TimestampType fields since they use custom generation
- Maintains existing functionality for string fields
33-37: LGTM! Error handling correctly updated for both supported types.

The validation and error message have been properly updated to reflect support for both StringType and TimestampType, as addressed in previous reviews.
127-127: LGTM! Default schema appropriately updated with timestamp field.

The default schema correctly includes the new `creationDate timestamp` field, expanding the functionality as intended.

tests/test_data_sources.py (2)
33-36: LGTM! Comprehensive test coverage for timestamp functionality in streaming.

The test correctly validates:
- DataFrame has the expected 5 columns (including the new timestamp field)
- TimestampType is present in the schema
- Integration with streaming functionality works as expected
42-42: LGTM! Consistent test coverage for timestamp functionality in batch mode.

The test appropriately mirrors the streaming test by validating:
- DataFrame schema includes TimestampType
- Column count matches the updated schema (5 columns)
- Batch reading functionality works correctly with timestamps
Also applies to: 45-46
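For illustration, batch usage with the expanded default schema might look like the following sketch; it assumes an active `spark` session and uses the registration API for Python data sources:

```python
from pyspark_datasources.fake import FakeDataSource

# Register the data source, then read with the expanded default schema
# (now 5 columns, including the creationDate timestamp field).
spark.dataSource.register(FakeDataSource)
df = spark.read.format("fake").load()
df.printSchema()
df.show(3, truncate=False)
```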
… getattr with directly calling the classmethod random_datetime. Addressed all comments by coderabbitai
Actionable comments posted: 1
🧹 Nitpick comments (2)
.github/workflows/ci.yml (2)
40-46: Consolidate the two poetry install invocations.

Running `poetry install --no-root` (dependencies only) and then another full `poetry install` forces two dependency-resolution passes and ignores the cache hit you just checked for. One install is enough:

```diff
-      - name: Install dependencies
-        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
-        run: poetry install --no-interaction --no-root --extras "all"
-
-      - name: Install project
-        run: poetry install --no-interaction --extras "all"
+      - name: Install project (with deps)
+        if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
+        run: poetry install --no-interaction --extras "all"
```

This keeps the cache logic intact and shaves ~20-30 s per job.
52-52: Add a trailing newline.

The YAML-lint warning points out the missing end-of-file newline.

```diff
-          poetry run pytest tests/ --cov=pyspark_datasources --cov-report=xml --cov-report=term-missing
+          poetry run pytest tests/ --cov=pyspark_datasources --cov-report=xml --cov-report=term-missing
+
```

While harmless, fixing it silences the linter and keeps formatting consistent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/ci.yml (1 hunks)
🧠 Learnings (1)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml
[error] 52-52: no new line character at the end of file
(new-line-at-end-of-file)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml
[error] 52-52: no new line character at the end of file
(new-line-at-end-of-file)
Actionable comments posted: 0
♻️ Duplicate comments (1)
.github/workflows/ci.yml (1)
47-52: Drop the duplicate test run and execute pytest once with coverage.

The workflow still runs the full test suite twice, first without and then with coverage, doubling CI time for no gain. This was pointed out earlier and remains unresolved.

```diff
-      - name: Run tests
-        run: poetry run pytest tests/ -v
-
-      - name: Run tests with coverage
-        run: |
-          poetry run pytest tests/ --cov=pyspark_datasources --cov-report=xml --cov-report=term-missing
+      - name: Run tests with coverage
+        run: |
+          poetry run pytest tests/ -v \
+            --cov=pyspark_datasources \
+            --cov-report=xml \
+            --cov-report=term-missing
```
🧹 Nitpick comments (2)
.github/workflows/ci.yml (2)
53-53: Clean up trailing spaces and add a terminal newline to satisfy YAML-lint.

Minor formatting issues flagged by YAML-lint: remove the two trailing spaces on line 53 and ensure the file ends with a newline.
33-39: Improve cache resiliency with restore-keys.

If the exact hash misses (e.g., after updating `poetry.lock`), the job performs a full reinstall. Adding a broader `restore-keys` fallback keeps older caches useful and speeds up CI.

```diff
       uses: actions/cache@v4
       with:
         path: .venv
         key: venv-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('**/poetry.lock') }}
+        restore-keys: |
+          venv-${{ runner.os }}-${{ matrix.python-version }}-
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/ci.yml (1 hunks)
🧠 Learnings (1)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml
[error] 53-53: no new line character at the end of file
(new-line-at-end-of-file)
[error] 53-53: trailing spaces
(trailing-spaces)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.651Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
🪛 YAMLlint (1.37.1)
.github/workflows/ci.yml
[error] 53-53: no new line character at the end of file
(new-line-at-end-of-file)
[error] 53-53: trailing spaces
(trailing-spaces)
Hi @allisonwang-db,
…by adding test_fake_timestamp_column
Actionable comments posted: 0
🧹 Nitpick comments (1)
pyspark_datasources/fake.py (1)
29-32: Consider improving exception chaining.

The static analysis tool correctly identifies that the exception should include the original error context for better debugging.

```diff
         except AttributeError:
             raise Exception(
                 f"Unable to find a method called `{field.name}` in faker. "
                 f"Please check Faker's documentation to see supported methods."
-            )
+            ) from None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
pyspark_datasources/fake.py (4 hunks)
tests/test_data_sources.py (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.695Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
📚 Learning: in pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(t...
Learnt from: sangram11
PR: allisonwang-db/pyspark-data-sources#16
File: pyspark_datasources/fake.py:39-43
Timestamp: 2025-07-22T14:00:29.695Z
Learning: In pyspark_datasources/fake.py, the user prefers to keep datetime.utcnow() instead of datetime.now(timezone.utc) for Python 3.10 compatibility concerns in the GenerateDateTime class.
Applied to files:
- pyspark_datasources/fake.py
- tests/test_data_sources.py
🪛 Ruff (0.12.2)
pyspark_datasources/fake.py
29-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
tests/test_data_sources.py
46-46: FakeDataSource may be undefined, or defined from star imports
(F405)
🔇 Additional comments (7)
pyspark_datasources/fake.py (4)
2-3: LGTM: Clean import additions.

The new imports are correctly added and necessary for the timestamp functionality. The datetime and timedelta imports support the GenerateDateTime class, while TimestampType is needed for schema validation.

Also applies to: 11-11
24-37: Well-structured schema validation update.

The validation logic correctly handles both StringType and TimestampType:
- StringType fields require a corresponding faker method
- TimestampType fields skip faker validation (line 27)
- Clear error messages for unsupported types
The implementation properly maintains backward compatibility while extending functionality.
39-43: Solid timestamp generation implementation.

The GenerateDateTime class is well-implemented:
- Appropriate use of classmethod
- Correct return type annotation
- Generates timestamps within past year as intended
- Uses microseconds parameter correctly (addressing past feedback)
- Maintains datetime.utcnow() for Python 3.10 compatibility per user preference
153-153: Correct conditional logic for data generation.

Both reader classes properly handle the dual data types:
- StringType fields use faker methods
- TimestampType fields use the new GenerateDateTime.random_datetime()
- Consistent implementation across batch and streaming readers
The ternary operator usage is clean and appropriate for this simple conditional logic.
Also applies to: 182-182
tests/test_data_sources.py (3)
5-5: Appropriate test imports.

The additional imports are necessary for creating explicit schemas with TimestampType fields in the tests. The imports are specific and purposeful.
33-34: Good test enhancement.

Adding the column count assertion strengthens the existing test by validating the expected schema structure. This helps ensure the default schema remains consistent.
45-56: Comprehensive timestamp column test.

The new test thoroughly validates the timestamp functionality:
- Creates explicit schema with TimestampType field
- Validates column names and count
- Confirms the timestamp field has correct data type
- Checks row generation works properly
The test covers all critical aspects of the new timestamp support and provides good regression protection.
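A test along the lines described might look like this sketch; the assertions mirror the points above, but the schema contents and the `spark` fixture are assumptions, not the PR's literal test:

```python
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

from pyspark_datasources.fake import FakeDataSource


def test_fake_timestamp_column(spark):
    spark.dataSource.register(FakeDataSource)
    schema = StructType([
        StructField("name", StringType()),
        StructField("creationDate", TimestampType()),
    ])
    df = spark.read.format("fake").schema(schema).load()
    # Column names and count match the explicit schema.
    assert df.columns == ["name", "creationDate"]
    # The timestamp field carries the correct data type.
    assert isinstance(df.schema["creationDate"].dataType, TimestampType)
    # Rows are actually generated.
    assert df.count() > 0
```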
Hi @allisonwang-db: Can you please review this pull request? It has been pending for a long time.
Previously, users were not able to generate timestamp data using FakeDataSource. With this change, users can also generate fake records containing timestamp data.
I have used datetime combined with random.randint to generate random timestamps going back up to one year.