TensorFlow: Refactor section to dedicated page #307

amotl · 2025-09-15T21:42:01Z

About

Continue adding tutorials from the community forum.

Preview

https://cratedb-guide--307.org.readthedocs.build/integrate/tensorflow/

References

Improve pages about machine learning topics #313

/cc @seut

coderabbitai · 2025-09-15T21:42:08Z

Walkthrough

Adds a TensorFlow integration overview and tutorial under docs/integrate/tensorflow, inserts the new entry into the integrate toctree, introduces a tensorflow-tutorial Sphinx anchor and several editorial/code-snippet adjustments in the tutorial, and replaces embedded TensorFlow content in the ML topic with a seealso reference to the new page.

Changes

Cohort / File(s)	Summary
Integrations ToC update `docs/integrate/index.md`	Inserts `tensorflow/index` into the integrations toctree after `telegraf/index`.
New TensorFlow integration page `docs/integrate/tensorflow/index.md`	Adds an integration overview page with hero/logo block, About and Learn rubrics, a 2-column grid containing a grid-item-card linking to the tutorial (hidden toctree entry), and an external link to TensorFlow.org.
Tutorial content & anchor `docs/integrate/tensorflow/tutorial.rst`	Adds Sphinx anchor `.. _tensorflow-tutorial:`, restructures headings (Introduction, Setup, Usage), renames/cases headings, changes code-block language for AWS creds to ini and path to `~/.aws/credentials`, minor code edits (append `.ravel()` to label binarize result, change test_split→train_percent and split calc, fix import alias `tf`), and other editorial refinements.
ML topic index refactor `docs/topic/ml/index.md`	Removes inlined TensorFlow content and hidden toctree entry; replaces with a `seealso` reference pointing to the new TensorFlow integration page and updates related links.

Sequence Diagram(s)

(omitted — documentation additions and editorial/code-snippet adjustments; no runtime/control-flow modifications warranting a sequence diagram)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Layout NG: Folder structure and naming things #236 — edits integrations toctree/navigation, related to adding/reordering integration entries.
ETL/CDC: Guidance, Layout / Dissolve walls of links #239 — modifies docs/integrate/index.md by adding/reordering integration entries (overlap with ToC update).

Suggested reviewers

hammerhead
kneth

Poem

I nibble at docs, hop through each line,
A TensorFlow burrow — tidy and fine.
Anchors and cards, a tutorial trail,
I polish the meadow and leave a soft trail. 🐇

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title accurately and concisely summarizes the primary change: refactoring the TensorFlow documentation into a dedicated page, which matches the added docs/integrate/tensorflow/index.md, toctree updates, and related tutorial files in the changeset. It is specific, on-topic, and clear for a teammate scanning history.
Description Check	✅ Passed	The PR description is on-topic: it states the intent to add community tutorials, provides a preview link, and references the related issue GH-313, all of which directly relate to the documentation changes in this PR. Although brief, it meets the lenient requirement of being related to the changeset.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.

✨ Finishing touches

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch tensorflow

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c58be5 and 74387ff.

📒 Files selected for processing (2)

docs/integrate/tensorflow/tutorial.rst (11 hunks)
docs/topic/ml/index.md (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (2)

docs/integrate/tensorflow/tutorial.rst
docs/topic/ml/index.md

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build docs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (9)

docs/integrate/tensorflow/tutorial.rst (9)
192-199: Bug: wrong pandas import module (pandas.io.aql).

Use pandas.io.sql; current import will fail at runtime.

Apply:
-import pandas.io.aql as sqlio
+import pandas.io.sql as sqlio
 from psycopg2 import connect
306-310: Fix TensorFlow import alias and avoid private API.

Import TensorFlow as tf (you use tf.* below).

Import EarlyStopping from public tensorflow.keras.callbacks.
-import tensorflow as f
+import tensorflow as tf
 import matplotlib.pyplot as plt
-from tensorflow.python.keras.callbacks import EarlyStopping
+from tensorflow.keras.callbacks import EarlyStopping
 from tensorflow.keras import layers
318-321: Classification head should use sigmoid.

Final layer lacks activation; downstream uses 0.5 threshold -> binary classification.
-mlp_model.add(layers.Dense(1))
+mlp_model.add(layers.Dense(1, activation="sigmoid"))
325-328: Loss/metric mismatch for binary classification.

Use binary_crossentropy (not MSE) with accuracy.
-mlp_model.compile(optimizer="adam",
-                  loss=tf.keras.losses.MeanSquaredError(),
-                  metrics=["accuracy"])
+mlp_model.compile(optimizer="adam",
+                  loss="binary_crossentropy",
+                  metrics=["accuracy"])
459-467: Filename mismatch: add extensions when downloading.

Earlier you saved model as .h5 and scaler as .joblib; download uses names without extensions.
-# These name should be the same as you used for saving the model, including the file extension
-model_name = "some-model-name"
-scaler_name = "some-scaler-name"
+# Use the same names (including extensions) used when saving
+model_name = "some-model-name.h5"
+scaler_name = "some-scaler-name.joblib"
471-477: Repeat pandas import typo and missing numpy import.

Fix pandas import and add numpy for later array conversion.
-    import pandas.io.aql as sqlio
+    import pandas.io.sql as sqlio
     import tensorflow as tf
+    import numpy as np
     from joblib import load
     from psycopg2 import connect
497-503: Bug: referencing df_train in prediction phase.

Use the current dataset length to bound the loop.
-        if row + timestep < len(df_train.index):
+        if row + timestep <= len(norm_data_cols):
             data_window = norm_data_cols[row:row + timestep]
             data_vector = data_window.flatten()
             x_pred.append(data_vector)
             pred_saved_step = row + timestep
         else:
             break
505-508: Shape/typing errors when predicting.

Convert list to array; don’t reshape to a single giant row; flatten result properly.
-    predictions = model.predict(x_pred.reshape(1, -1))
-    predictions.flatten()
+    x_pred = np.asarray(x_pred, dtype=float)
+    predictions = model.predict(x_pred).flatten()
523-533: Use a timezone-aware datetime (replace epoch float).

Pass a timezone-aware datetime (UTC) instead of time.time(); consider batching to reduce DB round-trips.

File: docs/integrate/tensorflow/tutorial.rst (lines 523-533)
-    import time
+    from datetime import datetime, timezone
@@
-                cur.execute(query, (float(entry), time.time(), model_name))
+                cur.execute(
+                    query,
+                    (float(entry), datetime.now(timezone.utc), model_name),
+                )

🧹 Nitpick comments (5)

docs/integrate/tensorflow/tutorial.rst (4)
215-219: Label binarize returns 2D array; flatten to 1D.

Avoid assigning a 2D array into a single column.
-df["machine_status"] = label_binarize(df["machine_status"], classes=["BROKEN", "NORMAL"])
+df["machine_status"] = label_binarize(
+    df["machine_status"], classes=["BROKEN", "NORMAL"]
+).ravel()
223-227: Variable naming: test_split actually sets the train share.

Minor clarity tweak.
-test_split = 70
-df_size = len(df.index)
-split = int((df_size / 100) * test_split)
+train_percent = 70
+split = int((len(df.index) * train_percent) / 100)
99-156: Schema: store timestamps as TIMESTAMP, not TEXT.

Column "timestamp" should be TIMESTAMP [WITH TIME ZONE] for correct typing and functions.
-   "timestamp" TEXT,
+   "timestamp" TIMESTAMP WITH TIME ZONE,
61-84: AWS credentials path note.

You point to ~/.aws/config; typically credentials live in ~/.aws/credentials with profiles configured in ~/.aws/config. SDK will search both; consider mentioning ~/.aws/credentials for clarity.
docs/integrate/tensorflow/index.md (1)

4-8: Avoid hotlinking a hashed gstatic logo.

That path may change; prefer a vendored asset or an official stable logo URL.

I can add the SVG under docs/_assets and switch the src if you want.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 427d701 and addcdff.

📒 Files selected for processing (4)

docs/integrate/index.md (1 hunks)
docs/integrate/tensorflow/index.md (1 hunks)
docs/integrate/tensorflow/tutorial.rst (1 hunks)
docs/topic/ml/index.md (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build docs

🔇 Additional comments (6)

docs/integrate/tensorflow/tutorial.rst (1)

3-3: Anchor addition looks good.

New target tensorflow-tutorial matches the card link in the integration page.

docs/integrate/tensorflow/index.md (3)

1-3: Label and title LGTM.

Provides a stable {ref}tensorflow target used elsewhere.

20-27: Card-to-tutorial link resolves.

:link-type: ref to tensorflow-tutorial matches tutorial.rst anchor.

32-36: Hidden toctree entry is correct.

Ensures sidebar shows the tutorial under the TensorFlow section.

docs/topic/ml/index.md (1)

268-270: Good refactor — approve replacing inline section with cross‑ref.
Cross‑ref target (tensorflow)= exists in docs/integrate/tensorflow/index.md (1 match); no broken refs found.

docs/integrate/index.md (1)

71-72: Navigation entry added correctly — verified.
Single exact match found at docs/integrate/index.md:71 (tensorflow/index).

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)

docs/integrate/tensorflow/tutorial.rst (8)

193-202: Fix pandas import: pandas.io.aql is invalid.

Use pandas.io.sql or top‑level pandas.read_sql_query.

Apply this diff:

-.. code-block:: python
-
-    import pandas.io.aql as sqlio
-    from psycopg2 import connect
+.. code-block:: python
+
+    import pandas.io.sql as sqlio
+    from psycopg2 import connect

308-317: TensorFlow alias mismatch and internal import.

You import tensorflow as f but use tf later → NameError.
Avoid tensorflow.python.* internals; import callbacks from the public API.

Apply this diff:

-    import tensorflow as f
+    import tensorflow as tf
     import matplotlib.pyplot as plt
-    from tensorflow.python.keras.callbacks import EarlyStopping
-    from tensorflow.keras import layers
+    from tensorflow.keras.callbacks import EarlyStopping
+    from tensorflow.keras import layers
@@
-    mlp_model = tf.keras.Sequential()
+    mlp_model = tf.keras.Sequential()

Also applies to: 324-336

318-330: Use sigmoid + binary cross‑entropy for binary classification.

A linear output with MSE is suboptimal for a 0/1 target. Prefer a sigmoid output and binary_crossentropy.

Apply this diff:

-    mlp_model.add(layers.Dense(1))
+    mlp_model.add(layers.Dense(1, activation="sigmoid"))
@@
-    mlp_model.compile(optimizer="adam",
-                      loss=tf.keras.losses.MeanSquaredError(),
-                      metrics=["accuracy"])
+    mlp_model.compile(
+        optimizer="adam",
+        loss="binary_crossentropy",
+        metrics=["accuracy"]
+    )

421-431: Filename/extension mismatch between save and load.

You save model_name = "some-model-name" + ".h5" and scaler_name = "some-scaler-name" + ".joblib", but later download without extensions. This will 404.

Apply this diff to keep names consistent and prefer the modern .keras format:

-    model_name = "some-model-name" + ".h5"
+    model_name = "some-model-name.keras"
@@
-    scaler_name = "some-scaler-name" + ".joblib"
+    scaler_name = "some-scaler-name.joblib"

And here:

-    model_name = "some-model-name"
-    scaler_name = "some-scaler-name"
+    model_name = "some-model-name.keras"
+    scaler_name = "some-scaler-name.joblib"

Also applies to: 461-469

473-484: Fix pandas import (again) and ensure conn_str is documented.

Repeat of the invalid pandas.io.aql. Also, either define conn_str or add a short note showing its format.

Apply this diff:

-    import pandas.io.aql as sqlio
+    import pandas.io.sql as sqlio
@@
-    with connect(conn_str) as conn:
+    with connect(conn_str) as conn:  # e.g., "postgresql://crate@localhost:5432/doc"

494-509: Inference window generation references df_train and mis-shapes input.

df_train is undefined in this script.
Build x_pred as a 2‑D array of shape (num_windows, input_len).
Ensure timestep is defined in this file.

Apply this diff:

-    # Create the x_pred data set, containing the data of the specified time window as a vector
-    x_pred = []
-    pred_saved_step = 0
-    for i in norm_data_cols.index:
-        row = pred_saved_step
-        if row + timestep < len(df_train.index):
-            data_window = norm_data_cols[row:row + timestep]
-            data_vector = data_window.flatten()
-            x_pred.append(data_vector)
-            pred_saved_step = row + timestep
-        else:
-            break
+    # Define timestep if not already defined in this script
+    timestep = 60
+    # Build prediction windows
+    x_pred = []
+    pred_saved_step = 0
+    n = len(norm_data_cols.index)
+    while pred_saved_step + timestep <= n:
+        row = pred_saved_step
+        data_window = norm_data_cols[row:row + timestep]
+        data_vector = data_window.flatten()
+        x_pred.append(data_vector)
+        pred_saved_step = row + timestep
+    x_pred = np.asarray(x_pred)

507-510: Flatten not assigned; prediction on wrong shape.

Call model.predict on a 2‑D array and assign the flattened result.

Apply this diff:

-    predictions = model.predict(x_pred.reshape(1, -1))
-    predictions.flatten()
+    predictions = model.predict(x_pred)
+    predictions = predictions.flatten()

528-535: Insert proper timestamp type instead of time.time().

time.time() returns seconds float; use an actual UTC timestamp for TIMESTAMP WITH TIME ZONE.

Apply this diff:

-    import time
+    from datetime import datetime, timezone
@@
-                cur.execute(query, (float(entry), time.time(), model_name))
+                cur.execute(query, (float(entry), datetime.now(timezone.utc), model_name))

🧹 Nitpick comments (2)

docs/topic/ml/index.md (1)
280-284: Remove or use unused link reference definitions (MD053).

These references are defined but not used in this file. Either link to them in the content or remove to satisfy markdownlint.

Apply this diff if you choose to remove:
-[AutoML with PyCaret and CrateDB]: https://github.com/crate/cratedb-examples/tree/main/topic/machine-learning/pycaret
-[automl-classify-github]: https://github.com/crate/cratedb-examples/blob/main/topic/machine-learning/pycaret/automl_classification_with_pycaret.py
-[automl-classify-colab]: https://colab.research.google.com/github/crate/cratedb-examples/blob/main/topic/machine-learning/pycaret/automl_classification_with_pycaret.ipynb
-[automl-forecasting-github]: https://github.com/crate/cratedb-examples/blob/main/topic/machine-learning/pycaret/automl_timeseries_forecasting_with_pycaret.ipynb
-[automl-forecasting-colab]: https://colab.research.google.com/github/crate/cratedb-examples/blob/main/topic/machine-learning/pycaret/automl_timeseries_forecasting_with_pycaret.ipynb
docs/integrate/tensorflow/tutorial.rst (1)
433-436: Credentials path comment inconsistent with prerequisites.

Earlier you instruct to use ~/.aws/credentials, but this comment says ~/.aws/config. Align the comment.

Apply this diff:
-    # will be read from the ~/.aws/config file and then the files will be
+    # will be read from the ~/.aws/credentials file (via the default boto3 chain) and then the files will be

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between addcdff and 55f8f58.

📒 Files selected for processing (4)

docs/integrate/index.md (1 hunks)
docs/integrate/tensorflow/index.md (1 hunks)
docs/integrate/tensorflow/tutorial.rst (9 hunks)
docs/topic/ml/index.md (2 hunks)

✅ Files skipped from review due to trivial changes (1)

docs/integrate/index.md

🧰 Additional context used

🪛 markdownlint-cli2 (0.17.2)

docs/topic/ml/index.md

280-280: Link and image reference definitions should be needed
Unused link or image reference definition: "automl with pycaret and cratedb"