@codeflash-ai codeflash-ai bot commented Dec 23, 2025

📄 16,160% (161.60x) speedup for find_last_node in src/algorithms/graph.py

⏱️ Runtime: 73.3 milliseconds → 451 microseconds (best of 250 runs)

📝 Explanation and details

The optimized code achieves a 162x speedup by replacing a nested scan, in which every node is checked against every edge, with a single precomputed set lookup per node.

Key optimization:
The original code uses `all(e["source"] != n["id"] for e in edges)` inside a generator that iterates through nodes. This creates an O(N×E) algorithm where, for each node, it checks against ALL edges to verify that none has that node as a source.

The optimized version precomputes `edge_sources = {e["source"] for e in edges}` as a set once (O(E) time), then performs constant-time O(1) set membership checks with `n["id"] not in edge_sources` for each node (O(N) time total). This reduces overall complexity from O(N×E) to O(N+E).
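As a minimal sketch of the two variants described above: only the `all(...)` expression, the `edge_sources` set, and the membership check come from this PR. The function names, the `next(..., None)` wrapper, and the type hints are assumptions reconstructed from the test comments ("should return None", "should return the first one found"), not the actual contents of `src/algorithms/graph.py`.

```python
from typing import Optional


def find_last_node_original(nodes: list, edges: list) -> Optional[dict]:
    # Hypothetical reconstruction of the original: for every node, scan the
    # edge list until an edge names that node as its source. O(N*E) overall.
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


def find_last_node_optimized(nodes: list, edges: list) -> Optional[dict]:
    # O(E): collect every id that appears as an edge source, once.
    edge_sources = {e["source"] for e in edges}
    # O(N): first node whose id is not a source, or None if there is none.
    return next((n for n in nodes if n["id"] not in edge_sources), None)


nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
print(find_last_node_original(nodes, edges))   # {'id': 3}
print(find_last_node_optimized(nodes, edges))  # {'id': 3}
```

Both variants return the first node in list order that never appears as a `source`, so the result is the same whenever every `source` value is hashable.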

Why this matters:

  • Set lookup is O(1) on average vs. O(E) list iteration: Hash-based set membership is dramatically faster than scanning through all edges repeatedly
  • Single pass over edges: The set is built once upfront rather than scanning edges for every node candidate
  • Scales exceptionally well: Test results show the optimization shines with larger graphs:
    • Small graphs (2-3 nodes): roughly 50-85% faster
    • Large linear chain (1000 nodes): 32,000%+ faster (18.3ms → 56.9μs)
    • Large fully connected graph (100 nodes, 9900 edges): 8,600%+ faster (17.1ms → 196μs)
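To see why the gain varies so much between test graphs, the work each approach does can be counted rather than timed. The sketch below is illustrative only: both counting helpers are hypothetical and mirror the short-circuiting of `all(...)` and `next(...)` as described above, under the assumption that the original has that shape.

```python
def count_original_comparisons(nodes, edges):
    """Count the e["source"] vs n["id"] comparisons made by the nested scan."""
    comparisons = 0
    for n in nodes:
        is_source = False
        for e in edges:
            comparisons += 1
            if e["source"] == n["id"]:
                is_source = True  # all(...) would short-circuit here
                break
        if not is_source:
            break  # next(...) would stop at the first sink
    return comparisons


def count_set_operations(nodes, edges):
    """Count set insertions plus membership checks for the set-based version."""
    edge_sources = set()
    operations = 0
    for e in edges:
        edge_sources.add(e["source"])
        operations += 1
    for n in nodes:
        operations += 1
        if n["id"] not in edge_sources:
            break  # first sink found
    return operations


N = 1000
nodes = [{"id": i} for i in range(N)]
chain = [{"source": i, "target": i + 1} for i in range(N - 1)]
star = [{"source": 0, "target": i} for i in range(1, N)]

print(count_original_comparisons(nodes, chain))  # 500499
print(count_set_operations(nodes, chain))        # 1999
print(count_original_comparisons(nodes, star))   # 1000
print(count_set_operations(nodes, star))         # 1001
```

In the chain, the only sink is the last node, so the nested scan walks a growing prefix of the edge list for every node. In the star, the second node is already a sink, so both approaches do about the same amount of work, which is consistent with the much smaller speedup reported for `test_large_star_graph`.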

Impact analysis:
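This function finds a "terminal" or "sink" node in a directed graph (a node with no outgoing edges) and returns the first one in list order, or None if there is none. If it is called frequently in graph processing pipelines, especially with larger graphs, this optimization provides substantial performance gains. The improvement is most pronounced when many nodes have to be checked against a long edge list before a sink is found, as in the 1000-node chain and cycle tests (about 18ms → 56μs). When a sink appears near the start of the node list, as in the 1000-node star graph (37.8μs → 20.6μs), the gain is modest.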

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 39 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests
from __future__ import annotations

# imports
import pytest  # used for our unit tests
from src.algorithms.graph import find_last_node

# unit tests

# ---------------------------
# Basic Test Cases
# ---------------------------


def test_single_node_no_edges():
    # Single node, no edges: should return the node itself
    nodes = [{"id": 1}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.25μs -> 1.00μs (25.0% faster)


def test_two_nodes_one_edge():
    # Two nodes, one edge: should return the target node
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.08μs (76.8% faster)


def test_three_nodes_linear_chain():
    # Three nodes, linear chain: 1->2->3, last node is 3
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.29μs -> 1.25μs (83.3% faster)


def test_multiple_last_nodes_returns_first():
    # Two nodes with no outgoing edges: should return the first one found
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}]
    # Both 2 and 3 have no outgoing edges
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.79μs -> 1.08μs (65.5% faster)


# ---------------------------
# Edge Test Cases
# ---------------------------


def test_empty_nodes_and_edges():
    # No nodes, no edges: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 833ns -> 875ns (4.80% slower)


def test_nodes_with_only_incoming_edges():
    # All nodes have only incoming edges, no outgoing edges
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [
        {"source": 4, "target": 1},
        {"source": 5, "target": 2},
        {"source": 6, "target": 3},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.62μs -> 1.21μs (34.5% faster)


def test_nodes_with_self_loop():
    # Node with a self-loop should not be considered a last node
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.79μs -> 1.12μs (59.2% faster)


def test_all_nodes_with_outgoing_edges():
    # Every node has an outgoing edge: should return None
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.17μs (60.7% faster)


def test_node_ids_are_strings():
    # Node ids as strings
    nodes = [{"id": "a"}, {"id": "b"}]
    edges = [{"source": "a", "target": "b"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.00μs -> 1.17μs (71.4% faster)


def test_edges_with_unknown_nodes():
    # Edge refers to unknown nodes (should be ignored)
    nodes = [{"id": 1}]
    edges = [{"source": 2, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.54μs -> 1.08μs (42.2% faster)


def test_duplicate_nodes():
    # Duplicate nodes in the list
    nodes = [{"id": 1}, {"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.96μs -> 1.17μs (67.8% faster)


def test_edge_dict_missing_source_key():
    # Edge dict missing 'source' key: expected to raise KeyError
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"target": 2}]
    with pytest.raises(KeyError):
        find_last_node(nodes, edges)  # 1.96μs -> 792ns (147% faster)


def test_edge_dict_missing_target_key():
    # Edge dict missing 'target' key should not affect last node logic
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.17μs (60.8% faster)


def test_edge_with_none_source():
    # Edge with None as source should not match any node
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": None, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.67μs -> 1.12μs (48.2% faster)


def test_node_id_none():
    # Node with id None should be considered as a possible last node
    nodes = [{"id": None}, {"id": 2}]
    edges = [{"source": 2, "target": None}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.58μs -> 1.08μs (46.0% faster)


# ---------------------------
# Large Scale Test Cases
# ---------------------------


def test_large_linear_chain():
    # Large linear chain: 0->1->2->...->999, last node is 999
    nodes = [{"id": i} for i in range(1000)]
    edges = [{"source": i, "target": i + 1} for i in range(999)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.3ms -> 56.9μs (32077% faster)


def test_large_fully_connected_graph():
    # Every node has outgoing edges to all other nodes: no last node
    nodes = [{"id": i} for i in range(100)]
    edges = [
        {"source": i, "target": j} for i in range(100) for j in range(100) if i != j
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 17.1ms -> 196μs (8617% faster)


def test_large_sparse_graph_multiple_last_nodes():
    # Large graph with half the nodes having no outgoing edges
    nodes = [{"id": i} for i in range(500)]
    edges = [{"source": i, "target": i + 1} for i in range(250)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.16ms -> 13.5μs (8528% faster)


def test_large_graph_with_isolated_nodes():
    # Large graph with isolated nodes (no edges)
    nodes = [{"id": i} for i in range(1000)]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.38μs -> 1.08μs (27.0% faster)


def test_large_graph_with_duplicate_ids():
    # Large node list with a duplicate id (0 appears twice); only id 0 has an outgoing edge
    nodes = [{"id": i} for i in range(500)] + [{"id": 0}]
    edges = [{"source": 0, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.25μs (53.4% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from __future__ import annotations

# imports
import pytest
from src.algorithms.graph import find_last_node

# unit tests

# --- Basic Test Cases ---


def test_single_node_no_edges():
    # Single node, no edges: should return the node itself
    nodes = [{"id": 1}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.21μs -> 958ns (26.1% faster)


def test_two_nodes_one_edge():
    # Two nodes, one edge from 1 to 2: should return node 2
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (62.9% faster)


def test_three_nodes_linear_chain():
    # 1 -> 2 -> 3: should return node 3
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.21μs -> 1.25μs (76.6% faster)


def test_three_nodes_two_leaves():
    # 1 -> 2, 1 -> 3: both 2 and 3 are leaves, should return first found (2)
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.71μs -> 1.12μs (51.9% faster)
    # Should be one of the leaves


def test_disconnected_nodes():
    # 1 -> 2, 3 is disconnected: 2 and 3 are both non-sources, so the first in list order (2) is returned
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.12μs (62.9% faster)


# --- Edge Test Cases ---


def test_empty_nodes_and_edges():
    # No nodes, no edges: should return None
    nodes = []
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 834ns -> 916ns (8.95% slower)


def test_nodes_but_no_edges():
    # Multiple nodes, no edges: should return the first node
    nodes = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
    edges = []
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.25μs -> 958ns (30.5% faster)


def test_all_nodes_are_sources():
    # Every node is a source in at least one edge: should return None
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 2, "target": 1}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.88μs -> 1.17μs (60.7% faster)


def test_cycle_graph():
    # 1 -> 2 -> 3 -> 1 (cycle): no node is a leaf, should return None
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [
        {"source": 1, "target": 2},
        {"source": 2, "target": 3},
        {"source": 3, "target": 1},
    ]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.21μs -> 1.29μs (71.1% faster)


def test_node_with_multiple_incoming_edges():
    # 1 -> 3, 2 -> 3: 3 is a leaf
    nodes = [{"id": 1}, {"id": 2}, {"id": 3}]
    edges = [{"source": 1, "target": 3}, {"source": 2, "target": 3}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.25μs -> 1.21μs (86.1% faster)


def test_duplicate_edges():
    # 1 -> 2 (twice): 2 is still a leaf
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2}, {"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.83μs -> 1.21μs (51.8% faster)


def test_nodes_with_non_integer_ids():
    # Node IDs are strings
    nodes = [{"id": "x"}, {"id": "y"}]
    edges = [{"source": "x", "target": "y"}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 2.00μs -> 1.12μs (77.8% faster)


def test_edge_with_missing_source_key():
    # Edge missing 'source' key should raise KeyError
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"target": 2}]
    with pytest.raises(KeyError):
        find_last_node(nodes, edges)  # 2.29μs -> 792ns (189% faster)


def test_edge_with_extra_keys():
    # Edge with extra keys should be ignored
    nodes = [{"id": 1}, {"id": 2}]
    edges = [{"source": 1, "target": 2, "weight": 5}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.92μs -> 1.17μs (64.4% faster)


def test_nodes_with_extra_keys():
    # Node with extra keys should be returned as is
    nodes = [{"id": 1, "label": "A"}, {"id": 2, "label": "B"}]
    edges = [{"source": 1, "target": 2}]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 1.79μs -> 1.12μs (59.3% faster)


# --- Large Scale Test Cases ---


def test_large_linear_chain():
    # Large chain: 0 -> 1 -> ... -> 999
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": i + 1} for i in range(N - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.3ms -> 56.8μs (32162% faster)


def test_large_star_graph():
    # Star: 0 -> 1, 0 -> 2, ..., 0 -> 999
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": 0, "target": i} for i in range(1, N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 37.8μs -> 20.6μs (83.0% faster)


def test_large_disconnected_graph():
    # 500 disconnected nodes, 500 in a chain
    N = 500
    nodes = [{"id": i} for i in range(N * 2)]
    edges = [{"source": i, "target": i + 1} for i in range(N, N * 2 - 1)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 20.8μs -> 14.9μs (39.8% faster)
    # Leaves: 0..499 (disconnected), and 999 (end of chain)
    leaf_ids = set(range(N)) | {N * 2 - 1}


def test_large_graph_all_sources():
    # Every node is a source in a cycle (0->1->...->999->0)
    N = 1000
    nodes = [{"id": i} for i in range(N)]
    edges = [{"source": i, "target": (i + 1) % N} for i in range(N)]
    codeflash_output = find_last_node(nodes, edges)
    result = codeflash_output  # 18.2ms -> 56.3μs (32279% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
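The comparison that the `codeflash_output` comment refers to can be sketched as follows. This is not Codeflash's actual harness: `outcome`, both function names, and the case list are hypothetical, and the two implementations are reconstructions from the expressions quoted in the explanation. The idea is to run both versions on the same inputs and treat a raised exception as an outcome to compare, just like a return value.

```python
def find_last_node_original(nodes, edges):
    # Hypothetical reconstruction of the nested-scan version.
    return next(
        (n for n in nodes if all(e["source"] != n["id"] for e in edges)),
        None,
    )


def find_last_node_optimized(nodes, edges):
    # Hypothetical reconstruction of the set-based version.
    edge_sources = {e["source"] for e in edges}
    return next((n for n in nodes if n["id"] not in edge_sources), None)


def outcome(fn, nodes, edges):
    """Return ("ok", value) or ("raises", exception type name)."""
    try:
        return ("ok", fn(nodes, edges))
    except Exception as exc:
        return ("raises", type(exc).__name__)


# Inputs taken from the generated tests above.
cases = [
    ([], []),
    ([{"id": 1}], []),
    ([{"id": 1}, {"id": 2}], [{"source": 1, "target": 2}]),
    ([{"id": 1}, {"id": 2}], [{"source": 1, "target": 1}]),  # self-loop
    ([{"id": 1}, {"id": 2}],
     [{"source": 1, "target": 2}, {"source": 2, "target": 1}]),  # cycle
    ([{"id": 1}, {"id": 2}], [{"target": 2}]),  # missing "source" key
]

for case_nodes, case_edges in cases:
    before = outcome(find_last_node_original, case_nodes, case_edges)
    after = outcome(find_last_node_optimized, case_nodes, case_edges)
    assert before == after, (case_nodes, case_edges, before, after)

print("all cases agree")
```

Under these reconstructions, one input outside the generated tests would behave differently: with an empty node list and an edge that lacks `"source"`, the nested scan never touches the edges and returns `None`, while the set-based version raises `KeyError` when building the set. Whether that matters depends on whether such inputs can occur in practice.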

To edit these changes, run `git checkout codeflash/optimize-find_last_node-mjicte92` and push.

@codeflash-ai codeflash-ai bot requested a review from KRRT7 on December 23, 2025 09:00
@codeflash-ai codeflash-ai bot added the labels "⚡️ codeflash" (Optimization PR opened by Codeflash AI) and "🎯 Quality: High" (Optimization Quality according to Codeflash) on Dec 23, 2025