
Commit 260c02f

Merge branch 'cleanup' into fix_embed_pred_links_gpu
2 parents b8a4f5a + d80f66f commit 260c02f


41 files changed, +17928 -2417 lines changed

.github/workflows/ci.yml

Lines changed: 10 additions & 0 deletions
@@ -199,11 +199,21 @@ jobs:
         source pygraphistry/bin/activate
         ./bin/typecheck.sh

+    - name: Full dbscan tests (rich featurize)
+      run: |
+        source pygraphistry/bin/activate
+        ./bin/test-dbscan.sh
+
     - name: Full feature tests (rich featurize)
       run: |
         source pygraphistry/bin/activate
         ./bin/test-features.sh

+    - name: Full search tests (rich featurize)
+      run: |
+        source pygraphistry/bin/activate
+        ./bin/test-text.sh
+
     - name: Full umap tests (rich featurize)
       run: |
         source pygraphistry/bin/activate

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -7,7 +7,16 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm

 ## [Development]

+### Changed
+* AI: moves public `g.g_dgl` from KG `embed` method to private method `g._kg_dgl`
+* AI: BREAKING CHANGES: to return matrices during transform, set the flag: `X, y = g.transform(df, return_graph=False)` default behavior is ~ `g2 = g.transform(df)` returning a Plottable instance.
+
 ### Added
+* AI: all `transform_*` methods return graphistry Plottable instances, using an infer_graph method. To return matrices, set the `return_graph=False` flag.
+* AI: adds `g.get_matrix(**kwargs)` general method to retrieve (sub)-feature/target matrices
+* AI: DBSCAN -- `g.featurize().dbscan()` and `g.umap().dbscan()` with options to use UMAP embedding, feature matrix, or subset of feature matrix via `g.dbscan(cols=[...])`
+* AI: Demo cleanup using ModelDict & new features, refactoring demos using `dbscan` and `transform` methods.
+* Tests: dbscan tests
 * AI: Easy import of featurization kwargs for `g.umap(**kwargs)` and `g.featurize(**kwargs)`
 * AI: `g.get_features_by_cols` returns featurized submatrix with `col_part` in their columns
 * AI: `g.conditional_graph` and `g.conditional_probs` assessing conditional probs and graph
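
The breaking change in the `### Changed` entry means that callers who previously unpacked matrices from `g.transform(...)` now need to pass `return_graph=False`. The sketch below illustrates that flag-dispatch pattern with a hypothetical stand-in class. `ToyPlottable` and its fields are assumptions made purely for illustration and are not graphistry's implementation; only the `return_graph` flag name and its default of returning a graph object come from the changelog.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence

Matrix = List[List[float]]


@dataclass
class ToyPlottable:
    """Hypothetical stand-in for a graph object (NOT graphistry's Plottable)."""
    node_features: Matrix = field(default_factory=list)
    node_target: Optional[Matrix] = None

    def transform(self, rows: Sequence[Sequence[float]],
                  targets: Optional[Sequence[float]] = None,
                  return_graph: bool = True):
        # Toy "featurization": cast every value to float.
        X = [[float(v) for v in row] for row in rows]
        y = None if targets is None else [[float(t)] for t in targets]
        if return_graph:
            # New default described in the changelog: return a graph-like object.
            return ToyPlottable(node_features=X, node_target=y)
        # Opt-in behaviour: return the raw matrices as a tuple.
        return X, y


g = ToyPlottable()
g2 = g.transform([[1, 2], [3, 4]])                                # graph-like object
X_new, y_new = g.transform([[1, 2], [3, 4]], return_graph=False)  # matrices
```

Code written against the old behaviour would fail when unpacking the default return value, which is why the changelog flags the change as breaking.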

README.md

Lines changed: 72 additions & 18 deletions
@@ -358,61 +358,72 @@ Automatically and intelligently transform text, numbers, booleans, and other for
 g = g.umap() # UMAP, GNNs, use features if already provided, otherwise will compute

 # other pydata libraries
-X = g._node_features # g._get_feature('nodes')
-y = g._node_target # g._get_target('nodes')
+X = g._node_features # g._get_feature('nodes') or g.get_matrix()
+y = g._node_target # g._get_target('nodes') or g.get_matrix(target=True)
 from sklearn.ensemble import RandomForestRegressor
-model = RandomForestRegressor().fit(X, y) #assumes train/test split
-new_df = pandas.read_csv(...)
-X_new, _ = g.transform(new_df, None, kind='nodes')
+model = RandomForestRegressor().fit(X, y) # assumes train/test split
+new_df = pandas.read_csv(...) # mini batch
+X_new, _ = g.transform(new_df, None, kind='nodes', return_graph=False)
 preds = model.predict(X_new)
 ```

 * Encode model definitions and compare models against each other

 ```python
 # graphistry
-from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters
+from graphistry.features import search_model, topic_model, ngrams_model, ModelDict, default_featurize_parameters, default_umap_parameters

 g = graphistry.nodes(df)
 g2 = g.umap(X=[..], y=[..], **search_model)

-# set custom encoding model with any feature kwargs
+# set custom encoding model with any feature/umap/dbscan kwargs
 new_model = ModelDict(message='encoding new model parameters is easy', **default_featurize_parameters)
 new_model.update(dict(
 y=[...],
 kind='edges',
-model_name='sbert/hf/a_cool_transformer_model',
+model_name='sbert/cool_transformer_model',
 use_scaler_target='kbins',
 n_bins=11,
 strategy='normal'))
 print(new_model)

 g3 = g.umap(X=[..], **new_model)
 # compare g2 vs g3 or add to different pipelines
-# ...
 ```


 See `help(g.featurize)` for more options

 ### [sklearn-based UMAP](https://umap-learn.readthedocs.io/en/latest/), [cuML-based UMAP](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=umap#cuml.UMAP)

-* Reduce dimensionality and plot a similarity graph from feature vectors:
+* Reduce dimensionality by plotting a similarity graph from feature vectors:

 ```python
 # automatic feature engineering, UMAP
 g = graphistry.nodes(df).umap()

-# plot the similarity graph even though there was no explicit edge_dataframe passed in -- it is created during UMAP.
+# plot the similarity graph without any explicit edge_dataframe passed in -- it is created during UMAP.
 g.plot()
 ```

 * Apply a trained model to new data:

 ```python
 new_df = pd.read_csv(...)
-embeddings, X_new, _ = g.transform_umap(new_df, None, kind='nodes')
+embeddings, X_new, _ = g.transform_umap(new_df, None, kind='nodes', return_graph=False)
 ```
+* Infer a new graph from new data using the old umap coordinates to run inference without having to train a new umap model.
+
+```python
+new_df = pd.read_csv(...)
+g2 = g.transform_umap(new_df, return_graph=True) # return_graph=True is default
+g2.plot() #
+
+# or if you want the new minibatch to cluster to closest points in previous fit:
+g3 = g.transform_umap(new_df, return_graph=True, merge_policy=True)
+g3.plot() # useful to see how new data connects to old -- play with `sample` and `n_neighbors` to control how much of old to include
+```
+

 * UMAP supports many options, such as supervised mode, working on a subset of columns, and passing arguments to underlying `featurize()` and UMAP implementations (see `help(g.umap)`):


@@ -451,11 +462,11 @@ See `help(g.umap)` for more options

 from [your_training_pipeline] import train, model
 # Train
-g = graphistry.nodes(df).build_gnn(y=`target`)
+g = graphistry.nodes(df).build_gnn(y_nodes=`target`)
 G = g.DGL_graph
 train(G, model)
 # predict on new data
-X_new, _ = g.transform(new_df, None, kind='nodes' or 'edges') # no targets
+X_new, _ = g.transform(new_df, None, kind='nodes' or 'edges', return_graph=False) # no targets
 predictions = model.predict(G_new, X_new)
 ```


@@ -480,12 +491,21 @@ GNN support is rapidly evolving, please contact the team directly or on Slack fo
 #encode text as paraphrase embeddings, supports any sbert model
 model_name = "paraphrase-MiniLM-L6-v2")

+# or use convenience `ModelDict` to store parameters
+
+from graphistry.features import search_model
+g2 = g.featurize(X = ['text_col_1', .., 'text_col_n'], kind='nodes', **search_model)
+
+# query using the power of transformers to find richly relevant results
+
 results_df, query_vector = g2.search('my natural language query', ...)

-print(results_df[['_distance', 'text_col_1', ..., 'text_col_n']]) #sorted by relevancy
+print(results_df[['_distance', 'text_col', ..]]) #sorted by relevancy
+
+# or see graph of matching entities and original edges

-# or see graph of matching entities and similarity edges (or optional original edges)
 g2.search_graph('my natural language query', ...).plot()
+
 ```


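
The search example in the hunk above prints results sorted by a `_distance` column. As a rough illustration of what distance-ranked retrieval means, the sketch below ranks toy rows by cosine distance to a query vector. The vectors, the `rank_by_distance` helper, and the choice of cosine distance are all assumptions made for illustration; the diff does not state which metric or index graphistry uses.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vector: Sequence[float],
                     rows: Dict[str, Sequence[float]],
                     top_n: int = 3) -> List[Tuple[str, float]]:
    """Hypothetical helper: score every row, smallest distance first."""
    scored = [(text, cosine_distance(query_vector, vec)) for text, vec in rows.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_n]


# Toy embeddings invented for this example (not produced by any real model).
toy_rows = {
    "reset my password": [0.9, 0.1, 0.0],
    "forgot login credentials": [0.8, 0.2, 0.1],
    "quarterly sales report": [0.0, 0.1, 0.9],
}
results = rank_by_distance([1.0, 0.0, 0.0], toy_rows, top_n=2)
for text, distance in results:
    print(f"{distance:.4f}  {text}")
```

Smaller distances mean closer matches, which is consistent with the README's note that results are sorted by relevancy.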

@@ -521,7 +541,7 @@ See `help(g.search_graph)` for options
 relation=['relationship_1', 'relationship_4', ..],
 destination=['entity_l', 'entity_m', ..],
 threshold=0.9, # score threshold
-return_dataframe=False) # set to `True` to return dataframe, or just access via `g5._edges`
+return_dataframe=False) # set to `True` to return dataframe, or just access via `g4._edges`
 ```

 * Detect Anomalous Behavior (example use cases such as Cyber, Fraud, etc)
@@ -552,8 +572,42 @@ See `help(g.search_graph)` for options
 g2.predict_links_all(threshold=0.95).plot()
 ```

-See `help(g.embed)`, `help(g.predict_links)` , `help(g.predict_links_all)` for options
+See `help(g.embed)`, `help(g.predict_links)` , or `help(g.predict_links_all)` for options
+
+### DBSCAN
+
+* Enrich UMAP embeddings or featurization dataframe with GPU or CPU DBSCAN
+
+```python
+g = graphistry.edges(edf, 'src', 'dst').nodes(ndf, 'node')
+
+# cluster by UMAP embeddings
+kind = 'nodes' | 'edges'
+g2 = g.umap(kind=kind).dbscan(kind=kind)
+print(g2._nodes['_dbscan']) | print(g2._edges['_dbscan'])
+
+# dbscan in `umap` or `featurize` via flag
+g2 = g.umap(dbscan=True, min_dist=0.2, min_samples=1)
+
+# or via chaining,
+g2 = g.umap().dbscan(min_dist=1.2, min_samples=2, **kwargs)
+
+# cluster by feature embeddings
+g2 = g.featurize().dbscan(**kwargs)
+
+# cluster by a given set of feature column attributes, inherited from `g.get_matrix(cols)`
+g2 = g.featurize().dbscan(cols=['ip_172', 'location', 'alert'], **kwargs)
+
+# equivalent to above (ie, cols != None and umap=True will still use features dataframe, rather than UMAP embeddings)
+g2 = g.umap().dbscan(cols=['ip_172', 'location', 'alert'], umap=True | False, **kwargs)
+g2.plot() # color by `_dbscan`
+
+new_df = pd.read_csv(..)
+# transform on new data according to fit dbscan model
+g3 = g2.transform_dbscan(new_df)
+```

+See `help(g.dbscan)` or `help(g.transform_dbscan)` for options

 ### Quickly configurable

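
The DBSCAN section added to the README in this commit stores cluster labels in a `_dbscan` column. For readers unfamiliar with the algorithm, here is a minimal pure-Python sketch of textbook DBSCAN. It is not graphistry's implementation, which the README says runs on GPU or CPU backends. The `toy_dbscan` name and the `eps` radius parameter are assumptions that follow the standard formulation, and the noise label `-1` is a common convention rather than something the diff states.

```python
from math import dist
from typing import List, Optional, Sequence

NOISE = -1


def toy_dbscan(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    labels: List[Optional[int]] = [None] * n  # None means "not visited yet"

    def neighbors(i: int) -> List[int]:
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # core point: keep expanding
    return [label if label is not None else NOISE for label in labels]


toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
toy_labels = toy_dbscan(toy_points, eps=1.5, min_samples=2)
print(toy_labels)
```

With these toy points, the first three and the next two form separate clusters, and the isolated last point is labelled as noise.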

bin/test-dbscan.sh

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+#!/bin/bash
+set -ex
+
+# Run from project root
+# - Args get passed to pytest phase
+# Non-zero exit code on fail
+
+# Assume [umap-learn,test]
+
+python -m pytest --version
+
+python -B -m pytest -vv \
+    graphistry/tests/test_compute_cluster.py
+
+#chmod +x bin/test-dbscan.sh

bin/test-text.sh

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+#!/bin/bash
+set -ex
+
+# Run from project root
+# - Args get passed to pytest phase
+# Non-zero exit code on fail
+
+# Assume [umap-learn,test]
+
+python -m pytest --version
+
+python -B -m pytest -vv \
+    graphistry/tests/test_text_utils.py
+
+# chmod +x bin/test-text.sh
