I then added the SentenceEncodingMixin class to the TNTM model class and fixed some issues in how the umap_model is built. When I re-ran the training code, I got the following error:
2024-12-19 15:48:07.837 | INFO | stream_topic.models.abstract_helper_models.base:prepare_embeddings:225 - --- Creating /hongyi/stream/sentence-transformers/all-MiniLM-L6-v2 document embeddings ---
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2225/2225 [00:54<00:00, 40.89it/s]
2024-12-19 15:49:02.694 | INFO | stream_topic.models.tntm:_initialize_datamodule:371 - --- Initializing Datamodule for TNTM ---
2024-12-19 15:49:02.964 | INFO | stream_topic.models.tntm:_prepare_word_embeddings:335 - --- Creating /hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2 word embeddings ---
Batches: 100% 253/253 [00:01<00:00, 129.29it/s]
/hongyi/STREAM/stream_topic/models/neural_base_models/ UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.word_embeddings_projected = torch.tensor(word_embeddings_projected)
2024-12-19 15:49:38.776 | INFO | stream_topic.models.tntm:_initialize_trainer:279 - --- Initializing Trainer for TNTM ---
Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/ Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
2024-12-19 15:49:38.798 | INFO | stream_topic.models.tntm:fit:489 - --- Training TNTM topic model ---
You are using a CUDA device ('NVIDIA A800 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/callbacks/ Checkpoint directory /hongyi/STREAM/checkpoints exists and is not empty.
  | Name                    | Type             | Params | Mode
0 | model                   | TNTMBase         | 5.2 M  | train
1 | model.inference_network | InferenceNetwork | 5.2 M  | train
2 | model.mean_bn           | BatchNorm1d      | 10     | train
3 | model.logvar_bn         | BatchNorm1d      | 10     | train
4 | model.beta_batchnorm    | BatchNorm1d      | 16.1 K | train
5 | model.theta_drop        | Dropout          | 0      | train
5.2 M     Trainable params
8.1 K     Non-trainable params
5.2 M     Total params
20.916    Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0% 0/2 [00:00<?, ?it/s]
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/ The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=255` in the `DataLoader` to improve performance.
2024-12-19 15:49:38.955 | ERROR | stream_topic.models.tntm:fit:496 - Error in training: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
RuntimeError Traceback (most recent call last)
Cell In[2], line 3
1 from stream_topic.models import KmeansTM,CEDC, ETM,DCTE,LDA,ProdLDA,NSTM,CTM,CTMNeg,CBC,BERTopicTM,TNTM
2 model = TNTM(word_embedding_model_name="/hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2",embedding_model_name="/hongyi/stream/sentence-transformers/all-MiniLM-L6-v2")#
----> 3,n_topics=5)#
5 topics = model.get_topics()
6 print(topics)
File ~/STREAM/stream_topic/models/, in, dataset, n_topics, val_size, lr, lr_patience, patience, factor, weight_decay, max_epochs, batch_size, shuffle, random_state, inferece_type, checkpoint_path, monitor, mode, trial, optimize, **kwargs)
490 self._status = TrainingStatus.RUNNING
491 #"cuda:0")
492 # print(self.model.device)
--> 493, self.data_module)
495 except Exception as e:
496 logger.error(f"Error in training: {e}")
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/, in, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
541 self.state.status = TrainerStatus.RUNNING
542 = True
--> 543 call._call_and_handle_interrupt(
544 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
545 )
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
42 if trainer.strategy.launcher is not None:
43 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44 return trainer_fn(*args, **kwargs)
46 except _TunerExitException:
47 _call_teardown_hook(trainer)
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
572 assert self.state.fn is not None
573 ckpt_path = self._checkpoint_connector._select_ckpt_path(
574 self.state.fn,
575 ckpt_path,
576 model_provided=True,
577 model_connected=self.lightning_module is not None,
578 )
--> 579 self._run(model, ckpt_path=ckpt_path)
581 assert self.state.stopped
582 = False
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/, in Trainer._run(self, model, ckpt_path)
981 self._signal_connector.register_signal_handlers()
983 # ----------------------------
985 # ----------------------------
--> 986 results = self._run_stage()
988 # ----------------------------
989 # POST-Training CLEAN UP
990 # ----------------------------
991 log.debug(f"{self.__class__.__name__}: trainer tearing down")
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/, in Trainer._run_stage(self)
1026 if
1027 with isolate_rng():
-> 1028 self._run_sanity_check()
1029 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/, in Trainer._run_sanity_check(self)
1054 call._call_callback_hooks(self, "on_sanity_check_start")
1056 # run eval step
-> 1057
1059 call._call_callback_hooks(self, "on_sanity_check_end")
1061 # reset logger connector
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/, in _no_grad_context.<locals>._decorator(self, *args, **kwargs)
180 context_manager = torch.no_grad
181 with context_manager():
--> 182 return loop_run(self, *args, **kwargs)
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/, in
133 self.batch_progress.is_last_batch = data_fetcher.done
134 # run step hooks
--> 135 self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
136 except StopIteration:
137 # this needs to wrap the `*_step` call too (not just `next`) for `dataloader_iter` support
138 break
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/, in _EvaluationLoop._evaluation_step(self, batch, batch_idx, dataloader_idx, dataloader_iter)
390 hook_name = "test_step" if trainer.testing else "validation_step"
391 step_args = (
392 self._build_step_args_from_hook_kwargs(hook_kwargs, hook_name)
393 if not using_dataloader_iter
394 else (dataloader_iter,)
395 )
--> 396 output = call._call_strategy_hook(trainer, hook_name, *step_args)
398 self.batch_progress.increment_processed()
400 if using_dataloader_iter:
401 # update the hook kwargs now that the step method might have consumed the iterator
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/, in _call_strategy_hook(trainer, hook_name, *args, **kwargs)
308 return None
310 with trainer.profiler.profile(f"[Strategy]{trainer.strategy.__class__.__name__}.{hook_name}"):
--> 311 output = fn(*args, **kwargs)
313 # restore current_fx when nested context
314 pl_module._current_fx_name = prev_fx_name
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/strategies/, in Strategy.validation_step(self, *args, **kwargs)
409 if self.model != self.lightning_module:
410 return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
--> 411 return self.lightning_module.validation_step(*args, **kwargs)
File ~/STREAM/stream_topic/models/abstract_helper_models/, in NeuralBaseModel.validation_step(self, batch, batch_idx)
45 def validation_step(self, batch, batch_idx):
---> 46 val_loss = self.model.compute_loss(batch)
48 self.log(
49 "val_loss",
50 val_loss,
54 logger=True,
55 )
57 return val_loss
File ~/STREAM/stream_topic/models/neural_base_models/, in TNTMBase.compute_loss(self, x)
201 """
202 Computes the loss for the model.
212 The computed loss.
213 """
214 x_bow = x['bow']
--> 215 log_recon, posterior_mean, posterior_logvar = self.forward(x)
216 loss = self.loss_function(x_bow, log_recon, posterior_mean, posterior_logvar)
217 return loss
File ~/STREAM/stream_topic/models/neural_base_models/, in TNTMBase.forward(self, x)
124 """
125 Forward pass through the network.
139 The log variance of the variational posterior.
140 """
141 theta, posterior_mean, posterior_logvar = self.get_theta(x)
--> 143 log_beta = self.calc_log_beta()
147 # prodLDA vs LDA
148 # use numerical trick to compute log(beta @ theta )
149 log_theta = torch.nn.LogSoftmax(dim=-1)(theta) #calculate log theta = log_softmax(theta_hat)
File ~/STREAM/stream_topic/models/neural_base_models/, in TNTMBase.calc_log_beta(self)
109 log_probs = torch.zeros(self.n_topics, self.vocab_size)
111 for i, dis in enumerate(normal_dis_lis):
--> 112 log_probs[i] = dis.log_prob(self.word_embeddings_projected)
113 return log_probs
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/torch/distributions/, in LowRankMultivariateNormal.log_prob(self, value)
212 if self._validate_args:
213 self._validate_sample(value)
--> 214 diff = value - self.loc
215 M = _batch_lowrank_mahalanobis(
216 self._unbroadcasted_cov_factor,
217 self._unbroadcasted_cov_diag,
218 diff,
219 self._capacitance_tril,
220 )
221 log_det = _batch_lowrank_logdet(
222 self._unbroadcasted_cov_factor,
223 self._unbroadcasted_cov_diag,
224 self._capacitance_tril,
225 )
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Finally, I tried moving both self.model and its parameters to “cuda:0”, but the same error is still reported.
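One possible explanation, offered as a guess rather than a confirmed diagnosis: the UserWarning in the log shows self.word_embeddings_projected being assigned with torch.tensor(...), which suggests it is stored as an ordinary attribute. Module.to() and .cuda() only move parameters and registered buffers, so an ordinary tensor attribute would stay on the CPU while self.loc in the traceback is on cuda:0. That would also explain why moving self.model by hand changes nothing. The sketch below uses two hypothetical toy classes (PlainAttr and BufferAttr are not part of STREAM). It uses .double() as a CPU-only stand-in for a device move, since both conversions are applied to parameters and buffers only.

```python
import torch
from torch import nn


class PlainAttr(nn.Module):
    """Hypothetical stand-in: keeps the embeddings as an ordinary attribute."""

    def __init__(self, emb: torch.Tensor):
        super().__init__()
        self.loc = nn.Parameter(torch.zeros(emb.shape[1]))
        # Ordinary attribute: invisible to .to(), .cuda(), .double(), state_dict()
        self.word_embeddings_projected = emb.clone().detach()


class BufferAttr(nn.Module):
    """Hypothetical stand-in: registers the embeddings as a buffer."""

    def __init__(self, emb: torch.Tensor):
        super().__init__()
        self.loc = nn.Parameter(torch.zeros(emb.shape[1]))
        # Registered buffer: converted together with the parameters
        self.register_buffer("word_embeddings_projected", emb.clone().detach())


emb = torch.randn(6, 4)  # float32 by default
plain = PlainAttr(emb)
buffered = BufferAttr(emb)

plain_buffers = [name for name, _ in plain.named_buffers()]
buffered_buffers = [name for name, _ in buffered.named_buffers()]

# Convert both modules; only parameters and buffers are affected.
plain.double()
buffered.double()

plain_dtype = plain.word_embeddings_projected.dtype        # left untouched
buffered_dtype = buffered.word_embeddings_projected.dtype  # converted
print(plain_buffers, buffered_buffers, plain_dtype, buffered_dtype)
```

If this is indeed the cause, registering the tensor with register_buffer in TNTMBase would be the first thing to try. Note also that calc_log_beta creates log_probs with torch.zeros(self.n_topics, self.vocab_size), which defaults to the CPU, so passing device=... there may be needed as well. Neither change has been verified against the STREAM code base.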
The SentenceEncoding issue is fixed on main. However, I currently cannot reproduce the device issue. Since we are using Lightning, all tensors are usually transferred to the same device, so I am not sure where this issue might come from. I'll try to reproduce it once I am on a machine with a GPU and will revisit this issue then.
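To narrow down where a stray CPU tensor might come from without needing a GPU, one option is to list every tensor that a module holds as an ordinary attribute instead of as a parameter or buffer, since those are the ones a device transfer would skip. The helper below is a minimal sketch. find_unregistered_tensors and the Toy class are hypothetical names for illustration and do not exist in STREAM or Lightning. It relies on the fact that nn.Module keeps parameters and buffers in separate internal dictionaries, so any tensor found directly in a module's __dict__ was assigned as a plain attribute.

```python
import torch
from torch import nn


def find_unregistered_tensors(module: nn.Module):
    """Return (qualified_name, device) for tensors stored as plain attributes.

    Parameters and buffers live in module._parameters / module._buffers,
    so a torch.Tensor sitting directly in vars(submodule) is unregistered.
    """
    found = []
    for mod_name, sub in module.named_modules():
        for attr, value in vars(sub).items():
            if isinstance(value, torch.Tensor):
                name = f"{mod_name}.{attr}" if mod_name else attr
                found.append((name, value.device))
    return found


class Toy(nn.Module):
    """Hypothetical module with one registered and one stray tensor."""

    def __init__(self):
        super().__init__()
        self.inner = nn.Linear(3, 2)
        self.register_buffer("registered", torch.ones(2))
        self.stray = torch.ones(2)  # plain attribute, skipped by .to(device)


stray = find_unregistered_tensors(Toy())
for name, device in stray:
    print(f"{name} is not registered and lives on {device}")
```

Calling the helper on the TNTM model's underlying module before training should show whether word_embeddings_projected, or anything else, falls into this category.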