Description
When I tried to use the TNTM model, I got the following error.
Code:
from stream_topic.models import TNTM
from stream_topic.utils import TMDataset
dataset = TMDataset()
dataset.fetch_dataset("BBC_News")
dataset.preprocess(model_type="TNTM")
model = TNTM()
model.fit(dataset)
Error:
/usr/local/lib/python3.10/dist-packages/stream_topic/models/abstract_helper_models/base.py in prepare_embeddings(self, dataset, logger)
226 f"--- Creating {self.embedding_model_name} document embeddings ---"
227 )
--> 228 embeddings = self.encode_documents(
229 dataset.texts, encoder_model=self.embedding_model_name, use_average=True
230 )
AttributeError: 'TNTM' object has no attribute 'encode_documents'
I then added the SentenceEncodingMixin class to the TNTM class definition and fixed a few issues in the umap_model construction.
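For reference, the mixin change was roughly the following (a sketch of my local edit; the import path is an assumption based on how other models in stream_topic pull in the mixin, and may differ between versions):

# stream_topic/models/tntm.py (sketch of my change, not the upstream code)
from .abstract_helper_models.base import BaseModel
from .abstract_helper_models.mixins import SentenceEncodingMixin

class TNTM(BaseModel, SentenceEncodingMixin):
    # body unchanged; inheriting SentenceEncodingMixin is what provides the
    # encode_documents() method that prepare_embeddings() in base.py calls
    ...

Re-running the training code then produced the following error: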
2024-12-19 15:48:07.837 | INFO | stream_topic.models.abstract_helper_models.base:prepare_embeddings:225 - --- Creating /hongyi/stream/sentence-transformers/all-MiniLM-L6-v2 document embeddings ---
100%|██████████| 2225/2225 [00:54<00:00, 40.89it/s]
2024-12-19 15:49:02.694 | INFO | stream_topic.models.tntm:_initialize_datamodule:371 - --- Initializing Datamodule for TNTM ---
2024-12-19 15:49:02.964 | INFO | stream_topic.models.tntm:_prepare_word_embeddings:335 - --- Creating /hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2 word embeddings ---
Batches: 100% 253/253 [00:01<00:00, 129.29it/s]
/hongyi/STREAM/stream_topic/models/neural_base_models/tntm_base.py:61: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
self.word_embeddings_projected = torch.tensor(word_embeddings_projected)
2024-12-19 15:49:38.776 | INFO | stream_topic.models.tntm:_initialize_trainer:279 - --- Initializing Trainer for TNTM ---
Trainer will use only 1 of 2 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=2)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.model_summary.ModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
2024-12-19 15:49:38.798 | INFO | stream_topic.models.tntm:fit:489 - --- Training TNTM topic model ---
You are using a CUDA device ('NVIDIA A800 80GB PCIe') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:652: Checkpoint directory /hongyi/STREAM/checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
  | Name                    | Type             | Params | Mode
---------------------------------------------------------------------
0 | model                   | TNTMBase         | 5.2 M  | train
1 | model.inference_network | InferenceNetwork | 5.2 M  | train
2 | model.mean_bn           | BatchNorm1d      | 10     | train
3 | model.logvar_bn         | BatchNorm1d      | 10     | train
4 | model.beta_batchnorm    | BatchNorm1d      | 16.1 K | train
5 | model.theta_drop        | Dropout          | 0      | train
---------------------------------------------------------------------
5.2 M Trainable params
8.1 K Non-trainable params
5.2 M Total params
20.916 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 0% 0/2 [00:00<?, ?it/s]
/hongyi/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=255` in the `DataLoader` to improve performance.
2024-12-19 15:49:38.955 | ERROR | stream_topic.models.tntm:fit:496 - Error in training: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[2], line 3
1 from stream_topic.models import KmeansTM,CEDC, ETM,DCTE,LDA,ProdLDA,NSTM,CTM,CTMNeg,CBC,BERTopicTM,TNTM
2 model = TNTM(word_embedding_model_name="/hongyi/stream/sentence-transformers/paraphrase-MiniLM-L3-v2",embedding_model_name="/hongyi/stream/sentence-transformers/all-MiniLM-L6-v2")#
----> 3 model.fit(dataset,n_topics=5)#
5 topics = model.get_topics()
6 print(topics)
File ~/STREAM/stream_topic/models/tntm.py:493, in TNTM.fit(self, dataset, n_topics, val_size, lr, lr_patience, patience, factor, weight_decay, max_epochs, batch_size, shuffle, random_state, inferece_type, checkpoint_path, monitor, mode, trial, optimize, **kwargs)
490 self._status = TrainingStatus.RUNNING
491 # self.model.to("cuda:0")
492 # print(self.model.device)
--> 493 self.trainer.fit(self.model, self.data_module)
495 except Exception as e:
496 logger.error(f"Error in training: {e}")
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:543, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
541 self.state.status = TrainerStatus.RUNNING
542 self.training = True
--> 543 call._call_and_handle_interrupt(
544 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
545 )
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
42 if trainer.strategy.launcher is not None:
43 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44 return trainer_fn(*args, **kwargs)
46 except _TunerExitException:
47 _call_teardown_hook(trainer)
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:579, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
572 assert self.state.fn is not None
573 ckpt_path = self._checkpoint_connector._select_ckpt_path(
574 self.state.fn,
575 ckpt_path,
576 model_provided=True,
577 model_connected=self.lightning_module is not None,
578 )
--> 579 self._run(model, ckpt_path=ckpt_path)
581 assert self.state.stopped
582 self.training = False
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:986, in Trainer._run(self, model, ckpt_path)
981 self._signal_connector.register_signal_handlers()
983 # ----------------------------
984 # RUN THE TRAINER
985 # ----------------------------
--> 986 results = self._run_stage()
988 # ----------------------------
989 # POST-Training CLEAN UP
990 # ----------------------------
991 log.debug(f"{self.__class__.__name__}: trainer tearing down")
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:1028, in Trainer._run_stage(self)
1026 if self.training:
1027 with isolate_rng():
-> 1028 self._run_sanity_check()
1029 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
1030 self.fit_loop.run()
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py:1057, in Trainer._run_sanity_check(self)
1054 call._call_callback_hooks(self, "on_sanity_check_start")
1056 # run eval step
-> 1057 val_loop.run()
1059 call._call_callback_hooks(self, "on_sanity_check_end")
1061 # reset logger connector
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py:182, in _no_grad_context.<locals>._decorator(self, *args, **kwargs)
180 context_manager = torch.no_grad
181 with context_manager():
--> 182 return loop_run(self, *args, **kwargs)
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py:135, in _EvaluationLoop.run(self)
133 self.batch_progress.is_last_batch = data_fetcher.done
134 # run step hooks
--> 135 self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
136 except StopIteration:
137 # this needs to wrap the `*_step` call too (not just `next`) for `dataloader_iter` support
138 break
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py:396, in _EvaluationLoop._evaluation_step(self, batch, batch_idx, dataloader_idx, dataloader_iter)
390 hook_name = "test_step" if trainer.testing else "validation_step"
391 step_args = (
392 self._build_step_args_from_hook_kwargs(hook_kwargs, hook_name)
393 if not using_dataloader_iter
394 else (dataloader_iter,)
395 )
--> 396 output = call._call_strategy_hook(trainer, hook_name, *step_args)
398 self.batch_progress.increment_processed()
400 if using_dataloader_iter:
401 # update the hook kwargs now that the step method might have consumed the iterator
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py:311, in _call_strategy_hook(trainer, hook_name, *args, **kwargs)
308 return None
310 with trainer.profiler.profile(f"[Strategy]{trainer.strategy.__class__.__name__}.{hook_name}"):
--> 311 output = fn(*args, **kwargs)
313 # restore current_fx when nested context
314 pl_module._current_fx_name = prev_fx_name
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py:411, in Strategy.validation_step(self, *args, **kwargs)
409 if self.model != self.lightning_module:
410 return self._forward_redirection(self.model, self.lightning_module, "validation_step", *args, **kwargs)
--> 411 return self.lightning_module.validation_step(*args, **kwargs)
File ~/STREAM/stream_topic/models/abstract_helper_models/neural_basemodel.py:46, in NeuralBaseModel.validation_step(self, batch, batch_idx)
45 def validation_step(self, batch, batch_idx):
---> 46 val_loss = self.model.compute_loss(batch)
48 self.log(
49 "val_loss",
50 val_loss,
(...)
54 logger=True,
55 )
57 return val_loss
File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:215, in TNTMBase.compute_loss(self, x)
201 """
202 Computes the loss for the model.
203
(...)
212 The computed loss.
213 """
214 x_bow = x['bow']
--> 215 log_recon, posterior_mean, posterior_logvar = self.forward(x)
216 loss = self.loss_function(x_bow, log_recon, posterior_mean, posterior_logvar)
217 return loss
File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:143, in TNTMBase.forward(self, x)
124 """
125 Forward pass through the network.
126
(...)
139 The log variance of the variational posterior.
140 """
141 theta, posterior_mean, posterior_logvar = self.get_theta(x)
--> 143 log_beta = self.calc_log_beta()
147 # prodLDA vs LDA
148 # use numerical trick to compute log(beta @ theta )
149 log_theta = torch.nn.LogSoftmax(dim=-1)(theta) #calculate log theta = log_softmax(theta_hat)
File ~/STREAM/stream_topic/models/neural_base_models/tntm_base.py:112, in TNTMBase.calc_log_beta(self)
109 log_probs = torch.zeros(self.n_topics, self.vocab_size)
111 for i, dis in enumerate(normal_dis_lis):
--> 112 log_probs[i] = dis.log_prob(self.word_embeddings_projected)
113 return log_probs
File ~/anaconda3/envs/mystream/lib/python3.10/site-packages/torch/distributions/lowrank_multivariate_normal.py:214, in LowRankMultivariateNormal.log_prob(self, value)
212 if self._validate_args:
213 self._validate_sample(value)
--> 214 diff = value - self.loc
215 M = _batch_lowrank_mahalanobis(
216 self._unbroadcasted_cov_factor,
217 self._unbroadcasted_cov_diag,
218 diff,
219 self._capacitance_tril,
220 )
221 log_det = _batch_lowrank_logdet(
222 self._unbroadcasted_cov_factor,
223 self._unbroadcasted_cov_diag,
224 self._capacitance_tril,
225 )
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Finally, I tried moving both self.model and its parameters to "cuda:0", but it still reported the same error.
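From the traceback, the mismatch appears to come from two spots in tntm_base.py: word_embeddings_projected is stored as a plain tensor attribute (so model.to("cuda:0") never moves it), and calc_log_beta allocates log_probs on the CPU default device. Below is a sketch of the change I would try next; the attribute and method names are taken from the traceback above, and the snippet is untested against the actual TNTMBase:

# tntm_base.py, __init__ (the line that raises the UserWarning above).
# Registering the projected embeddings as a buffer makes them follow
# model.to(device) and also silences the copy-construct warning:
self.register_buffer(
    "word_embeddings_projected",
    word_embeddings_projected.clone().detach(),
)

# tntm_base.py, calc_log_beta (the lines shown in the traceback).
# Allocate log_probs on the same device as the embeddings instead of
# the CPU default:
log_probs = torch.zeros(
    self.n_topics, self.vocab_size,
    device=self.word_embeddings_projected.device,
)
for i, dis in enumerate(normal_dis_lis):
    log_probs[i] = dis.log_prob(self.word_embeddings_projected)
return log_probs

This assumes the LowRankMultivariateNormal parameters behind normal_dis_lis are registered as nn.Parameters and therefore already follow the module to cuda:0; if any of their loc or covariance tensors are also plain attributes, they would need the same register_buffer treatment.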