The dataset used in this project is taken from Kaggle and contains synthetic, Creative Commons publicly licensed data. The scenario and business requirements described in this project are fictional and are created solely for the purpose of this analysis.
The data consists of 2007 rows and 13 columns:
PetID: Unique identifier for each pet
PetType: Type of pet (e.g., Dog, Cat, Bird, Rabbit)
Breed: Specific breed of the pet
AgeMonths: Age of the pet in months
Color: Color of the pet
Size: Size category of the pet (Small, Medium, Large)
WeightKg: Weight of the pet in kilograms
Vaccinated: Vaccination status of the pet (0 - Not vaccinated, 1 - Vaccinated)
HealthCondition: Health condition of the pet (0 - Healthy, 1 - Medical condition)
TimeInShelterDays: Duration the pet has been in the shelter (days)
AdoptionFee: Adoption fee charged for the pet (in dollars)
PreviousOwner: Whether the pet had a previous owner (0 - No, 1 - Yes)
AdoptionLikelihood: Likelihood of the pet being adopted (0 - Unlikely, 1 - Likely)
The manager of an animal rescue centre in Indiana has requested an analysis of the centre’s existing adoption data to better understand which factors influence an animal’s likelihood of being adopted. This project will explore trends within the dataset and identify the characteristics most strongly associated with successful adoptions.
Effective prediction of adoption likelihood would help the rescue centre prioritise animal care, optimise its resources and improve operational planning. If the project outcomes demonstrate clear value, the approach will be expanded to include data from additional rescue centres across the state.
The rescue centre is one of many in a group of centres under the same company umbrella. They have requested that, where possible, the brand colour lilac be used in the dashboard. This will help if the dashboard is presented to management and trustees.
Four hypotheses will be explored, first with simple visualisations in Jupyter notebooks and then with statistical tests.
- H1: Younger animals are more likely to get adopted
  - Null hypothesis: Age has no effect on the likelihood of an animal being adopted
  - Alternative hypothesis: Younger animals are more likely to be adopted
- H2: Vaccinated animals are more likely to get adopted
  - Null hypothesis: Vaccination has no effect on the likelihood of an animal being adopted
  - Alternative hypothesis: Vaccinated animals are more likely to be adopted
- H3: Larger animals are less likely to get adopted
  - Null hypothesis: Size has no effect on the likelihood of an animal being adopted
  - Alternative hypothesis: Size does have an effect on the likelihood of an animal being adopted
- H4: Some types of animals are more popular than others
  - Null hypothesis: Pet type has no effect on the likelihood of an animal being adopted
  - Alternative hypothesis: Pet type has an effect on the likelihood of adoption
Train a machine learning model to:
- Select which variables are most useful in predicting adoption outcomes
- Predict whether new animals are likely to be adopted
- Ideation: choose a dataset and ideate business requirements.
- Extract: extract the chosen animal adoption dataset from Kaggle.
- Load: load and save the CSV file via Pandas.
- Transform: clean and process the data using Pandas, adding new columns and checking for missing or duplicated values. Save to a new file.
- Visualise: create charts with Matplotlib, Seaborn and Plotly to visualise the data and check for outliers. Test the hypotheses in separate Jupyter notebooks.
- Dashboard: create a Power BI dashboard to provide interactive insights.
- Analyse: interpret what the dashboard visualisations display, with plentiful comments in the notebooks.
- Document: record findings and conclusions in the README and dashboard.
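The Load and Transform steps above could be sketched roughly as follows; the file names shown in the comments are assumptions for illustration, not the project's actual paths:

```python
import pandas as pd

def load_and_clean(path: str) -> pd.DataFrame:
    """Load the raw CSV, report missing and duplicated values,
    then return a de-duplicated copy ready for analysis."""
    df = pd.read_csv(path)
    print("Missing values per column:\n", df.isnull().sum())
    print("Duplicated rows:", df.duplicated().sum())
    return df.drop_duplicates().reset_index(drop=True)

# Example usage (assumed file names):
# df = load_and_clean("pet_adoption.csv")
# df.to_csv("pet_adoption_clean.csv", index=False)
```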
- Use CRISP-DM: the Cross Industry Standard Process for Data Mining.
- Use the Agile approach: small iterations with constant evaluation.
- Create a thorough and effective Kanban board to keep on track with tasks.
| Hypothesis | Visualisations | Statistical Test |
|---|---|---|
| H1 - Age | Seaborn boxplot: Age vs AdoptionLikelihood | Mann-Whitney U Test |
| H2 - Vaccination | Seaborn countplot: Vaccinated vs Adoption | Chi-Squared Test |
| H3 - Size | Plotly pie charts: Pet Types | Chi-Squared Test |
| H4 - Type | Plotly bar, sns countplot: Type vs Adoption | Chi-Squared Test |
| ML model | Decision Tree | OneHotEncoder/StandardScaler |
A heatmap created with Seaborn helped to clarify which variables are of interest.
- Basic probability, such as independence testing and distribution analysis, is used to understand how the variables affect adoption outcomes. Python is used to compute these probabilities directly from the data and to visualise the underlying distributions. The specific reasoning behind the choice of statistical tests is set out below.
- In the exploratory data analysis a heatmap was created using Seaborn to visualise the correlations between variables. The first heatmap was too large and confusing, so a second heatmap was created with variables of interest.
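As a minimal sketch, such a focused heatmap can be produced with Seaborn like this; the function name and the example column list are illustrative assumptions based on the data dictionary above:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

def plot_corr_heatmap(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Draw a Seaborn heatmap of the correlations between selected
    numeric columns and return the correlation matrix."""
    corr = df[cols].corr()
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="Purples", vmin=-1, vmax=1)
    plt.title("Correlation heatmap of variables of interest")
    plt.tight_layout()
    return corr

# Example with columns from the data dictionary above:
# plot_corr_heatmap(df, ["AgeMonths", "WeightKg", "Vaccinated",
#                        "HealthCondition", "TimeInShelterDays",
#                        "AdoptionLikelihood"])
```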
Hypothesis 1 - Age:
Hypothesis 2 - Vaccination:
Hypothesis 3 - Pet Size:
Hypothesis 4 - Pet Type:
- H1: The data is not normally distributed and the observations are independent (one row per pet, no time series data). With one categorical variable and one continuous variable, a Mann-Whitney U Test was used.
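A sketch of this test using scipy.stats (the column names are assumptions taken from the data dictionary; the one-sided alternative matches H1's direction):

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def age_adoption_test(df: pd.DataFrame, alpha: float = 0.05):
    """Mann-Whitney U test for H1: adopted animals tend to be younger.
    alternative='less' means adopted ages are stochastically lower."""
    adopted = df.loc[df["AdoptionLikelihood"] == 1, "AgeMonths"]
    not_adopted = df.loc[df["AdoptionLikelihood"] == 0, "AgeMonths"]
    stat, p = mannwhitneyu(adopted, not_adopted, alternative="less")
    return stat, p, p < alpha  # True = reject the null hypothesis
```

With a very strong effect the returned p-value can be vanishingly small, which is consistent with the "p-value of 0.0" noted later in this README.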
- H2, H3 and H4: The data is not normally distributed and the observations are independent; both variables are categorical, so a Chi-Squared Test was used.
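A sketch of a Chi-Squared test of independence built from a contingency table, again with assumed column names (the project itself may use pingouin's equivalent):

```python
import pandas as pd
from scipy.stats import chi2_contingency

def adoption_chi_squared(df: pd.DataFrame, col: str,
                         target: str = "AdoptionLikelihood"):
    """Chi-Squared test of independence between a categorical column
    (e.g. Vaccinated, Size or PetType) and the adoption outcome."""
    table = pd.crosstab(df[col], df[target])
    chi2, p, dof, _expected = chi2_contingency(table)
    return chi2, p, dof
```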
- The Pearson statistic was chosen for the p-value during the Chi-Squared testing in Hypothesis 2 because it is the most widely accepted measure for detecting general associations between two categorical variables.
- Why a decision tree classifier? The target variable to be predicted is a binary output and therefore a classification model is appropriate. A decision tree was chosen because it is easy to interpret the outputs and it can handle both numerical and categorical features. A decision tree is suitable for identifying the most influential variables in predicting adoption outcomes.
- The categorical data was encoded ready for the ML model using the straightforward OneHotEncoder.
- The data was scaled using the StandardScaler to ensure that all features were on a comparable scale, which prevents features with larger numerical ranges from dominating the training process.
- When testing H1 the p-value initially came out as 0.0. This is highly unusual, so ChatGPT was consulted to find an alternative way of conducting the Mann-Whitney U Test with scipy.stats that would give a more precise result. The p-value was indeed genuine, just incredibly tiny.
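The encoding, scaling and decision-tree steps described above can be combined into a single scikit-learn pipeline. This is a sketch under the assumption that the feature lists below match the cleaned dataset; it is not the project's exact notebook code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed feature lists; adjust to the actual cleaned dataset.
CATEGORICAL = ["PetType", "Breed", "Color", "Size"]
NUMERIC = ["AgeMonths", "WeightKg", "TimeInShelterDays"]

def build_pipeline() -> Pipeline:
    """One-hot encode the categorical features, scale the numeric ones,
    then fit a decision tree on the combined matrix."""
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ("num", StandardScaler(), NUMERIC),
    ])
    return Pipeline([
        ("preprocess", pre),
        ("model", DecisionTreeClassifier(random_state=42)),
    ])

# Usage: pipe = build_pipeline(); pipe.fit(X_train, y_train)
```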
- Unfortunately in this synthetic dataset some of the variables could not be used as they were meaningless.
- The adoption fee column was simply all of the numbers from 1 to 499 listed, so it was not used.
- The weights of some of the animals made no sense; rabbits are not generally over 2.5kg and some of them in the data were over 20kg. This wasn't a huge problem as there is also a categorical Size column.
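The cleaning decisions above might look like the following sketch; the helper names and the 2.5 kg threshold for rabbits are illustrative assumptions:

```python
import pandas as pd

def drop_unusable_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns found to be meaningless in this synthetic dataset:
    AdoptionFee is just a number sequence and PetID is an identifier."""
    return df.drop(columns=["AdoptionFee", "PetID"], errors="ignore")

def implausible_rabbit_weights(df: pd.DataFrame, max_kg: float = 2.5) -> pd.DataFrame:
    """Flag rabbits whose recorded weight exceeds a plausible maximum,
    so the WeightKg column can be reviewed before use."""
    return df[(df["PetType"] == "Rabbit") & (df["WeightKg"] > max_kg)]
```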
Generative AI tools such as ChatGPT and GitHub Copilot played a valuable role throughout this project. ChatGPT helped refine the project outline, business requirements, and data ethics section by turning initial ideas and bullet point prompts into professional content. It also supported the development of the machine learning model by guiding the order of steps, evaluation methods, and feature extraction.
Copilot was useful for troubleshooting specific coding issues in VSCode, such as fixing the WordCloud banner and resolving difficulties accessing the data file. It also suggested clear variable names and encouraged better code commenting.
Throughout the process I learned the importance of asking precise questions; vague prompts can lead to unhelpful or confusing suggestions. Overall, generative AI acted like a collaborative partner: offering guidance, improving clarity, and helping maintain momentum when confidence dipped.
- The data is available publicly on Kaggle, with a Creative Commons license (please see the Credits > Content section below).
- Provenance/ Dataset Description from Kaggle:
- "The Pet Adoption Dataset provides a comprehensive look into various factors that can influence the likelihood of a pet being adopted from a shelter. This dataset includes detailed information about pets available for adoption, covering various characteristics and attributes."
- "This dataset is synthetic and was generated for educational purposes, making it ideal for data science and machine learning projects. It is an original dataset, owned by Mr. Rabie El Kharoua, and has not been previously shared. You are free to use it under the license outlined on the data card. The dataset is offered without any guarantees."
- There is no information present in the data that could identify a specific animal or person. A PetID column was provided in the data; a decision was made to remove it.
- If this were real data, it would be necessary to inform the person adopting the pet about the data that is stored and how it will be used to help the shelter and future rescue animals, adhering to GDPR and following the guidelines of the EU AI Act.
- Please refer to the Data Ethics section of the Power BI dashboard.
- The Power BI dashboard is saved in the "dashboard" folder here in the repository.
- The initial wireframe drawing is also saved as an image .png file in the dashboard folder.
- Screenshots of each page are saved to the dashboard folder.
Dashboard pages:
Main Page: for non-technical audiences. Storytelling via visuals. The data can be explored through the use of four slicers.
- Slicers: size of pet, previous owner, adoption and vaccinated
- Cards showing: total number of animals, total number of adoptions, average number of days in the shelter and average animal age. These update depending on which slicer is selected or which graph is highlighted.
- Top left visual: scatter plot: Adoption Likelihood by Age
- Top right visual: clustered column chart: Distribution of Health Condition within Pet Types
- Bottom left visual: clustered column chart: Distribution of Size within Pet Types
- Bottom right visual: pie chart: Distribution of Pet Types
Tree Map Page: for non-technical audiences. Tree map of Pet Type > Breed > Colour to explore their relationships. With a key and explanation at the bottom. There is a card to show the total number of animals shown on the current tree map.
Data Ethics and Governance Page: for technical audiences. The text was produced with the help of generative AI: detailed prompts were provided to ChatGPT, with many refinements to reach the final, professional outcome.
Conclusions Page: for technical audiences. Business Requirements and Conclusions section. Data source included.
- User testing suggested it would be a good idea to add my name and the Code Institute logo to the dashboard. Extra tooltips were added to the visuals on the main page.
- The information was split into four separate pages to keep the main visuals on one page and clear. The explanations were kept together for the technical audience.
- For the non-technical audience, examples of how to use the slicers sit beneath them on the bottom right of the main page.
- Power BI is a very useful tool: if more data becomes available, as long as it is in the same columns and format, it could simply be added to the existing data and all of the visuals would update.
Hypothesis 1: The alternative hypothesis is supported: younger animals are more likely to be adopted.
Hypothesis 2: The alternative hypothesis is supported: vaccinated animals are more likely to be adopted.
- Hypothesis 2a: There is no correlation between Vaccination and Health Condition. This may warrant further investigation.
Hypothesis 3: The alternative hypothesis is supported: size does have an effect on the likelihood of adoption.
Hypothesis 4: The alternative hypothesis is supported: the type of pet does have an effect on the likelihood of adoption.
Predictive modelling using a Decision Tree Classifier shows that the most important features are: Size (Medium), Age, Vaccination, Health Condition and Breed (Labrador).
Class 0: Unlikely Adoption
- Precision = 0.92: when the model predicts unlikely adoption, it is correct 92% of the time
- Recall = 0.93: the model catches 93% of all true unlikely-adoption cases
- F1 = 0.93: strong overall performance

This class performs very well, probably because it has more samples (270 vs 132).
Class 1: Likely Adoption
- Precision = 0.86: when the model predicts likely adoption, it is correct 86% of the time
- Recall = 0.84: the model catches 84% of the animals actually likely to be adopted
- F1 = 0.85: good overall performance

Performance is good but not as strong as for the unlikely adoption class, which is probably due to class imbalance (132 vs 270).
18 false positives (the model predicts an animal is likely to be adopted when it was not).
21 false negatives (the model misses an animal that actually was likely to be adopted).
The number of false negatives would need to be discussed with management and likely reduced: since the aim is to identify animals likely to be adopted, missing actual adoptions may be costly.
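Figures like those above come from scikit-learn's standard evaluation tools. A sketch of how such numbers can be reproduced (the variable names are illustrative, not the project's notebook code):

```python
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(y_true, y_pred):
    """Print per-class precision/recall/F1 and return the confusion-matrix
    counts, so false negatives (missed likely adoptions) can be inspected."""
    print(classification_report(y_true, y_pred,
                                target_names=["Unlikely", "Likely"]))
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"false_positives": int(fp), "false_negatives": int(fn)}

# Usage: evaluate(y_test, pipe.predict(X_test))
```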
The model can predict for the rescue centre whether a new animal is likely to be adopted (provided no unforeseen variables are used).
In reality, datasets like these could be used to help rescue centres, as with this organisation in the US: https://www.shelteranimalscount.org/
To take this further and build a more accurate predictive model, it would be useful to have data on whether the animals are neutered and whether a profile of them exists (no actual details about the animal, for GDPR reasons; just the existence of a profile with a name, photo or description).
The manager of the animal rescue centre in Indiana now has:
- A data-driven understanding of the factors that influence adoption outcomes, as visualised in the dashboard.
- A working predictive model for identifying animals with higher or lower adoption likelihood.
- Recommendations for future data collection and process improvements.
- A foundation for scaling the analysis to other rescue centres across the state.
- ipykernel needed to be installed to ensure that the notebooks use the virtual environment where the packages are installed. Pip was also upgraded to ensure everything ran smoothly.
- In the notebook 06_hypothesis_3_and_4, in cell 14 of 28, a FutureWarning is raised: the code that converts the variables to strings for use in Plotly will produce an error after a future update.
- The Code Institute Data Analytics template was cloned from GitHub and the following Python libraries were added to the requirements.txt file: wordcloud, pingouin and nbformat.
- In notebook 02_eda_visuals I had some trouble creating a heatmap. There are so many variables in the correlation that it was difficult to glean any information. I tried a few times, with the aid of generative AI, to manipulate the heatmap using the original correlated dataframe. However, it was taking a long time and I didn't want to get bogged down so early on. Therefore I decided to create a simpler correlated dataframe with just the variables I was interested in. This made for a clearer and more useful heatmap. I now know that correlation tables are something I need to learn in more depth.
- In order to extract the most important features from the machine learning classification model, I had to rely heavily on code from the Code Institute's teachings and help from ChatGPT. Evaluating the machine learning model is something I really need to go back over and understand more thoroughly.
- Different hyperparameters were not tested in this project. That will be something I try out in a personal project after the course.
- The next logical step in Power BI would be to use DAX to create new measures to take a deeper dive into the data and discover further relationships between the variables.
- Please fork this repo and explore the data. Once your virtual environment is created, please use the requirements.txt file to install the required Python libraries.
- The Power BI dashboard is saved in the "dashboard" folder here in the repository. Download the file to open it.
These are the main python libraries used for the data analysis, in alphabetical order:
Feature-engine – For feature engineering in the ML model
Joblib – For saving and loading the ML model
Matplotlib – For creating overall multi-chart layouts
NumPy – For numerical operations and data transformation
Pandas – For data loading, transformation, and cleaning
Pingouin – For statistical tests
Plotly – For building interactive charts
SciPy – For statistical tests
Scikit-learn – For machine learning algorithms
Seaborn – For generating many of the individual charts
- Hindsight is a wonderful thing: in Power BI it would have been useful to still have the PetID column for creating visuals.
- Leading up to the Christmas break is a very difficult and busy time of year with lots of distractions. The pomodoro technique was useful, along with lots of planning, and the need to be adaptable.
- As this project has progressed I've been increasingly pleased that I chose this dataset as the adoption cause is a good one and it interests me. It's a shame that the dataset is synthetic and falls down in some areas.
- This article https://articles.hepper.com/pet-adoption-statistics-uk/ highlights the sheer number of animals in rescue centres in the UK. In reality, analysis of adoption data is hugely important for understanding why and how animals are adopted and become pets. Sadly, in rare cases rescue centres have no option left but euthanasia, so it is crucial to get as many animals as possible into new homes.
- This dataset only addresses the second part of the story. The reasons animals end up in rescue centres are a whole other set of variables.
- During the initial exploration of datasets and the ideation phase I took inspiration from the Related Notebooks section of Kaggle. I am very grateful to those people who shared their projects for others to view. I have upvoted and left positive feedback.
- Template: https://github.com/Code-Institute-Org/data-analytics-template
- Data sourced from Kaggle: https://www.kaggle.com/datasets/rabieelkharoua/predict-pet-adoption-status-dataset/data
- The data is shared under the Creative Commons Licence: CC BY 4.0 International
- https://doi.org/10.34740/kaggle/ds/5242440
- For the initial retrieval of data and EDA, I reused code from my previous two projects on GitHub.
- In notebook 03_hypothesis_1.ipynb the definitions of alpha and p-value were taken from the Code Institute's Learning Management System, from the Foundational Data Analysis Techniques section.
- In the machine learning model, notebook 05_mlearning.ipynb the Code Institute's teaching was used heavily, along with generative AI, to help extract the most important features learnt from the Classification model.
- Photo at the bottom of the readme file Rabbit Vectors by Vecteezy
- I would like to say a huge thank you to my Tutor and Data Coaches at Code Institute for their teaching, advice and support.
- I am grateful to my fellow September 2025 cohort: for the help and the laughs.
- A final thank you to generative AI (Copilot and ChatGPT) for assistance and suggestions when I needed a little nudge in the right direction or to clarify an idea.