A Python notebook that analyzes web content performance by counting English words on web pages and correlating with scroll behavior data from Google Analytics.
This tool fetches URLs from a Google Sheet containing content performance data (including scroll percentages), scrapes each web page to count English words, and writes the results back to the Google Sheet. It's designed to help content creators understand the relationship between content length and user engagement.
- Scrapes web pages and counts English words using a comprehensive dictionary
- Integrates with Google Sheets for data input/output
- Handles multiple scroll percentage thresholds (75%, 90%, etc.)
- Cleans and processes HTML content to extract meaningful text
- Automatically creates output sheets with combined analytics and word count data
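The threshold handling above can be sketched with pandas. This is an illustrative example only; the column names follow the sheet layout described below, and the rows here are made up:

```python
import pandas as pd

# Hypothetical sample rows mimicking the Google Sheet layout
df = pd.DataFrame({
    "page_location": ["https://example.com/a", "https://example.com/a",
                      "https://example.com/b"],
    "percent_scrolled": [75, 90, 75],
    "scrolls": [2339, 1514, 812],
})

# Keep only one scroll-depth threshold before correlating with word counts
deep_reads = df[df["percent_scrolled"] == 90]
print(deep_reads["scrolls"].sum())  # → 1514
```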
- Python 3.7+
- Google Cloud Platform account with Sheets API enabled
- Service account credentials (JSON file)
- Access to Google Sheets containing your analytics data
- Clone this repository:

  ```
  git clone https://github.com/jaymurphy1997/content-performance-word-count.git
  cd content-performance-word-count
  ```

- Install required packages:

  ```
  pip install -r requirements.txt
  ```

- Set up Google Sheets API:
  - Create a service account in Google Cloud Console
  - Enable the Google Sheets API and Google Drive API
  - Download the credentials JSON file
  - Rename it to `credentials.json` and place it in the project root
- Prepare your Google Sheet with columns:
  - `page_location`: URLs to analyze
  - `page_title`: Page titles
  - `percent_scrolled`: Scroll threshold (75, 90, etc.)
  - `scrolls`: Number of scroll events
- Update the notebook with your:
  - Google Sheet URL
  - Credentials file path
- Run the notebook:

  ```
  jupyter notebook notebooks/content_performance_word_count.ipynb
  ```

Change the Google Sheet URL:

```python
sheet_url = 'your_google_sheet_url_here'
```

Modify word counting logic:
The `return_words()` function can be customized to change how words are counted or filtered.

Adjust text cleaning:
Modify the regex pattern in `return_words()` to change how text is cleaned before word counting.
- `page_location`: Full URL of the page to analyze
- `page_title`: Title of the page
- `percent_scrolled`: Scroll depth threshold (75, 90, etc.)
- `scrolls`: Number of users who scrolled to this depth
All input columns plus:
- `words`: Count of English words found on the page
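Conceptually, the output is the input table with a `words` column joined on URL. A hedged sketch with pandas (the `word_counts` mapping here is hard-coded, where the notebook would produce it by scraping each URL):

```python
import pandas as pd

# Illustrative analytics rows in the input-sheet shape
analytics = pd.DataFrame({
    "page_location": ["https://example.com/a", "https://example.com/b"],
    "percent_scrolled": [75, 75],
    "scrolls": [2339, 812],
})

# In the real pipeline these counts come from scraping each URL
word_counts = {"https://example.com/a": 724, "https://example.com/b": 310}

# Join the counts onto the analytics rows by URL
analytics["words"] = analytics["page_location"].map(word_counts)
print(analytics.columns.tolist())
```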
- Dictionary Loading: Downloads a comprehensive English word list from GitHub
- Web Scraping: Uses BeautifulSoup to extract text content from each URL
- Text Processing: Cleans HTML, removes punctuation, converts to lowercase
- Word Counting: Counts only words that exist in the English dictionary
- Data Export: Writes results to a new 'output' sheet in your Google Spreadsheet
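The export step can be sketched with gspread. This is a simplified illustration, not the notebook's exact code: `frame_to_rows()` and `export_to_sheet()` are hypothetical helper names, and the export itself requires a valid `credentials.json`, network access, and no pre-existing 'output' worksheet:

```python
import pandas as pd

def frame_to_rows(df):
    """Convert a DataFrame to the list-of-lists shape gspread expects:
    a header row followed by one row per record."""
    return [df.columns.tolist()] + df.astype(object).values.tolist()

def export_to_sheet(df, sheet_url):
    """Write the results to a new 'output' worksheet (assumed flow).
    Fails if an 'output' sheet already exists; delete it first."""
    import gspread
    gc = gspread.service_account(filename="credentials.json")
    sh = gc.open_by_url(sheet_url)
    ws = sh.add_worksheet(title="output", rows=len(df) + 1,
                          cols=len(df.columns))
    ws.update(frame_to_rows(df))

demo = pd.DataFrame({"page_location": ["https://example.com/a"], "words": [724]})
print(frame_to_rows(demo))
```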
Input:

```
page_location: https://en.wikipedia.org/wiki/Ontario_Motor_Speedway
page_title: Ontario Motor Speedway - Wikipedia
percent_scrolled: 75
scrolls: 2339
```

Output:

```
page_location: https://en.wikipedia.org/wiki/Ontario_Motor_Speedway
page_title: Ontario Motor Speedway - Wikipedia
percent_scrolled: 75
scrolls: 2339
words: 724
```
- Keep your `credentials.json` file secure and never commit it to version control
- The notebook includes the credentials filename in the code; update this path as needed
- Consider using environment variables for sensitive configuration
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Create a Pull Request
Common Issues:
- Authentication Error: Ensure your service account has access to the Google Sheet
- URL Access Error: Some pages may block scraping - these will return 0 words
- Memory Issues: For large datasets, consider processing in batches
- Rate Limiting: The script doesn't include delays between requests; add `time.sleep()` calls if you encounter rate limiting.
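One minimal way to add that delay is a small wrapper around the fetch call. This is a sketch, not the notebook's code; `throttled` is a hypothetical helper, and the one-second default is an arbitrary choice:

```python
import time

def throttled(fetch, delay=1.0):
    """Wrap a fetch function so every call sleeps first,
    spacing out consecutive requests."""
    def wrapper(url):
        time.sleep(delay)
        return fetch(url)
    return wrapper

# Intended usage with requests (assumed installed):
#   import requests
#   get = throttled(requests.get, delay=1.0)
#   resp = get("https://example.com")

# Demonstrate the spacing with a stub fetcher instead of a real request
start = time.monotonic()
get = throttled(lambda url: url, delay=0.05)
get("a"); get("b")
elapsed = time.monotonic() - start
print(elapsed)
```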
This project is licensed under the MIT License - see the LICENSE file for details.
- Uses the english-words dictionary
- Built with BeautifulSoup, pandas, and gspread