A Python notebook that analyzes web content performance by counting English words on web pages and correlating with scroll behavior data from Google Analytics.
This tool fetches URLs from a Google Sheet containing content performance data (including scroll percentages), scrapes each web page to count English words, and writes the results back to the Google Sheet. It's designed to help content creators understand the relationship between content length and user engagement.
- Scrapes web pages and counts English words using a comprehensive dictionary
- Integrates with Google Sheets for data input/output
- Handles multiple scroll percentage thresholds (75%, 90%, etc.)
- Cleans and processes HTML content to extract meaningful text
- Automatically creates output sheets with combined analytics and word count data
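The threshold handling above can be sketched with pandas. This is an illustrative example only; the column names follow the sheet layout described below, and the rows here are made up:

```python
import pandas as pd

# Hypothetical sample rows mimicking the Google Sheet layout
df = pd.DataFrame({
    "page_location": ["https://example.com/a", "https://example.com/a",
                      "https://example.com/b"],
    "percent_scrolled": [75, 90, 75],
    "scrolls": [2339, 1514, 812],
})

# Keep only one scroll-depth threshold before correlating with word counts
deep_reads = df[df["percent_scrolled"] == 90]
print(deep_reads["scrolls"].sum())  # → 1514
```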
- Python 3.7+
- Google Cloud Platform account with Sheets API enabled
- Service account credentials (JSON file)
- Access to Google Sheets containing your analytics data
- Clone this repository:

  ```
  git clone https://github.com/jaymurphy1997/content-performance-word-count.git
  cd content-performance-word-count
  ```

- Install required packages:

  ```
  pip install -r requirements.txt
  ```

- Set up Google Sheets API:
  - Create a service account in Google Cloud Console
  - Enable the Google Sheets API and Google Drive API
  - Download the credentials JSON file
  - Rename it to `credentials.json` and place it in the project root
- Prepare your Google Sheet with columns:
  - `page_location`: URLs to analyze
  - `page_title`: Page titles
  - `percent_scrolled`: Scroll threshold (75, 90, etc.)
  - `scrolls`: Number of scroll events
- Update the notebook with your:
  - Google Sheet URL
  - Credentials file path
- Run the notebook:

  ```
  jupyter notebook notebooks/content_performance_word_count.ipynb
  ```

Change the Google Sheet URL:

```python
sheet_url = 'your_google_sheet_url_here'
```

Modify word counting logic:
The `return_words()` function can be customized to change how words are counted or filtered.

Adjust text cleaning:
Modify the regex pattern in `return_words()` to change how text is cleaned before word counting.
- `page_location`: Full URL of the page to analyze
- `page_title`: Title of the page
- `percent_scrolled`: Scroll depth threshold (75, 90, etc.)
- `scrolls`: Number of users who scrolled to this depth
All input columns plus:
- `words`: Count of English words found on the page
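Conceptually, the output is the input table with a `words` column joined on URL. A hedged sketch with pandas (the `word_counts` mapping here is hard-coded, where the notebook would produce it by scraping each URL):

```python
import pandas as pd

# Illustrative analytics rows in the input-sheet shape
analytics = pd.DataFrame({
    "page_location": ["https://example.com/a", "https://example.com/b"],
    "percent_scrolled": [75, 75],
    "scrolls": [2339, 812],
})

# In the real pipeline these counts come from scraping each URL
word_counts = {"https://example.com/a": 724, "https://example.com/b": 310}

# Join the counts onto the analytics rows by URL
analytics["words"] = analytics["page_location"].map(word_counts)
print(analytics.columns.tolist())
```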
- Dictionary Loading: Downloads a comprehensive English word list from GitHub
- Web Scraping: Uses BeautifulSoup to extract text content from each URL
- Text Processing: Cleans HTML, removes punctuation, converts to lowercase
- Word Counting: Counts only words that exist in the English dictionary
- Data Export: Writes results to a new 'output' sheet in your Google Spreadsheet
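The export step can be sketched with gspread. This is a simplified illustration, not the notebook's exact code: `frame_to_rows()` and `export_to_sheet()` are hypothetical helper names, and the export itself requires a valid `credentials.json`, network access, and no pre-existing 'output' worksheet:

```python
import pandas as pd

def frame_to_rows(df):
    """Convert a DataFrame to the list-of-lists shape gspread expects:
    a header row followed by one row per record."""
    return [df.columns.tolist()] + df.astype(object).values.tolist()

def export_to_sheet(df, sheet_url):
    """Write the results to a new 'output' worksheet (assumed flow).
    Fails if an 'output' sheet already exists; delete it first."""
    import gspread
    gc = gspread.service_account(filename="credentials.json")
    sh = gc.open_by_url(sheet_url)
    ws = sh.add_worksheet(title="output", rows=len(df) + 1,
                          cols=len(df.columns))
    ws.update(frame_to_rows(df))

demo = pd.DataFrame({"page_location": ["https://example.com/a"], "words": [724]})
print(frame_to_rows(demo))
```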
Input:

```
page_location: https://en.wikipedia.org/wiki/Ontario_Motor_Speedway
page_title: Ontario Motor Speedway - Wikipedia
percent_scrolled: 75
scrolls: 2339
```

Output:

```
page_location: https://en.wikipedia.org/wiki/Ontario_Motor_Speedway
page_title: Ontario Motor Speedway - Wikipedia
percent_scrolled: 75
scrolls: 2339
words: 724
```
- Keep your `credentials.json` file secure and never commit it to version control
- The notebook includes the credentials filename in the code; update this path as needed
- Consider using environment variables for sensitive configuration
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Create a Pull Request
Common Issues:
- Authentication Error: Ensure your service account has access to the Google Sheet
- URL Access Error: Some pages may block scraping - these will return 0 words
- Memory Issues: For large datasets, consider processing in batches
- Rate Limiting: The script doesn't include delays between requests; add `time.sleep()` calls if you encounter rate limiting.
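One minimal way to add that delay is a small wrapper around the fetch call. This is a sketch, not the notebook's code; `throttled` is a hypothetical helper, and the one-second default is an arbitrary choice:

```python
import time

def throttled(fetch, delay=1.0):
    """Wrap a fetch function so every call sleeps first,
    spacing out consecutive requests."""
    def wrapper(url):
        time.sleep(delay)
        return fetch(url)
    return wrapper

# Intended usage with requests (assumed installed):
#   import requests
#   get = throttled(requests.get, delay=1.0)
#   resp = get("https://example.com")

# Demonstrate the spacing with a stub fetcher instead of a real request
start = time.monotonic()
get = throttled(lambda url: url, delay=0.05)
get("a"); get("b")
elapsed = time.monotonic() - start
print(elapsed)
```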
This project is licensed under the MIT License - see the LICENSE file for details.
- Uses the english-words dictionary
- Built with BeautifulSoup, pandas, and gspread