R Community Explorer

Meet Bhatnagar edited this page Feb 2, 2021 · 16 revisions

Background

The R language boasts a very large, active, and diverse global community that is the central hub of its ecosystem. While many aspects of the R ecosystem continue to grow in popularity, R may have been ranked unfairly in language rankings that do not take a comprehensive view of the ecosystem. A data-driven exploration of the R ecosystem would help users and onlookers judge trends accurately for themselves.

As more sub-communities and diverse members join the R community, and newer tools are developed, it seems important to monitor key developments around the R ecosystem in a data-driven manner. One such development is the kind of packages that are most frequently downloaded from CRAN over the years. Another is the trending R topics of interest on Twitter among #rstats users. GitHub, being very popular for hosting open source projects, could also provide insight into what R users are more interested in over time.

Related work

R Community Explorer began in 2019 as a GSoC project that focused on aggregating information about R User Groups, R-Ladies Chapters and R-GSoC projects, and on building static dashboards that render this information via interactive visualizations and data widgets. However, R Community Explorer is still far from being a complete exploration of the R community. Many aspects of the R ecosystem, such as Twitter, GitHub, CRAN, Google Code-In, and R Events, still require data-driven exploration to give a larger picture of the growth and popularity of the R language.

Details of your coding project

1a. CRAN Exploration - Develop R scripts that extract CRAN logs and R package metadata, tidy and analyze the data, store it, and produce visualizations. Track information such as: the total number of packages on CRAN; daily, monthly, and yearly downloads; the most depended-upon packages; the top 50 downloaded packages in a time span; popular package keywords; popular package authors; and the most active authors (based on package update frequency). Some visualizations: frequency of updated R packages (monthly/yearly), frequency of new R packages (monthly/yearly), a word cloud of package keywords, etc. Could package authors see a network of dependencies around their packages, and possibly see which packages would be affected if their package were removed from CRAN? Possibly connect this data to a dashboard using JavaScript.

1b. CRAN Task Views - Write R code that reads and analyzes data from CRAN Task Views. Rank task view packages based on downloads and update frequency.

  2. Twitter exploration of #rstats tweets via the Twitter API, based on the rtweet R package.

    Some coding expectations here are: extracting as many past #rstats tweets as possible, updating this archived data with the latest #rstats tweets daily, tidying and storing this data, and obtaining the frequency and average of #rstats tweets per day/week/month, trending #rstats topics per day/week/month, the most popular #rstats tweet accounts, etc. Possibly connect this data to a dashboard using JavaScript.

Please find some ideas here: https://gadenbuie.shinyapps.io/tweet-conf-dash/

  3. GitHub Exploration of R repos via the GitHub API

    Some expectations here are: obtaining the total number of R packages/repos on GitHub, the most starred repos, trending repos per day/week/month, trending developers, etc. Maintain an open log of data for all GitHub R repos for others to download and explore, since it is not easy for everyone to download this data from the GitHub API. Possibly connect this data to a dashboard using JavaScript.
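As a rough illustration of the download tracking described in item 1a, the sketch below rolls daily download counts up to monthly totals and picks the most downloaded packages. It uses base R only. The function names `monthly_downloads` and `top_packages` are hypothetical, and the toy data frame merely imitates the `date`, `count`, and `package` columns returned by `cranlogs::cran_downloads()`; real data would need to be fetched and stored separately.

```r
# Toy stand-in for daily CRAN download data (values are made up).
# In practice this could come from cranlogs::cran_downloads(), which
# returns one row per package per day with date, count, and package columns.
daily <- data.frame(
  date    = as.Date(c("2021-01-01", "2021-01-02", "2021-02-01", "2021-01-01")),
  package = c("ggplot2", "ggplot2", "ggplot2", "dplyr"),
  count   = c(10, 20, 5, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total downloads per package per calendar month.
monthly_downloads <- function(daily) {
  daily$month <- format(as.Date(daily$date), "%Y-%m")
  out <- aggregate(count ~ package + month, data = daily, FUN = sum)
  out[order(out$month, -out$count), ]
}

# Hypothetical helper: the n most downloaded packages over the whole span.
top_packages <- function(daily, n = 50) {
  totals <- aggregate(count ~ package, data = daily, FUN = sum)
  totals <- totals[order(-totals$count), ]
  head(totals, n)
}

monthly <- monthly_downloads(daily)
top     <- top_packages(daily, n = 1)
```

Yearly totals follow the same pattern with a `"%Y"` format string. The resulting data frames could then be written out for a JavaScript dashboard to read.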

Skills

R, JavaScript

Expected impact

The importance of this project lies in the fact that several key players in the R ecosystem would function better if they had insight into what R users are more interested in over time as things evolve. Players like the R Foundation, the R Consortium, and organizations that build their portfolio around R services could be better informed about the areas of most interest to R users.

Almost every user or organization interested in the R language would most likely be interested in understanding the popularity of the R language over time and in making decisions based on the decline or rise in popularity of the language. The same could apply to R Foundation sponsors and R Consortium members. This project hopes to build the framework for monitoring the popularity of R from several views by tracking CRAN activity, global R events, R user group growth, Twitter, StackOverflow, and several others.

Mentors

EVALUATING MENTOR: Ben Ubah ([email protected]) is the primary maintainer of R Community Explorer, an R package author, and a GCI mentor, and has prior GSoC experience.

SECOND MENTOR: Rick Pack ([email protected]) is a contributor to R Community Explorer, a past mentor for this project, and a data scientist at LabCorp.

THIRD MENTOR: Gergely Daroczi ([email protected]) is the author of packages such as pander, AWR, botor, and logger. He is a Director of Data Ops at System1, and the organizer of the Budapest Users of R Network, a satRday in 2016, and the European R Users Meeting 2018.

Tests

  1. Create an R function that runs searches on the GitHub API and retrieves R language repos for each month of a given year. The function should take one argument (year) and, when called, should produce .rds files containing data for the R language repos that were created in each month of that year.

Create a second function that reads and combines the data from all the .rds files produced above, and saves the result as one CSV file and one JSON file.

An example of such a system by Hadley is already available here: https://github.com/hadley/r-on-github (you may only need to adapt the code appropriately).

  2. Design a way of downloading all tweets that contain the #rstats hashtag every day and stacking them day by day without losing or duplicating tweets. You may want to consider using a continuous integration tool like Travis to run a script daily that pulls the data and joins it to the previously stored data.

  3. Make a simple plot using any of echarts.js, d3.js, or plotly.js.
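To make the first two tests more concrete, here is a minimal sketch of their non-network parts. It is not a reference solution. The function names `month_queries`, `combine_rds`, and `merge_tweets` are hypothetical, the sketch assumes the jsonlite package is installed, and the default `id_col = "status_id"` is an assumption, since the name of the tweet ID column depends on the rtweet version. Calling the GitHub and Twitter APIs themselves is left out.

```r
# Test 1, first function (partial): one GitHub search query per month of
# `year`. `language:` and `created:` are GitHub search qualifiers; splitting
# by month keeps each query's result set smaller.
month_queries <- function(year) {
  starts <- seq(as.Date(sprintf("%d-01-01", as.integer(year))),
                by = "month", length.out = 12)
  # The day before the next month's start is the last day of the month.
  ends <- seq(starts[2], by = "month", length.out = 12) - 1
  sprintf("language:R created:%s..%s", format(starts), format(ends))
}

# Test 1, second function: combine monthly .rds files into one CSV file and
# one JSON file. Assumes every .rds file holds a data frame with the same
# columns.
combine_rds <- function(rds_files, csv_path, json_path) {
  repos <- do.call(rbind, lapply(rds_files, readRDS))
  write.csv(repos, csv_path, row.names = FALSE)
  jsonlite::write_json(repos, json_path)
  invisible(repos)
}

# Test 2: append a daily pull to the stored archive, keeping one row per
# tweet ID. IDs are best kept as character strings, because tweet IDs are
# too large to store exactly as doubles.
merge_tweets <- function(archive, new_tweets, id_col = "status_id") {
  combined <- rbind(archive, new_tweets)
  combined[!duplicated(combined[[id_col]]), , drop = FALSE]
}
```

Because `merge_tweets` drops repeated IDs, a daily pull can safely overlap the previous one: re-running it with the same data leaves the archive unchanged, which suits a scheduled job that may occasionally run twice.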

Solutions of tests

Students, please write your name here, and send a link to your solution via email to avoid plagiarism.

  1. Name - Meet Bhatnagar
  2. Name - Anany Sharma, Test 1, Test 2, Test 3, Link to cumulative repository.