Mining software repositories is related to both data mining and reverse engineering. Source control repositories, bug repositories, archived communications, deployment logs, and code repositories are examples of software repositories that are commonly available for most software projects. The Mining Software Repositories (MSR) field analyzes and cross-links the rich data available in these repositories to uncover interesting and actionable information about software systems. By transforming these repositories from static record-keeping ones into active repositories, we can guide decision processes in modern software projects. For example, data in source control repositories, traditionally used to archive code, could be linked with data in bug repositories to help practitioners propagate complex changes and to warn them about risky code based on prior changes and bugs. (Ahmed E. Hassan, 2008)

In this lab you will mine software repositories, then extract and analyze some of the software artifacts archived in them.

Material and Tools

Session slides here.
You can use the following GitHub repos to run the lab experiments:
- scottyab/rootbeer
- Skyscanner/backpack (try it from home, since it has a very large number of commits)
- mendhak/gpslogger (try it from home, since it has a very large number of commits)
- thundernest/k-9 (try it from home, since it has a very large number of commits)
Resources on how to obtain data from GitHub:
- GitHub API
- How to create a GitHub token. Alternatively, use Generate Token if you do not want to read the details.

Steps/Preparations

The following tasks guide you through the mining software repositories lab. In the class slides, we looked at three ways to get data from GitHub: cloning the repos using Git, using GHTorrent, and using the GitHub API. In this lab we get GitHub data using the GitHub API.

Task 1

Create one or more GitHub tokens by following the linked tutorial. Each token corresponds to a GitHub account.
Fork my repo johnxu21/msrLab. Thereafter, clone the fork onto your laptop to have a local copy of the source code.
Browse the src folder and rename the file CollectFiles.py to <your-names>_CollectFiles.py.
Replace the fake tokens in the code with your own token.
Thereafter, run the file <your-names>_CollectFiles.py and look at the output. The code collects all the files in a repo, together with the number of times each file has been touched throughout its lifetime.
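The counting step can be sketched as follows. This is a minimal illustration, not the actual contents of CollectFiles.py: the function names `auth_headers` and `count_file_touches` are hypothetical, and the input is assumed to be a list of commit objects shaped like the response of the GitHub REST API's single-commit endpoint, where each commit carries a `files` list whose entries have a `filename` field.

```python
from collections import Counter


def auth_headers(token):
    """Build request headers that authenticate with a personal access token."""
    return {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }


def count_file_touches(commits):
    """Count how many commits touched each file.

    `commits` is assumed to be an iterable of dicts shaped like the
    response of GET /repos/{owner}/{repo}/commits/{sha}.
    """
    touches = Counter()
    for commit in commits:
        # Commits without a "files" key contribute nothing instead of failing.
        for changed in commit.get("files", []):
            touches[changed["filename"]] += 1
    return dict(touches)


if __name__ == "__main__":
    # Hand-written sample data standing in for real API responses.
    sample = [
        {"files": [{"filename": "app/Main.java"}, {"filename": "README.md"}]},
        {"files": [{"filename": "app/Main.java"}]},
    ]
    for name, count in sorted(count_file_touches(sample).items()):
        print(name, count)
```

Keeping the counting logic separate from the network calls makes it possible to check the result on a few hand-written commits before spending API quota on a real repository.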

Task 2

A repository contains both source files and other files, such as configuration files. Developers spend most of their time changing source files for many reasons, for example, fixing bugs, extending them with new features, or refactoring. The script CollectFiles.py collects all files in a repository, so your first task is to adapt the script to gather only the source files. You can find a repo’s programming languages on the bottom right of the repo’s page on GitHub (some repos are written in more than one programming language).
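One way to restrict the collected files to source files is to filter by file extension. The sketch below assumes a Java/Kotlin repository; the extension set and the helper names `is_source_file` and `keep_source_files` are illustrative choices, not part of the provided script.

```python
import os

# Assumed extension set for a Java/Kotlin repo; adjust it to the languages
# GitHub lists for the repository you are mining.
SOURCE_EXTENSIONS = {".java", ".kt"}


def is_source_file(path, extensions=SOURCE_EXTENSIONS):
    """Return True when the path ends with one of the source extensions."""
    _, ext = os.path.splitext(path)
    return ext.lower() in extensions


def keep_source_files(touches, extensions=SOURCE_EXTENSIONS):
    """Filter a {filename: touch_count} mapping down to source files."""
    return {
        path: count
        for path, count in touches.items()
        if is_source_file(path, extensions)
    }


if __name__ == "__main__":
    # Hand-written sample counts, not real repository data.
    all_files = {"app/Main.java": 4, "build.gradle": 9, "README.md": 2}
    print(keep_source_files(all_files))
```

Filtering by extension is a simple heuristic: it can miss source files with unusual extensions and does not distinguish production code from tests.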

First, write a script with the name <'your_firstname'_authorsFileTouches.py> that, for each file in the list generated by the adapted CollectFiles.py (source files only), collects the authors who touched the file and the dates on which they touched it.
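The author and date collection can be sketched as below. The sketch assumes commit objects shaped like the GitHub REST API's single-commit response, in which the author name and date sit under `commit.author` and the changed files under `files`; the function name `collect_author_touches` is hypothetical.

```python
from collections import defaultdict


def collect_author_touches(commits, source_files):
    """Map each source file to a list of (author_name, iso_date) tuples.

    `commits` is assumed to be an iterable of dicts shaped like the
    response of GET /repos/{owner}/{repo}/commits/{sha}.
    """
    wanted = set(source_files)
    touches = defaultdict(list)
    for commit in commits:
        author = commit["commit"]["author"]
        for changed in commit.get("files", []):
            name = changed["filename"]
            if name in wanted:
                touches[name].append((author["name"], author["date"]))
    return dict(touches)


if __name__ == "__main__":
    # Hand-written sample data standing in for real API responses.
    sample = [
        {
            "commit": {"author": {"name": "alice", "date": "2020-01-01T10:00:00Z"}},
            "files": [{"filename": "Main.java"}, {"filename": "README.md"}],
        },
        {
            "commit": {"author": {"name": "bob", "date": "2020-01-09T12:00:00Z"}},
            "files": [{"filename": "Main.java"}],
        },
    ]
    print(collect_author_touches(sample, ["Main.java"]))
```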
Second, write a script that generates a scatter plot (using matplotlib) of the weeks variable against the file variable, where the points are shaded according to the author variable. Each author should have a distinct color. Looking at the scatter plot, one should be able to tell which files are touched many times and by whom. This can help, for example, when identifying refactoring opportunities: the task can be allocated to a developer who has touched the file many times or has recently worked on it. You can name the script for drawing the scatter plot <'your_firstname'_scatterplot.py>. This link on Stack Overflow gives a hint on how to draw the scatter plot.
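A possible shape for the plotting script is sketched below. It assumes the touches are available as (filename, author, ISO date) tuples with dates in the `YYYY-MM-DDTHH:MM:SSZ` form used by the GitHub API, and it measures weeks from the earliest touch in the data. The function names, the axis assignment, and the `gist_rainbow` colormap are illustrative choices, not requirements of the lab.

```python
from datetime import datetime


def to_points(touches):
    """Turn (filename, author, iso_date) rows into plot coordinates.

    Returns a list of (file_index, week, author_index) triples plus the
    sorted file and author lists the indices refer to. `touches` must
    not be empty.
    """
    files = sorted({f for f, _, _ in touches})
    authors = sorted({a for _, a, _ in touches})
    dates = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%SZ") for _, _, d in touches]
    first = min(dates)
    points = [
        (files.index(f), (when - first).days // 7, authors.index(a))
        for (f, a, _), when in zip(touches, dates)
    ]
    return points, files, authors


def plot_touches(touches, out_path="scatterplot.png"):
    """Save a scatter plot of file index against week, colored by author."""
    # Imported here so the data preparation above works without matplotlib.
    import matplotlib.pyplot as plt

    points, files, authors = to_points(touches)
    fig, ax = plt.subplots()
    ax.scatter(
        [p[0] for p in points],
        [p[1] for p in points],
        c=[p[2] for p in points],
        # Continuous colormap: each author index gets its own color,
        # although neighbouring authors may look similar.
        cmap="gist_rainbow",
    )
    ax.set_xlabel("file")
    ax.set_ylabel("weeks")
    fig.savefig(out_path)
    plt.close(fig)


if __name__ == "__main__":
    # Hand-written sample rows, not real repository data.
    rows = [
        ("A.java", "alice", "2020-01-01T00:00:00Z"),
        ("B.java", "bob", "2020-01-15T00:00:00Z"),
        ("A.java", "bob", "2020-01-09T12:00:00Z"),
    ]
    plot_touches(rows)
```

Plotting integer indices instead of file names keeps the axis readable for repositories with long paths; the returned `files` list can be used to build a legend or lookup table.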

Example (scottyab/rootbeer)
The repository scottyab/rootbeer has a total of 17 unique source files (‘.java’). A total of 33 authors have touched these 17 files by updating them and committing their changes; each touch is a data point in the graph. The scatter plot below shows the authors' activity over time for the repository scottyab/rootbeer.

[Figure: scatter plot for scottyab/rootbeer]

Task 3

Write a script to extract data from the following merged pull requests [11791, 11686, 11591, 12159, 12073, 11981, 11867, 11991, 12207, 11926, 11847] in the repository apache/kafka. Read about how to extract pull requests on the GitHub Pull Request API site. For example, given the pull request number 1347 for the repo octocat/Hello-World, the following API call returns a JSON document with the pull request details: https://api.github.com/repos/octocat/Hello-World/pulls/1347
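Building the request for a single pull request might look like the sketch below. It follows the URL pattern shown above and uses only the Python standard library; the function names are hypothetical, and the token parameter is assumed to be a personal access token like the one created in Task 1.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def pull_request_url(owner, repo, number):
    """Return the REST API URL for one pull request."""
    return f"{API_ROOT}/repos/{owner}/{repo}/pulls/{number}"


def fetch_pull_request(owner, repo, number, token):
    """Download the pull request details and return them as a dict."""
    request = urllib.request.Request(
        pull_request_url(owner, repo, number),
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Prints the URL only; call fetch_pull_request with your own token
    # to perform the actual request.
    print(pull_request_url("octocat", "Hello-World", 1347))
```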

The file Pull request details is an example output for the pull request numbers 11577 and 11686 for the repo apache/kafka. Store the output in a .csv file.
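Writing the extracted details to CSV can be sketched as follows. The column selection is an assumption made for illustration and may differ from the example output file; the field names themselves (`number`, `title`, `user.login`, `created_at`, `merged_at`, `commits`, `additions`, `deletions`, `changed_files`) are taken from the GitHub pull request response. The helper names `to_row` and `write_csv` are hypothetical.

```python
import csv
import io

# Assumed column selection; the lab's example output may use other fields.
COLUMNS = [
    "number", "title", "author", "created_at", "merged_at",
    "commits", "additions", "deletions", "changed_files",
]


def to_row(pr):
    """Flatten one pull request dict into a row keyed by COLUMNS."""
    row = {key: pr.get(key) for key in COLUMNS if key != "author"}
    # The author login is nested under "user" in the API response.
    row["author"] = (pr.get("user") or {}).get("login")
    return row


def write_csv(pull_requests, handle):
    """Write a header line and one row per pull request to a text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for pr in pull_requests:
        writer.writerow(to_row(pr))


if __name__ == "__main__":
    # Hand-written sample standing in for a real API response.
    sample = {
        "number": 1347,
        "title": "Amazing new feature",
        "user": {"login": "octocat"},
        "created_at": "2011-01-26T19:01:12Z",
        "merged_at": "2011-01-26T19:01:12Z",
        "commits": 3,
        "additions": 100,
        "deletions": 3,
        "changed_files": 5,
    }
    buffer = io.StringIO()
    write_csv([sample], buffer)
    print(buffer.getvalue())
```

To write to disk instead of a buffer, open the target file with `open(path, "w", newline="")` and pass the handle to `write_csv`.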