Version control

Overview

Teaching: 35 min
Exercises: 0 min

Questions

How can I record the revision history of my code?

Objectives

Configure git the first time it is used on a computer.

Create a local Git repository.

Go through the modify-add-commit cycle for one or more files.

Explain what the HEAD of a repository is and how to use it.

Identify and use Git commit numbers.

Compare various versions of tracked files.

Restore old versions of files.

Follow along

For this lesson participants follow along command by command, rather than observing and then completing challenges afterwards.

What is version control?

A version control system stores a main copy of your code in a repository, which you can’t edit directly.

Instead, you checkout a working copy of the code, edit that code, then commit changes back to the repository.

In this way, the system records a complete revision history (i.e. of every commit), so that you can retrieve and compare previous versions at any time.

This is useful from an individual viewpoint, because you don’t need to store multiple (but slightly different) copies of the same script.

“Piled Higher and Deeper” by Jorge Cham, http://www.phdcomics.com

It’s also useful from a collaboration viewpoint (including collaborating with yourself across different computers) because the system keeps a record of who made what changes and when.

What is Git?

git is software that lets you work on a set of files without losing track of previous versions of those files
git enables multiple people to work on the same set of files simultaneously
git keeps track of changes in a repository so that they
- won’t be lost
- they can be shared
git is one of several tools for revision control, also known as version control

What is Github?

We will delve into github a bit more in the next lesson.

GitHub is a free* website that makes it easier to use git.
GitHub facilitates creating, browsing, and sharing git repositories
GitHub provides tools that complement git, such as software licensing, issue tracking, wikis, and various kinds of automation
GitHub includes visualizations that help you understand your version history such as side by side views of changed versions with differences color-coded.
GitHub is one of the largest collections of open-source software in the world
GitHub is one of several websites and/or tools that work with git; Others include Bitbucket and GitLab

Frequently Asked Questions

Is git just for code?
- No, but you should generally only use it for small text files (small = <1MB) because it’s designed and optimized for code.
Can I use git with my favorite programming tools?
- Yes
Is git hard to learn?
- Yes and no. Commands are simple, concepts can be tricky at first.
Is GitHub just for open-source software?
- No, you can use it for copyrighted software and you can create private repositories
Can I use git without using GitHub?
- Yes
Is github free?
- Yes, the core functionality of github is free though there are some limitations such as the number of collaborators in a private repository. There is no longer a limit to the number of private repositories you create in a free account.

More resources

Git and GitHub tutorial by Joe Futrelle, WHOI IS App Dev, 19 June 2020
- Disclaimer: Github continues to update how the site looks and add new features. Some of the screenshots may look different now. Special thanks to Joe for allowing us to pull content from his workshop for this lesson.
- link to slides
- link to recording
Software Carpentry: Version Control with Git
Git cheat sheet

Setup to work with Git locally on your computer

It is possible to use github.com alone to manage your version control, but we are going to show you how to use Git locally to work with files on your computer. This will help demonstrate the foundational concepts of git and make what you see in github clearer.

When we use Git on a new computer for the first time, we need to configure a few things.

We recommend following along with “Git Bash” application for Windows and Terminal (Linux/Mac).

$ git config --global user.name "Your Name"
$ git config --global user.email "you@email.com"

This user name and email will be associated with your subsequent Git activity, which means that any changes pushed to GitHub, BitBucket, GitLab or another Git host server later on in this lesson will include this information.

You only need to run these configuration commands once - git will remember then for next time.

We then need to navigate to our data-carpentry directory and tell Git to initialise that directory as a Git repository.

$ cd ~/Desktop/data-carpentry
$ git init

If we use ls to show the directory’s contents, it appears that nothing has changed:

$ ls -F

data/					script_template.py
plot_precipitation_climatology.py

But if we add the -a flag to show everything, we can see that Git has created a hidden directory within data-carpentry called .git:

$ ls -F -a

./					data/
../					plot_precipitation_climatology.py
.git/				script_template.py

Git stores information about the project in this special sub-directory. If we ever delete it, we will lose the project’s history.

We can check that everything is set up correctly by asking Git to tell us the status of our project:

$ git status

$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	plot_precipitation_climatology.py
	script_template.py

nothing added to commit but untracked files present (use "git add" to track)

Tracking changes

The “untracked files” message means that there’s a file/s in the directory that Git isn’t keeping track of. We can tell Git to track a file using git add:

$ git add plot_precipitation_climatology.py

and then check that the right thing happened:

$ git status

On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

	new file:   plot_precipitation_climatology.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

Git now knows that it’s supposed to keep track of plot_precipitation_climatology.py, but it hasn’t recorded these changes as a commit yet. To get it to do that, we need to run one more command:

$ git commit -m "Initial commit of precip climatology script"

[master (root-commit) 32b1b66] Initial commit of precip climatology script
 1 file changed, 121 insertions(+)
 create mode 100644 plot_precipitation_climatology.py

When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision) and its short identifier is 32b1b66 (Your commit may have another identifier.)

We use the -m flag (for “message”) to record a short, descriptive, and specific comment that will help us remember later on what we did and why. If we just run git commit without the -m option, Git will launch nano (or whatever other editor we configured as core.editor) so that we can write a longer message.

If we run git status now:

$ git status

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

nothing added to commit but untracked files present (use "git add" to track)

it tells us everything is up to date. If we want to know what we’ve done recently, we can ask Git to show us the project’s history using git log:

$ git log

commit 32b1b664a647abbbe46a12ce98b25fa2cbbb7c76
Author: Damien Irving <my@email.com>
Date:   Mon Dec 18 14:30:16 2017 +1100

    Initial commit of precip climatology script

git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created.

Let’s go ahead and open our favourite text editor and make a small change to plot_precipitation_climatology.py by editing the description variable (which is used by argparse in the help information it displays at the command line).

description='Plot the precipitation climatology for a given season.'

When we run git status now, it tells us that a file it already knows about has been modified:

$ git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   plot_precipitation_climatology.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

no changes added to commit (use "git add" and/or "git commit -a")

The last line is the key phrase: “no changes added to commit”. We have changed this file, but we haven’t told Git we will want to save those changes (which we do with git add) nor have we saved them (which we do with git commit). So let’s do that now. It is good practice to always review our changes before saving them. We do this using git diff. This shows us the differences between the current state of the file and the most recently saved version:

$ git diff

$ git diff
diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.
index 056b433..a0aa9e4 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -99,7 +99,7 @@ def main(inargs):
 
 if __name__ == '__main__':
 
-    description='Plot the precipitation climatology.'
+    description='Plot the precipitation climatology for a given season.'
     parser = argparse.ArgumentParser(description=description)
     
     parser.add_argument("pr_file", type=str,

The output is cryptic because it is actually a series of commands for tools like editors and patch telling them how to reconstruct one file given the other. If we break it down into pieces:

The first line tells us that Git is producing output similar to the Unix diff command comparing the old and new versions of the file.
The second line tells exactly which versions of the file Git is comparing; 056b433 and a0aa9e4 are unique computer-generated labels for those versions.
The third and fourth lines once again show the name of the file being changed.
The remaining lines are the most interesting, they show us the actual differences and the lines on which they occur. In particular, the + marker in the first column shows where we added a line.

After reviewing our change, it’s time to commit it:

$ git commit -m "Small improvement to help information"

On branch master
Changes not staged for commit:
	modified:   plot_precipitation_climatology.py

Untracked files:
	data/
	script_template.py

no changes added to commit

Whoops: Git won’t commit because we didn’t use git add first. Let’s fix that:

$ git add plot_precipitation_climatology.py
$ git commit -m "Small improvement to help information"

[master 444c3c0] Small improvement to help information
 1 file changed, 1 insertion(+), 1 deletion(-)

Git insists that we add files to the set we want to commit before actually committing anything. This allows us to commit our changes in stages and capture changes in logical portions rather than only large batches. For example, suppose we’re writing our thesis using LaTeX (the plain text .tex files can be tracked using Git) and we add a few citations to the introduction chapter. We might want to commit those additions to our introduction.tex file but not commit the work we’re doing on the conclusion.tex file (which we haven’t finished yet).

To allow for this, Git has a special staging area where it keeps track of things that have been added to the current changeset but not yet committed.

Staging Area

If you think of Git as taking snapshots of changes over the life of a project, git add specifies what will go in a snapshot (putting things in the staging area), and git commit then actually takes the snapshot, and makes a permanent record of it (as a commit). If you don’t have anything staged when you type git commit, Git will prompt you to use git commit -a or git commit --all, which is kind of like gathering everyone for the picture! However, it’s almost always better to explicitly add things to the staging area, because you might commit changes you forgot you made. (Going back to snapshots, you might get the extra with incomplete makeup walking on the stage for the snapshot because you used -a!) Try to stage things manually, or you might find yourself searching for “git undo commit” more than you would like!

The Git Staging Area

Let’s do the whole edit-add-commit process one more time to watch as our changes to a file move from our editor to the staging area and into long-term storage. First, we’ll tweak the section of the script that imports all the libraries we need, by putting them in the order suggested by the PEP 8 - Style Guide for Python Code (standard library imports, related third party imports, then local application/library specific imports):

import argparse

import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import cartopy.crs as ccrs
import cmocean

$ git diff

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.
index a0aa9e4..29a40fb 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -1,13 +1,12 @@
 import argparse
+
+import numpy as np
+import matplotlib.pyplot as plt
 import xarray as xr
-import matplotlib.pyplot as plt
 import cmocean
-import numpy as np

Let’s save our changes:

$ git add plot_precipitation_climatology.py
$ git commit -m "Ordered imports according to PEP 8"

[master f9fb238] Ordered imports according to PEP 8
 1 file changed, 2 insertions(+), 2 deletions(-)

check our status:

$ git status

On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

nothing added to commit but untracked files present (use "git add" to track)

and look at the history of what we’ve done so far:

$ git log

commit f9fb2388a096a217aa2c9e4695bf786605b946c9
Author: Damien Irving <my@email.com>
Date:   Mon Dec 18 15:43:17 2017 +1100

    Ordered imports according to PEP 8

commit 444c3c045dc69a323e40d4a04813b88e4b89e05e
Author: Damien Irving <my@email.com>
Date:   Mon Dec 18 14:59:47 2017 +1100

    Small improvement to help information

commit 32b1b664a647abbbe46a12ce98b25fa2cbbb7c76
Author: Damien Irving <my@email.com>
Date:   Mon Dec 18 14:30:16 2017 +1100

    Initial commit of precip climatology script

Exploring history

Viewing changes between versions is part of git functionality. However, for large changes it can be easier to understand what is going on when looking at them in github. We will get to that in the next lesson, but here is how you can do it on command line with git.

As we saw earlier, we can refer to commits by their identifiers. You can refer to the most recent commit of the working directory by using the identifier HEAD.

To demonstrate how to use HEAD, let’s make a trival change to plot_precipitation_climatology.py by inserting a comment.

# A random comment

Now, let’s see what we get.

$ git diff HEAD plot_precipitation_climatology.py

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.
index 29a40fb..344a34e 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -9,6 +9,7 @@ import iris.coord_categorisation
 import cmocean
 
+# A random comment
 
 def convert_pr_units(darray):
     """Convert kg m-2 s-1 to mm day-1.

which is the same as what you would get if you leave out HEAD (try it). The real goodness in all this is when you can refer to previous commits. We do that by adding ~1 to refer to the commit one before HEAD.

$ git diff HEAD~1 plot_precipitation_climatology.py

If we want to see the differences between older commits we can use git diff again, but with the notation HEAD~2, HEAD~3, and so on, to refer to them.

We could also use git show which shows us what changes we made at an older commit as well as the commit message, rather than the differences between a commit and our working directory.

$ git show HEAD~1 plot_precipitation_climatology.py

commit 444c3c045dc69a323e40d4a04813b88e4b89e05e
Author: Damien Irving <my@email.com>
Date:   Mon Dec 18 14:59:47 2017 +1100

    Small improvement to help information

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py
index 056b433..a0aa9e4 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -99,7 +99,7 @@ def main(inargs):
 
 if __name__ == '__main__':
 
-    description='Plot the precipitation climatology.'
+    description='Plot the precipitation climatology for a given season.'
     parser = argparse.ArgumentParser(description=description)
     
     parser.add_argument("pr_file", type=str,

We can also refer to commits using those long strings of digits and letters that git log displays. These are unique IDs for the changes, and “unique” really does mean unique: every change to any set of files on any computer has a unique 40-character identifier. Our second commit was given the ID 444c3c045dc69a323e40d4a04813b88e4b89e05e, but you only have to use the first seven characters for git to know what you mean:

$ git diff 444c3c0 plot_precipitation_climatology.py

commit 444c3c045dc69a323e40d4a04813b88e4b89e05e
Author: Damien Irving <my@email.com>
Date:   Mon Dec 18 14:59:47 2017 +1100

    Small improvement to help information

diff --git a/plot_precipitation_climatology.py b/plot_precipitation_climatology.py
index 056b433..a0aa9e4 100644
--- a/plot_precipitation_climatology.py
+++ b/plot_precipitation_climatology.py
@@ -99,7 +99,7 @@ def main(inargs):
 
 if __name__ == '__main__':
 
-    description='Plot the precipitation climatology.'
+    description='Plot the precipitation climatology for a given season.'
     parser = argparse.ArgumentParser(description=description)
     
     parser.add_argument("pr_file", type=str,

Recovering/Reverting

All right! So we can save changes to files and see what we’ve changed—now how can we restore older versions of things? Let’s suppose we accidentally overwrite our file:

$ echo "whoops" > plot_precipitation_climatology.py
$ cat plot_precipitation_climatology.py

whoops

git status now tells us that the file has been changed, but those changes haven’t been staged:

$ git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   plot_precipitation_climatology.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	data/
	script_template.py

no changes added to commit (use "git add" and/or "git commit -a")

We can put things back the way they were at the time of our last commit by using git checkout:

$ git checkout HEAD plot_precipitation_climatology.py
$ cat plot_precipitation_climatology

import argparse
import numpy
...

As you might guess from its name, git checkout checks out (i.e., restores) an old version of a file. In this case, we’re telling Git that we want to recover the version of the file recorded in HEAD, which is the last saved commit.

We’ve lost the random comment that we inserted (that change hadn’t been committed) but everything else is there.

plot_precipitation_climatology.py

At the conclusion of this lesson your plot_precipitation_climatology.py script should look something like the following:

import argparse

import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import cartopy.crs as ccrs
import cmocean


def convert_pr_units(darray):
    """Convert kg m-2 s-1 to mm day-1.
   
    Args:
      darray (xarray.DataArray): Precipitation data
    
    """
    
    darray.data = darray.data * 86400
    darray.attrs['units'] = 'mm/day'
   
    return darray


def create_plot(clim, model_name, season, gridlines=False, levels=None):
    """Plot the precipitation climatology.
   
    Args:
      clim (xarray.DataArray): Precipitation climatology data
      model_name (str): Name of the climate model
      season (str): Season
      
    Kwargs:
      gridlines (bool): Select whether to plot gridlines
      levels (list): Tick marks on the colorbar    
    
    """

    if not levels:
        levels = np.arange(0, 13.5, 1.5)
        
    fig = plt.figure(figsize=[12,5])
    ax = fig.add_subplot(111, projection=ccrs.PlateCarree(central_longitude=180))
    clim.sel(season=season).plot.contourf(ax=ax,
                                          levels=levels,
                                          extend='max',
                                          transform=ccrs.PlateCarree(),
                                          cbar_kwargs={'label': clim.units},
                                          cmap=cmocean.cm.haline_r)
    ax.coastlines()
    if gridlines:
        plt.gca().gridlines()
    
    title = '%s precipitation climatology (%s)' %(model_name, season)
    plt.title(title)


def main(inargs):
    """Run the program."""

    dset = xr.open_dataset(inargs.pr_file)
    
    clim = dset['pr'].groupby('time.season').mean('time')
    clim = convert_pr_units(clim)

    create_plot(clim, dset.attrs['model_id'], inargs.season,
                gridlines=inargs.gridlines, levels=inargs.cbar_levels)
    plt.savefig(inargs.output_file, dpi=200)


if __name__ == '__main__':
    description='Plot the precipitation climatology for a given season.'
    parser = argparse.ArgumentParser(description=description)
   
    parser.add_argument("pr_file", type=str, help="Precipitation data file")
    parser.add_argument("season", type=str, help="Season to plot")
    parser.add_argument("output_file", type=str, help="Output file name")

    parser.add_argument("--gridlines", action="store_true", default=False,
                        help="Include gridlines on the plot")
    parser.add_argument("--cbar_levels", type=float, nargs='*', default=None,
                        help='list of levels / tick marks to appear on the colorbar')

    args = parser.parse_args()
   
    main(args)

Key Points

Use git config to configure a user name, email address, editor, and other preferences once per machine.

git init initializes a repository.

git status shows the status of a repository.

Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded).

git add puts files in the staging area.

git commit saves the staged content as a new commit in the local repository.

Always write a log message when committing changes.

git diff displays differences between commits.

git checkout recovers old versions of files.

previous episode

Data Carpentry for Oceanographers

next episode