Glossary

Key Points

Introduction	Good data organization is the foundation of any research project.
Formatting data tables in Spreadsheets	Never modify your raw data. Always make a copy before making any changes. Keep track of all of the steps you take to clean your data in a plain text file. Organize your data according to tidy data principles.
Formatting problems	Avoid using multiple tables within one spreadsheet. Avoid spreading data across multiple tabs. Record zeros as zeros. Use an appropriate null value to record missing data. Don’t use formatting to convey information or to make your spreadsheet look pretty. Place comments in a separate column. Record units in column headers. Include only one piece of information in a cell. Avoid spaces, numbers and special characters in column headers. Avoid special characters in your data. Record metadata in a separate plain text file.
Dates as data	Treating dates as multiple pieces of data rather than one makes them easier to handle.
Quality control	Always copy your original spreadsheet file and work with a copy so you don’t affect the raw data. Use data validation to prevent accidentally entering invalid data. Use sorting to check for invalid data. Use conditional formatting (cautiously) to check for invalid data.
Exporting data	Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data. Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.
Before we start	Python is an open source and platform independent programming language. Jupyter Notebook and the Spyder IDE are great tools to code in and interact with Python. With the large Python community it is easy to find help on the internet.
Short Introduction to Programming in Python	Python is an interpreted language which can be used interactively (executing one command at a time) or in scripting mode (executing a series of commands saved in file). One can assign a value to a variable in Python. Those variables can be of several types, such as string, integer, floating point and complex numbers. Lists and tuples are similar in that they are ordered lists of elements; they differ in that a tuple is immutable (cannot be changed). Dictionaries are data structures that provide mappings between keys and values.
Starting With Data	Libraries enable us to extend the functionality of Python. Pandas is a popular library for working with data. A Dataframe is a Pandas data structure that allows one to access data by column (name or index) or row. Aggregating data using the `groupby()` function enables you to generate useful summaries of data quickly. Plots can be created from DataFrames or subsets of data that have been generated with `groupby()`.
Data Types and Formats	Pandas uses other names for data types than Python, for example: `object` for textual data. A column in a DataFrame can only have one data type. The data type in a DataFrame’s single column can be checked using `dtype`. Make conscious decisions about how to manage missing data. A DataFrame can be saved to a CSV file using the `to_csv` function.
Indexing, Slicing and Subsetting DataFrames in Python	In Python, portions of data can be accessed using indices, slices, column headings, and condition-based subsetting. Python uses 0-based indexing, in which the first element in a list, tuple or any other data structure has an index of 0. Pandas enables common data exploration steps such as data indexing, slicing and conditional subsetting.
Combining DataFrames with Pandas	Pandas’ `merge` and `concat` can be used to combine subsets of a DataFrame, or even data from different files. `join` function combines DataFrames based on index or column. Joining two DataFrames can be done in multiple ways (left, right, and inner) depending on what data must be in the final DataFrame. `to_csv` can be used to write out DataFrames in CSV format.
Software installation using conda	install plotnine, a required pakcage for our lessons. xarray and iris are the core Python libraries used in the atmosphere and ocean sciences. Use conda to install and manage your Python environments.
Data Ingest and Visualization - Matplotlib and Pandas	Matplotlib is the engine behind plotnine and Pandas plots. The object-based nature of matplotlib plots enables their detailed customization after they have been created. Export plots to a file using the `savefig` method.
Making Plots With plotnine	The `data`, `aes` variables and a `geometry` are the main elements of a plotnine graph With the `+` operator, additional `scale_`, `theme_`, `xlab/ylab` and `facet_*` elements are added
Refresher
Data Workflows and Automation	Loops help automate repetitive tasks over sets of items. Loops combined with functions provide a way to process data more efficiently than we could by hand. Conditional statements enable execution of different operations on different data. Functions enable code reuse.
Introduction to netCDF	NetCDF is a format to store gridded data and widely use in climate science. A netCDF file contains dimensions, variables, and attributes. Xarray is a library to work with NetCDF data in Python. CMIP data is used in climate modelling.
Visualising CMIP data	Libraries such as xarray can make loading, processing and visualising netCDF data much easier. The cmocean library contains colormaps custom made for the ocean sciences.
Refresher
Functions	Define a function using `def name(...params...)`. The body of a function must be indented. Call a function using `name(...values...)`. Use `help(thing)` to view help for something. Put docstrings in functions to provide help for that function. Specify default values for parameters when defining a function using `name=value` in the parameter list. The readability of your code can be greatly enhanced by using numerous short functions. Write (and import) modules to avoid code duplication.
Vectorisation	For large arrays, looping over each element can be slow in high-level languages like Python. Vectorised operations can be used to avoid looping over array elements.
Command line programs	Libraries such as `argparse` can be used the efficiently handle command line arguments. Most Python scripts have a similar structure that can be used as a template.
Version control	Use git config to configure a user name, email address, editor, and other preferences once per machine. `git init` initializes a repository. `git status` shows the status of a repository. Files can be stored in a project’s working directory (which users see), the staging area (where the next commit is being built up) and the local repository (where commits are permanently recorded). `git add` puts files in the staging area. `git commit` saves the staged content as a new commit in the local repository. Always write a log message when committing changes. `git diff` displays differences between commits. `git checkout` recovers old versions of files.
GitHub	A local Git repository can be connected to one or more remote repositories. Use the HTTPS protocol to connect to remote repositories until you have learned how to set up SSH. `git push` copies changes from a local repository to a remote repository. `git pull` copies changes from a remote repository to a local repository.
Defensive programming	Program defensively, i.e., assume that errors are going to arise, and write code to detect them when they do. Put assertions in programs to check their state as they run, and to help readers understand how those programs are supposed to work. The `pdb` library can be used to debug a Python script by stepping through line-by-line. Software Carpentry has more advanced lessons on code testing.
Data provenance	It is possible (in only a few lines of code) to record the provenance of a data file or image.
Accessing SQLite Databases Using Python and Pandas	sqlite3 provides a SQL-like interface to read, query, and write SQL databases from Python. sqlite3 can be used with Pandas to read SQL data to the familiar Pandas DataFrame. Pandas and sqlite3 can also be used to transfer between the CSV and SQL formats.

0-based indexing: is a way of assigning indices to elements in a sequential, ordered data structure starting from 0, i.e. where the first element of the sequence has index 0.
CSV (file): is an acronym which stands for Comma-Separated Values file. CSV files store tabular data, either numbers, strings, or a combination of the two, in plain text with columns separated by a comma and rows by the carriage return character.
database: is an organized collection of data.
dataframe: is a two-dimensional labeled data structure with columns of (potentially) different type.
data structure: is a particular way of organizing data in memory.
data type: is a particular kind of item that can be assigned to a variable, defined by the values it can take, the programming language in use and the operations that can be performed on it.
dictionary: is an unordered Python data structure designed to contain key-value pairs, where both the key and the value can be integers, floats or strings. Elements of a dictionary can be accessed by their key and can be modified.
docstring: is an optional documentation string to describe what a Python function does.
faceting: is the act of plotting relationships between set variables in multiple subsets of the data with the results appearing as different panels in the same figure.
float: is a Python data type designed to store positive and negative decimal numbers by means of a floating point representation.
function: is a group of related statements that perform a specific task.
integer: is a Python data type designed to store positive and negative integer numbers.
interactive mode: is an online mode of operation in which the user writes the commands directly on the command line one-by-one and execute them immediately by pressing a button on the keyword, usually Return.
join key: is a variable or an array representing the column names over which pandas.DataFrame.join() merge together columns of different data sets.
library: is a set of functions and methods grouped together to perform some specific sort of tasks.
list: is a Python data structure designed to contain sequences of integers, floats, strings and any combination of the previous. The sequence is ordered and indexed by integers, starting from 0. Elements of a list can be accessed by their index and can be modified.
loop: is a sequence of instructions that is continually repeated until a condition is satisfied.
NaN: is an acronym for Not-a-Number and represents that either a value is missing or the calculation cannot output any meaningful result.
None: is an object that represents no value.
scripting mode: is an offline mode of operation in which the user writes the commands to be executed in a text file (with .py extension for Python) which is then compiled or interpreted to run the program. Notes that Python interprets script on run-time and compiles a binary version of the program to speed up the execution time.
Sequential (data structure): is an ordered group of objects stored in memory which can be accessed specifying their index, i.e. their position, in the structure.
SQL: or Structured Query Language, is a domain-specific language for managing data stored in a relational database management system (RDBMS).
SQLite: is a self-contained, public domain SQL database engine.
string: is a Python data type designed to store sequences of characters.
tuple: is a Python data structure designed to contain sequences of integers, floats, strings and any combination of the previous. The sequence is ordered and indexed by integers, starting from 0. Elements of a tuple can be accessed by their index but cannot be modified.

Data Carpentry for Oceanographers: Glossary

Key Points

Glossary