This lesson is still being designed and assembled (Pre-Alpha version)

Data Wrangling with Computational Notebooks: Glossary

Key Points

Introduction
  • Notebooks help connect the code for cleaning and wrangling data to the documentation explaining what is being done and why.

Jupyter Notebook Interface
  • A Jupyter notebook is divided into cells that are either code, Markdown, or raw

  • Cells can be “run”, which either executes the code or renders the Markdown, depending on the cell type

  • Code cells can be rerun, but this should be avoided to prevent obscuring the notebook's workflow

Loading and Handling Pandas Data
  • Pandas provides numerous attributes and methods that are useful for wrangling and analyzing data

  • Pandas provides numerous functions and methods for loading data from, and writing data to, files of different types
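
As a minimal sketch of the two points above, the example below loads a small table, inspects it with a few attributes and methods, and writes it back out. The CSV text and its column names (city, year, rainfall_mm) are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "city,year,rainfall_mm\nOslo,2020,763\nLima,2020,16\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Attributes describe the table; methods compute something from it.
print(df.shape)    # (number of rows, number of columns)
print(df.columns)  # column labels
print(df.head())   # first rows of the table

# Write the table back out as CSV; index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading the written text again should give back the same table.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
```

With a real file, the same calls take a path instead of a buffer, and pandas has similar reader and writer pairs for other formats, such as read_excel and to_excel.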

Wrangling DataFrames
  • Select columns by using ["column name"] or rows by using the loc attribute

  • Sort based on values in a column by using the sort_values method
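
The selection and sorting operations above can be sketched as follows. The DataFrame, its column names, and its row labels are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical table with explicit row labels so that loc has something to look up.
df = pd.DataFrame(
    {
        "title": ["A", "B", "C"],
        "year": [2001, 1995, 2010],
        "pages": [320, 150, 410],
    },
    index=["x", "y", "z"],
)

# One column name in brackets returns a Series.
years = df["year"]

# A list of column names returns a DataFrame with just those columns.
subset = df[["title", "pages"]]

# loc selects rows by label...
row = df.loc["y"]

# ...or by a boolean condition evaluated for every row.
recent = df.loc[df["year"] > 2000]

# sort_values returns a new, sorted DataFrame; the original is left unchanged.
by_pages = df.sort_values("pages", ascending=False)
```

Because sort_values returns a copy, the result has to be assigned to a name (as with by_pages here) for the sorted order to be kept.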

DataFrame Analysis
  • Use .dtypes to get the type of each column in a DataFrame

  • Use the describe method to get summary statistics for the DataFrame

  • You can add a constant to every value in a numeric column by writing column + constant
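
A short sketch of these three points, using a made-up table (the column names name, temp_c, and count are assumptions for the example):

```python
import pandas as pd

# Hypothetical measurements; one text column and two numeric columns.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "temp_c": [20.0, 22.5, 19.0],
        "count": [1, 2, 3],
    }
)

# dtypes is an attribute (no parentheses): one type per column.
types = df.dtypes

# describe is a method; when numeric columns are present it
# summarises only those by default (count, mean, std, min, quartiles, max).
stats = df.describe()

# Adding a constant to a column applies it to every value in that column.
df["temp_k"] = df["temp_c"] + 273.15
```

The addition creates a new Series, so it is stored here under a new column name rather than overwriting temp_c.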

Real Example Cleaning
  • Cleaning a dataset is an iterative process that can require multiple passes

  • Restart the kernel and rerun the notebook while cleaning a dataset to make sure that your code captures all of the cleaning needed.
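
One way to keep cleaning reproducible after a kernel restart is to collect the steps in a single function that always starts from the raw data. The sketch below is only an illustration of that idea, not the lesson's own cleaning code; the data, the column names, and the clean function are hypothetical.

```python
import pandas as pd

# Hypothetical raw data with stray whitespace, a non-numeric reading,
# and a missing station name.
raw = pd.DataFrame(
    {
        "station": [" A", "B ", "B ", None],
        "reading": ["1.5", "n/a", "2.0", "3.1"],
    }
)


def clean(frame):
    """Return a cleaned copy of frame; the input is left untouched."""
    out = frame.copy()
    # Pass 1: remove surrounding whitespace from the text column.
    out["station"] = out["station"].str.strip()
    # Pass 2: convert readings to numbers; unparseable values become NaN.
    out["reading"] = pd.to_numeric(out["reading"], errors="coerce")
    # Pass 3: drop rows that are still missing a station or a reading.
    out = out.dropna(subset=["station", "reading"])
    return out.reset_index(drop=True)


cleaned = clean(raw)
```

Each new problem found while exploring the data becomes another step inside clean, so running the notebook from the top on a fresh kernel reproduces the full cleaning.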

Real Example Analysis
  • Grouping data by year and month is a powerful way to identify monthly and yearly changes

  • You can easily add more measurements to a single plot by passing a list of column names

  • There is a lot we didn’t cover here, so take a look at the Matplotlib docs (Link to Matplotlib docs) and at other libraries that let you make dynamic plots, e.g. Plotly (Link to Plotly docs)
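
Grouping by year and month might look like the sketch below. The table and its columns (date, visits, loans) are invented for the example, and summing is just one possible aggregation.

```python
import pandas as pd

# Hypothetical daily records with two measurements per row.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-05", "2020-01-20", "2020-02-10", "2021-01-15"]
        ),
        "visits": [10, 20, 5, 8],
        "loans": [3, 4, 1, 2],
    }
)

# Derive the grouping keys from the datetime column.
year = df["date"].dt.year.rename("year")
month = df["date"].dt.month.rename("month")

# Group by (year, month) and total both measurements;
# the list of column names selects which measurements to keep.
monthly = df.groupby([year, month])[["visits", "loans"]].sum()
print(monthly)
```

With Matplotlib installed, calling monthly.plot(y=["visits", "loans"]) should draw both measurements on the same axes, which is where the list of column names comes in.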

Glossary

FIXME