3.4 Data Cleaning Overview

3.4.1 Analysis Development

All data cleaning should be done in a single notebook that you clarify and expand over time. Use dplyr, naniar, janitor, and sf for data cleaning in short, well-organized pipelines.

3.4.2 General Approach

The amount of data cleaning needed will vary significantly based on the data set you are using and the measures you have selected. Thus there are no “one size fits all” instructions for cleaning your data, though you should follow the data cleaning workflow we discussed in class. In general, you will want to focus on making sure a number of criteria are met:

Variables should have short, clear, intuitive names.
All missing data should be recoded to NA values (if they are not already coded that way).
If your data set does not include identification numbers for each row, you will need to create them. The tibble package’s rowid_to_column() function can be used for this.
Only columns necessary for mapping should be included - subset out all other columns.
Only observations necessary for mapping should be included - subset out all other observations (e.g. if you are using the crime data, subset out all crimes not in the category you are working with). You will also want to remove all duplicate observations.
Identify and subset out data missing spatial references. In the crime data set, these will either be missing already (NA values) or have values of 0. In the CSB data set, these will either be missing already (NA values) or have values less than 800000 in the x coordinate and less than 980000 in the y coordinate.
- If you are using another data source, make sure no observations are missing spatial references. The indicators above are similar to the ones you should look for.
Remove observations in the crime data where the count is -1 and in the CSB data where there is a status of CANCEL.
- If you are using another data source, make sure there are no observations for invalid data, such as calls for service that were canceled or crimes that were removed from the data set.
Clean and modify the values of specific observations as needed.
- For example, if you have a category variable, which both the crime and CSB data do, make sure it contains short, clear values. If such a variable has many categories, create a new categorical variable that simply differentiates between the relevant codes. For instance, the following values are the focus of a sample project using the CSB data. These nine categories could be summarized as (1) stray animal, loose, (2) stray animal, contained, and (3) animal surrender.

                    Category |      Freq.     Percent        Cum.
-----------------------------+-----------------------------------
                Stray Animal |      4,102        0.57       83.25
          Stray Animal Cntnd |        118        0.02       83.27
                   Stray Cat |      1,623        0.22       83.49
          Stray Dog At Large |      9,036        1.25       84.75
         Stray Dog Cntnd-ACC |      1,416        0.20       84.94
         Stray Dog Contained |        939        0.13       85.07
               Surrender Cat |         18        0.00       85.84
               Surrender Dog |         23        0.00       85.84
               Surrender Pet |        151        0.02       85.86
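
One way the nine values in the table might be collapsed into three groups is sketched below with dplyr's case_when(). The column names category and animal_type are hypothetical, and treating "Stray Animal" and "Stray Cat" as loose animals is an assumption made for illustration; check it against how your own project defines each group.

```r
library(dplyr)
library(tibble)

# Toy data: one row per value from the table above.
# `category` is a hypothetical column name, not necessarily the one in the CSB data.
csb <- tibble(category = c(
  "Stray Animal", "Stray Animal Cntnd", "Stray Cat",
  "Stray Dog At Large", "Stray Dog Cntnd-ACC", "Stray Dog Contained",
  "Surrender Cat", "Surrender Dog", "Surrender Pet"
))

# Create a new, simpler categorical variable and leave the original intact.
# Assumption: "Stray Animal" and "Stray Cat" count as loose animals.
csb <- csb %>%
  mutate(animal_type = case_when(
    category %in% c("Stray Animal Cntnd", "Stray Dog Cntnd-ACC",
                    "Stray Dog Contained") ~ "stray animal, contained",
    category %in% c("Stray Animal", "Stray Cat",
                    "Stray Dog At Large") ~ "stray animal, loose",
    category %in% c("Surrender Cat", "Surrender Dog",
                    "Surrender Pet") ~ "animal surrender",
    TRUE ~ NA_character_  # anything unexpected becomes NA so it is easy to spot
  ))

# Cross-check the recode against the original values
count(csb, animal_type, category)
```

Keeping the original variable alongside the new one makes it possible to confirm that every code landed in the group you intended.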

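The checklist above could be sketched as a single short pipeline. This is a minimal illustration on made-up data, not the required solution: every column name, the "-99" missing-value code, and the "Burglary" category are assumptions, and the real crime and CSB data will need their own names and values.

```r
library(dplyr)
library(tibble)
library(janitor)
library(naniar)

# Toy stand-in for a crime extract; all names and values are hypothetical
crimes_raw <- tribble(
  ~`Crime Category`, ~Count, ~`X Coord`, ~`Y Coord`, ~District,
  "Burglary",         1,      895000,     1020000,    "4",
  "Burglary",         1,      895000,     1020000,    "4",    # exact duplicate
  "Burglary",        -1,      897500,     1015000,    "2",    # count of -1
  "Burglary",         1,      0,          0,          "1",    # coordinates of 0
  "Larceny",          1,      901200,     1009000,    "1",    # other category
  "Burglary",         1,      NA,         NA,         "3",    # missing coordinates
  "Burglary",         1,      899100,     1012500,    "-99"   # assumed missing code
)

crimes <- crimes_raw %>%
  clean_names() %>%                                    # e.g. "X Coord" becomes x_coord
  rename(category = crime_category, x = x_coord, y = y_coord) %>%
  replace_with_na(replace = list(district = "-99")) %>%  # recode missing code to NA
  filter(category == "Burglary") %>%                   # keep only the relevant crimes
  filter(count != -1) %>%                              # drop invalid records
  filter(!is.na(x), !is.na(y), x != 0, y != 0) %>%     # drop rows without spatial references
  distinct() %>%                                       # drop duplicate observations
  rowid_to_column(var = "id") %>%                      # add ids once duplicates are gone
  select(id, category, district, x, y)                 # keep only the columns needed

# Summarize how much missing data remains in each column
miss_var_summary(crimes)
```

The identification numbers are added after distinct() here because unique ids would otherwise make every row look different, which would stop distinct() from removing the duplicates.
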
For projects using a non-standard data set: follow the same general process as above. You should have point data, though they may come as .csv or .shp files. If they were obtained from a geodatabase, export them to .shp and then import them into R. If you need to geocode the data (i.e. you have address or city identifiers but no spatial data for them), see Chris to discuss this process.
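
As a rough sketch of the import step, point data stored in a .csv file can be converted to an sf object with st_as_sf(), while a shapefile can be read directly with st_read(). The file contents, the lon and lat column names, the coordinate system (EPSG 4326), and the shapefile path below are all assumptions for illustration; use the coordinate system your own data were recorded in.

```r
library(sf)

# Write a small made-up .csv so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3,
             lon = c(-90.25, -90.20, -90.30),
             lat = c(38.63, 38.65, 38.60)),
  csv_path, row.names = FALSE
)

# Import the table, then convert the coordinate columns to point geometry.
# crs = 4326 (longitude/latitude, WGS 84) is an assumption about the toy data.
pts <- read.csv(csv_path)
pts_sf <- st_as_sf(pts, coords = c("lon", "lat"), crs = 4326)

# A shapefile would be read directly instead; this path is hypothetical
# pts_sf <- st_read("data/points.shp")

pts_sf
```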