3.4 Data Cleaning Overview
3.4.1 Analysis Development
All data cleaning should be done in a single notebook that you clarify and expand over time. Use dplyr, naniar, janitor, and sf for data cleaning in short, well-organized pipelines.
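As a rough illustration of what a short pipeline might look like, the sketch below chains a few steps from janitor, dplyr, and naniar. The data frame and its column names are invented for this example and do not come from the course data sets.

```r
library(dplyr)
library(janitor)
library(naniar)

# Hypothetical raw data; the column names and values are made up
raw <- tibble(
  `Complaint Number`  = c("A1", "A2", "A2"),
  `Crime Description` = c("LARCENY", "BURGLARY", "BURGLARY"),
  `X Coord`           = c(895000, 0, 0)
)

crimes <- raw %>%
  clean_names() %>%                                # complaint_number, crime_description, x_coord
  rename(id = complaint_number,
         category = crime_description) %>%         # short, clear, intuitive names
  replace_with_na(replace = list(x_coord = 0)) %>% # recode the placeholder 0 to NA
  distinct()                                       # drop exact duplicate rows
```

Keeping each step on its own line with a brief comment makes the notebook easier to expand later.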
3.4.2 General Approach
The amount of data cleaning needed will vary significantly based on the data set you are using and the measures you have selected. Thus there are no “one size fits all” instructions for cleaning your data, though you should follow the data cleaning workflow we discussed in class. In general, you will want to focus on making sure a number of criteria are met:
- Variables should have short, clear, intuitive names.
- All missing data have been recoded to NA values (if they were not already coded that way).
- If your data set does not include identification numbers for each row, you will need to create them. The tibble package's rowid_to_column() function can be used for this.
- Only columns necessary for mapping should be included; subset out all other columns.
- Only observations necessary for mapping should be included; subset out all other observations (e.g. if you are using the crime data, subset out all crimes not in the category you are working with). You will also want to remove all duplicate observations.
- Identify and subset out data missing spatial references. In the crime data set, these will either be missing already (NA values) or have values of 0. In the CSB data set, these will either be missing already (NA values) or have values less than 800000 in the x coordinate and less than 980000 in the y coordinate.
  - If you are using another data source, make sure no observations are missing spatial data. The indicators above are similar to the indicators you should look for.
- Remove observations in the crime data where the count is -1 and in the CSB data where there is a status of CANCEL.
  - If you are using another data source, make sure there are no observations for invalid data, such as calls for service that were canceled or crimes that were removed from the data set.
- Clean and modify the values of specific observations as needed.
  - For example, if you have a category variable (which both the crime and CSB data do), make sure it returns short, clear values. If you have multiple categories in such a variable, create a new categorical variable that simply differentiates between the relevant codes. For instance, the following nine values are the focus of a sample project using the CSB data. They could be summarized as (1) stray animal, loose, (2) stray animal, contained, and (3) animal surrender:
Category             |  Count   Percent   Cum. %
Stray Animal         |  4,102      0.57    83.25
Stray Animal Cntnd   |    118      0.02    83.27
Stray Cat            |  1,623      0.22    83.49
Stray Dog At Large   |  9,036      1.25    84.75
Stray Dog Cntnd-ACC  |  1,416      0.20    84.94
Stray Dog Contained  |    939      0.13    85.07
Surrender Cat        |     18      0.00    85.84
Surrender Dog        |     23      0.00    85.84
Surrender Pet        |    151      0.02    85.86
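The checklist above could be sketched as a single pipeline along the following lines. This is only an illustration under stated assumptions: the data frame, its column names (problem, status, x, y, notes), and the status value CLOSED are hypothetical. The rule that sorts the nine raw values into three groups is also a guess; in particular, treating "Stray Animal" and "Stray Cat" as loose is an assumption rather than something the sample project specifies.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the CSB data; column names are assumptions
csb <- tibble(
  problem = c("Stray Dog At Large", "Stray Dog At Large", "Stray Cat",
              "Stray Dog Contained", "Surrender Dog", "Stray Animal Cntnd",
              "Stray Animal", "Pothole", "Surrender Cat"),
  status  = c("CLOSED", "CLOSED", "CLOSED", "CANCEL", "CLOSED",
              "CLOSED", "CLOSED", "CLOSED", "CLOSED"),
  x       = c(895000, 895000, 901000, 899000, NA, 890500, 0, 893000, 897000),
  y       = c(1020000, 1020000, 1015000, 1018000, NA, 1030000, 0, 1011000, 1025000),
  notes   = "not needed for mapping"
)

animals <- csb %>%
  distinct() %>%                                    # remove duplicate observations
  select(problem, status, x, y) %>%                 # keep only columns needed for mapping
  filter(grepl("^(Stray|Surrender)", problem)) %>%  # keep only the relevant categories
  filter(!is.na(x), !is.na(y)) %>%                  # drop rows with missing coordinates
  filter(!(x < 800000 & y < 980000)) %>%            # drop rows with invalid CSB coordinates
  filter(status != "CANCEL") %>%                    # drop canceled calls for service
  mutate(category = case_when(                      # assumed three-way grouping
    grepl("Cntnd|Contained", problem) ~ "stray animal, contained",
    grepl("^Surrender", problem)      ~ "animal surrender",
    TRUE                              ~ "stray animal, loose"
  )) %>%
  rowid_to_column(var = "id")                       # create row identification numbers
```

Duplicates are removed before the id column is created, since a unique id on every row would otherwise make every row look distinct.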
For projects using a non-standard data set: You will want to follow the same general process as above. You should have point data, though they may come as .csv or .shp data. If they were obtained from a geodatabase, export them to .shp and then import them into R. If you need to geocode the data (i.e. you have address or city identifiers but no spatial data for them), see Chris to discuss this process.
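For orientation, importing point data in either format with sf might look like the sketch below. The file contents and column names are invented, and temporary files stand in for your own data. No coordinate system is set here; in a real project you would pass the crs argument that matches your data source.

```r
library(sf)

# Hypothetical point data written to a temporary .csv so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:2,
             category = c("a", "b"),
             x = c(895000, 901000),
             y = c(1020000, 1015000)),
  csv_path, row.names = FALSE
)

# .csv: read the table, then convert the coordinate columns to point geometry
tabular <- read.csv(csv_path)
points  <- st_as_sf(tabular, coords = c("x", "y"))  # add crs = ... for real data

# .shp: write a temporary shapefile, then read it back with st_read()
shp_path <- tempfile(fileext = ".shp")
st_write(points, shp_path, quiet = TRUE)
from_shp <- st_read(shp_path, quiet = TRUE)
```

Either route ends with an sf object, so the cleaning steps described above apply in the same way afterward.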