Tools and theory to improve data analysis
Doctor of Philosophy
This thesis proposes a scientific model to explain the data analysis process. I argue that data analysis is primarily a procedure to build understanding and as such, it dovetails with the cognitive processes of the human mind. Data analysis tasks closely resemble the cognitive process known as sensemaking. I demonstrate how data analysis is a sensemaking task adapted to use quantitative data. This identification highlights a universal structure within data analysis activities and provides a foundation for a theory of data analysis. The model identifies two competing challenges within data analysis: the need to make sense of information that we cannot know and the need to make sense of information that we cannot attend to. Classical statistics provides solutions to the first challenge, but has little to say about the second. However, managing attention is the primary obstacle when analyzing big data. I introduce three tools for managing attention during data analysis. Each tool is built upon a different method for managing attention. ggsubplot creates embedded plots, which transform data into a format that can be easily processed by the human mind. lubridate helps users automate sensemaking outside of the mind by improving the way computers handle date-time data. Visual Inference Tools develop expertise in young statisticians that can later be used to efficiently direct attention. The insights of this thesis are especially helpful for consultants, applied statisticians, and teachers of data analysis.
Data analysis; Data science; Sensemaking; Grammar of graphics; Embedded plots