What Works Cities Blog Post: Describing the data pipeline: A vocabulary for city data analysis

What Works Cities Advisor Anthea Watson Strong outlines a common vocabulary for city leaders and practitioners to use when talking about the challenges and opportunities in the data pipeline.

June 1, 2016 – Recently, I was lucky enough to attend the What Works Cities Summit, a gathering of city leaders dedicated to using data to inform their policy-making. As I walked through the conference, I tried to identify whether each new city official I met was a strategist or practitioner. Strategists include mayors, city planners, police chiefs—officials seeking to use data-driven metrics in decision-making. Practitioners focus on gathering data from noisy and imperfect real-world sources and creating usable products, like dashboards and reports, which decision-makers use to inform their work.

The city data practitioners I met were mostly self-taught. The typical career path involved starting as the database manager, munging data in the basement in the late ’80s or early ’90s. As technology and data became more important in city government, these individuals were elevated and tasked with generating the reports upon which the strategists were depending.

I recognized many of the problems practitioners were tackling. There were questions about data authority, accuracy, objectivity, and coverage, all questions I encounter in my work at Google. It was clear that practitioners had developed many inventive solutions to these challenges, but the lack of a shared vocabulary hampered their ability to discuss best practices. I saw many conversations during the conference in which it took skilled practitioners ten or fifteen minutes of discussion to even reach the point where they could meaningfully share their experiences. I started as a self-taught data munger, and I identify with the practitioners I met. However, at Google I now work in a community of data analysts and data scientists who have developed a formal framework and vocabulary for describing the problems involved in this work.

The basic framework I use to describe the data pipeline consists of four steps:

[Image: flow chart depicting the four steps of the data pipeline]

1. Ingestion refers to the process of finding data and importing it into a database, even if that’s just a spreadsheet. Ingestion can involve reformatting existing files or it can mean, in the worst-case scenario, manually transcribing data from paper to a digital format.

2. Munging and wrangling refers to the often arduous task of getting data ready for analysis. Kind of a catch-all bucket of work, data munging involves untangling all the tricky knots that inevitably form when data is not well cared for. Munging can involve parsing fields that need to be separated, correcting spelling mistakes, dealing with missing data, normalizing data, and ensuring consistency of format throughout a dataset.

3. Computation, analysis, and modeling refers to the work involved in taking cleaned-up data and generating the metrics upon which decision-makers rely. In the most sophisticated data science shops, this would include building predictive models that correlate data with outcomes. It can also be as simple as writing a basic formula in a spreadsheet.

4. Reporting, the final step, is an often overlooked part of the data analysis pipeline. Helping people understand the lessons they can derive from analyzed data, as well as the strength or weakness of the data supporting each metric, makes decisions based on that data more likely to improve performance.
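
To make the middle of the pipeline concrete, here is a minimal sketch in Python of what munging and a simple computation might look like for a handful of service requests. Everything in it is an assumption made for illustration: the sample rows, the field names (`location`, `category`, `opened`, `days_to_close`), the misspelling lookup, and the helper functions are all hypothetical and not drawn from any city's actual system.

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw rows, as they might look right after ingestion.
RAW_ROWS = [
    {"location": "123 Main St, Ward 4", "category": "Pothole",
     "opened": "2016-05-02", "days_to_close": "3"},
    {"location": "9 Elm Ave, Ward 4", "category": "pot hole ",
     "opened": "05/03/2016", "days_to_close": ""},
    {"location": "77 Oak Rd, Ward 2", "category": "Streetlight",
     "opened": "2016-05-04", "days_to_close": "10"},
]

# Assumed lookup from known misspellings to one canonical spelling.
CATEGORY_FIXES = {"pot hole": "pothole"}

# Assumed set of date formats that appear in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return None if none of them match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def munge(row):
    """Munging: split fields, fix spelling, mark missing values, unify formats."""
    # Parse one combined field into two separate ones.
    street, _, ward = row["location"].partition(", Ward ")
    # Normalize case and whitespace, then correct known misspellings.
    category = row["category"].strip().lower()
    category = CATEGORY_FIXES.get(category, category)
    # Keep missing data explicit as None instead of guessing a value.
    days = row["days_to_close"].strip()
    return {
        "street": street,
        "ward": int(ward) if ward.isdigit() else None,
        "category": category,
        "opened": parse_date(row["opened"]),
        "days_to_close": int(days) if days else None,
    }


def average_days_to_close(rows, category):
    """Computation: a basic metric that skips rows with missing values."""
    values = [r["days_to_close"] for r in rows
              if r["category"] == category and r["days_to_close"] is not None]
    return mean(values) if values else None


clean = [munge(r) for r in RAW_ROWS]
pothole_avg = average_days_to_close(clean, "pothole")
print(f"Average days to close a pothole request: {pothole_avg}")
```

The missing `days_to_close` value is kept as `None` and left out of the average rather than being filled in. That is only one of several reasonable choices, and whichever one an analyst makes is worth stating in the reporting step so decision-makers know how much data sits behind the metric.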

Each step of this process is associated with common challenges, sources of error, and approaches to efficient handling. Adopting this framework and vocabulary would help city analysts share lessons learned, identify best practices, and form a stronger community.


Anthea Watson Strong, a member of the What Works Cities Advisory Board, is a technologist and community organizer working at the intersection of the Internet and social systems. She is part of Google’s Civics team, building products that help decision-makers govern more effectively, help people access public services more efficiently, and help users engage in the civic process. Follow her on Twitter: @antheaws.

 


Posted by Anthea Watson Strong
What Works Cities Advisor