Starting a data storytelling project
I love many of the projects at The Pudding, and they’ve inspired me to try telling a story with data, not just make some charts and graphs. There are more than a few challenges for me to tackle, but the principal one is finding the right data set: without an emotional investment in it, any storytelling project will fall flat.
Since November’s election, and especially since January, I’ve struggled to find ways to engage and act in the face of the onslaught. I recognize how privileged my existence is, and even for me, everything is overwhelming on a daily basis. I’ve read many times that it’s important to stay active and do something, no matter how small, that helps you to feel that you’re part of a solution.
So why not combine my desire to start a dataviz project with my search for meaningful activism? It should be a great opportunity to keep active, learn something new, and contribute in some small way. And I’ll certainly have a real emotional investment in the project to keep me motivated.
Finding data
I used ChatGPT’s deep research tool to see what kinds of datasets it could dig up. Here’s my initial prompt:
I want to start a data visualization and storytelling project. It’s going to focus on democracy, politics, and government, and use contemporary data. Specifically, I’m looking for a project to support democracy and the rule of law, and combat the current Trump administration, which is trampling on both since his inauguration in January 2025. I need help finding good data sets. A good data set:
- Relates to the topics of democracy, politics, government, and the rule of law
- Is publicly available and relatively well structured
- Is amenable to storytelling and visualization in service of activism
An example of a good data set would be the 5Calls.org call logs. Unfortunately, they’re not publicly available. They’d be great because the data is simple (just call time/date, representative contacted, and call result) and would allow me to connect the data to subsequent actions in Congress via their API, or to Google Trends search terms, or geography, or many other related things.
The results were pretty darn good, and one of the ideas looked particularly promising: the Crowd Counting Consortium (CCC). They’ve accumulated data on protests and activism dating back to January 2017. The data is public and grouped into three phases, and I downloaded all three phases in tab-separated format.
Yay—off to the races!
Reconciling coding across phases
Not quite. Each phase uses a slightly different coding scheme: fields are named inconsistently or simply don’t appear in the other phases. So my first task was to analyze the three coding standards and reconcile them.
Alongside each phase’s data, CCC publishes a coding guide in Word format. I downloaded those as well, and then asked ChatGPT to help me make sense of them. After some initial experimentation and iteration, I settled on the following prompt ([...] elides the full list of fields for brevity), and uploaded the three coding guides.
I’m uploading three coding guide files describing three phases of data collection for a survey. They explain how to code data for a particular phase. Note that the coding guides don’t necessarily contain each field; below is a summary of all the fields contained in each phase:
Phase 1 fields: date,locality,state,location_detail,online,type,macroevent,actors [...] Phase 2 fields: date,locality,state,location_detail,online,type,title,macroevent,organizations [...] Phase 3 fields: date,locality,state,resolved_locality,resolved_state,resolved_county,fips_code [...]
Create a table that summarizes fields across all three phases with the following columns:
- Name of the field
- Description from the coding guide(s) (summarize down to a short paragraph if needed)
- In phase 1? (TRUE/FALSE)
- In phase 2? (TRUE/FALSE)
- In phase 3? (TRUE/FALSE)
- Changes: Split from a previous field, renamed, added, deleted, etc.
Let me know if you find any fields that don’t fit into this process, e.g. a field with the same name that’s used in a significantly different way across phases.
As a sanity check, please verify that the final table has the correct number of rows. By my count, that should be 114.
I imported the resulting table into Numbers and did a bit of cleanup. Here’s a snippet:
| Field Name | Short Description | In Phase 1? | In Phase 2? | In Phase 3? | Changes |
|---|---|---|---|---|---|
| date | Date of the event in YYYY-MM-DD format | TRUE | TRUE | TRUE | Same core meaning in all phases |
| locality | Name of the city or town in which the event occurred | TRUE | TRUE | TRUE | Consistent usage |
| state | Two-letter US state/territory abbreviation | TRUE | TRUE | TRUE | Consistent usage |
| location_detail | Text describing the event’s specific location within the city/town | TRUE | TRUE | FALSE | Renamed to ‘location’ in Phase 3 |
| online | Indicator (0/1) for online-only events | TRUE | TRUE | TRUE | Consistent usage |
| type | Type(s) of protest action (e.g., protest, march, rally) | TRUE | TRUE | FALSE | Renamed to ‘event_type’ in Phase 3 |
| macroevent | Identifier linking a protest with its related counter-protest(s) | TRUE | TRUE | TRUE | Usage mostly consistent across phases |
| actors | Organizations and/or participant descriptions (combined) | TRUE | FALSE | FALSE | Split into ‘organizations’ and ‘participants’ from Phase 2 onward |
Importing and unifying data
I decided R was the appropriate environment for me to wrangle all these fields and unify the data. Plus, it’s the best way for me to dig into the data once it’s usable. The only catch is that I’ve never used R.
So: ChatGPT to the rescue again. I was able to iterate on code that accomplishes the following (a condensed sketch follows the list):
- Reading the files
- Fixing import errors
- Merging and renaming fields
- Unifying tables
- Cleaning data (e.g. fixing misspellings)
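Here’s roughly where that iteration landed, using readr and dplyr. The file names, the misspelling fix, and the handful of renames shown here are illustrative stand-ins rather than excerpts from my actual scripts, which handle many more fields:

```r
library(readr)
library(dplyr)

# Read the three phases as tab-separated files (file names are placeholders),
# forcing every column to character so nothing gets silently mis-parsed.
phase_files <- c(phase1 = "ccc_phase1.tsv",
                 phase2 = "ccc_phase2.tsv",
                 phase3 = "ccc_phase3.tsv")
phases <- lapply(phase_files, read_tsv, col_types = cols(.default = "c"))

# Bring Phase 3 field names in line with Phases 1 and 2
# (two of the renames from the mapping table above).
phases$phase3 <- rename(phases$phase3,
                        location_detail = location,
                        type            = event_type)

# Keep the columns all three phases share, stack them, and tag each row
# with the phase it came from.
common_cols <- Reduce(intersect, lapply(phases, names))
ccc <- bind_rows(lapply(phases, function(df) df[common_cols]), .id = "phase")

# Cleaning example: trim whitespace and fix a hypothetical misspelling.
ccc <- ccc %>%
  mutate(locality = trimws(locality),
         locality = recode(locality, "Pittsburg" = "Pittsburgh"))
```

A quick `glimpse(ccc)` in RStudio is enough to sanity-check that the phases stacked cleanly.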
Now I have a unified dataset in R that I can explore in RStudio. I can also export it to a number of formats, including SQLite, for use in other tools like Datasette.
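The SQLite export can be done with DBI and RSQLite along these lines (the database file and table names below are placeholders):

```r
library(DBI)
library(RSQLite)

# Write the unified table out as a single-table SQLite database.
con <- dbConnect(RSQLite::SQLite(), "ccc.sqlite")
dbWriteTable(con, "events", ccc, overwrite = TRUE)
dbDisconnect(con)
```

Datasette can then serve that file directly with `datasette ccc.sqlite`.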
What’s next?
What are some good stories I can tell? I’m ready to start exploring answers to that question. I mentioned a few initial ideas in the first prompt above, and a few more come to mind:
- Correlate protest actions with congressional action
- Find connections to contemporaneous news
- Examine geographical relationships
- Explore the evolution of issue salience over time (a rough starting point is sketched below)
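As a first, very rough cut at that last idea, a monthly event count per issue would show how attention shifts over time. The `issues` column name here is a guess at whatever the per-event topic field ends up being called in the unified data:

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Count events per month and issue, then plot one trend line per issue.
# "issues" is a stand-in for the per-event topic field in the unified data.
ccc %>%
  mutate(month = floor_date(as.Date(date), "month")) %>%
  count(month, issues) %>%
  ggplot(aes(month, n, color = issues)) +
  geom_line() +
  labs(x = NULL, y = "Events per month", color = "Issue")
```

In practice the topic field probably holds multiple tags per event, so it would need splitting (e.g. with tidyr::separate_rows()) before counting; that’s exactly the kind of exploration I’m looking forward to.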
Want to team up?
I’m comfortable with many aspects of the project, and willing to learn the others as I go, so I could continue this project solo using ChatGPT. But I’d love to include a partner in this journey. I love working in a small team to think through ideas, create, and iterate.
A partner with experience in R, data storytelling, or both, would help me get the most out of this data and tell a great story. Or somebody who’s motivated and willing to learn alongside me. If you’re interested, or know someone who is, reach out to me on LinkedIn.
I have a public Git repo documenting the process so far, including all the R scripts I developed. Check it out if you want!
Here’s hoping I can have some small positive impact in these times, which are unspeakably difficult for many both here and across the world.