This week I read the conclusion to How Charts Lie and continued with ch. 7+8 in The Truthful Art. I really enjoyed learning about Florence Nightingale and her mortality rates chart. It was useful to see the same data in the stacked bar graph because it helped me better understand the data in “the Wedges”. I’m curious as to why the later time period, 1855-1856, is on the left instead of the right; I would have thought that time would flow logically from left to right. Her work is eye-catching and unusual, yet it presents information effectively.
A nice little refresher on a few important principles:
“For a chart to be trustworthy, it needs to be based on reliable data.”
“A chart can be a visual argument but it’s rarely sufficient on its own.”
“Data and charts can save lives and change minds.”
Florence’s chart helps push the conversation forward and turn words into actions. Besides just answering questions, charts should promote curiosity. She also reminds us that it is imperative to keep the purpose behind a chart in mind.
I got a nice refresher on statistics in The Truthful Art this week. I also enjoyed learning more about frequency charts, which I feel like I haven’t seen often in the past or I’ve just passed over without giving them much thought….
As I’ve learned more about charts in the past few weeks, the ones I find the most interesting and meaningful to me are the ones that are more interactive, such as the interactive visualization about student performance in the largest cities of Ukraine. Speaking of all this learning about charts, it sure is nice seeing so many kinds of charts in real-life scenarios; I think back to math classes, especially stat class, where we learned about certain charts in theory, such as box-and-whisker, but did not see them applied in relevant or interesting ways. Or maybe that was just my teacher.
I cannot believe that the national public television channel in Spain showed such a flawed chart! Right away I figured that it had something to do with tourism jobs and the ebb and flow of tourists in Spain; February to August was too short a duration to properly display an unemployment drop of some sort.
Mix Effects: “the fact that aggregate numbers can be affected by changes in the relative size of the subpopulations as well as the relative values within those subpopulations.” -Zan Armstrong and Martin Wattenberg. We had been discussing this topic for several chapters now, and it was nice to put a name to it.
I was super intrigued by the horizon chart…. loved the simplicity and effectiveness it had. This chart seems most appropriate for more broad generalizations.
R for Journalists ch. 5: Spatial analysis
I took a minute to explore suggested works that utilized spatial analysis and really enjoyed The Geographic Divide of Oscar Films.
A problem that has come up multiple times in the past was an error involving %>%. RStudio was not recognizing it as a valid function, so I stackoverflow-ed it here: https://stackoverflow.com/questions/30248583/error-could-not-find-function/30248632 and learned that it is the pipe operator from the magrittr package, so it only works once magrittr (or a package that re-exports it, such as dplyr) has been loaded.
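To make the pipe less mysterious for future me, here is a minimal sketch with made-up numbers. It only assumes that magrittr is installed; the variable names are my own.

```r
# %>% is provided by magrittr; dplyr and the rest of the tidyverse re-export it
library(magrittr)

values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# The same computation with the pipe is read left to right
piped <- values %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

If library(magrittr) (or library(dplyr)) is left out, the "could not find function" error comes back.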
How to upload html file of your map into web page:
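One possible way to do this, as a rough sketch and not necessarily the tutorial's method: save the map as a standalone HTML file with saveWidget() from the htmlwidgets package, then upload that file and embed it in the page (for example with an iframe). The map, the coordinates, and the file name below are placeholders I made up.

```r
library(leaflet)
library(htmlwidgets)

# A hypothetical map; the coordinates are placeholders
my_map <- leaflet() %>%
  addTiles() %>%
  setView(lng = -77.04, lat = 38.90, zoom = 10)

# Write the map to an HTML file (a temp folder here; use your site folder instead).
# selfcontained = FALSE puts the JavaScript/CSS in a "my_map_files" folder next to it.
out_file <- file.path(tempdir(), "my_map.html")
saveWidget(my_map, file = out_file, selfcontained = FALSE)
```

With selfcontained = FALSE, both the HTML file and its companion folder need to be uploaded together.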
I also learned about the various third party tiles that are available through addProviderTiles() at this link: http://leaflet-extras.github.io/leaflet-providers/preview/index.html
This week I read ch. 3+4 in The Truthful Art by Alberto Cairo, and then I continued with Ch.5+6 in How Charts Lie by Cairo. The chapters in The Truthful Art this week really emphasize the reality of a visualization being a model. The way our brain processes the world around us takes a similar approach: our perception forms a model that results from the mediation between our brain and the world, and it has a certain degree of accuracy. However, it should be noted that models do not oversimplify information; instead, they add clarity, which usually means that more information is brought to light. This also means that there is still hidden information.
A point that has been brought up several times in the reading is the idea that more often than not a faulty model is the result of a well-intentioned designer not paying proper attention to the data. I am guilty of assuming that faulty data is always intentional. I think that it is appropriate to analyze data visualizations with a degree of caution, but it is important to not overanalyze the intentions beyond the data visualization.
What you design is never exactly what your audience ends up interpreting.
^As a music major, this is obvious. When you perform in front of a crowd, you are presenting them with your interpretation of a piece of music, and, beyond that, you have no control over their reaction. I have a difficult time applying this to other areas, such as visualizations and graphic design; I want people to see the great vision that I see because in my head I selfishly think it's the b e s t. With music, it's widely accepted that performers will have different interpretations and one is not necessarily better than the other (Unless maybe your Baroque performance is super wacky and out of character, let's be honest). I find it interesting that this same view hasn't transferred over to visualizations and graphic design yet. I'm still able to take criticism for both my music interpretations and my visuals, but the underlying problem is that I still don't see my digital design work as an interpretation. I'm working on it! I digress...
It's important to consider that human survival is built on instinct, not on discovering any kind of truth. We make a series of quick and intuitive decisions that sometimes fill in information that is not present, and we like cause-and-effect explanations. While these judgements and intuitions are an important part of reasoning, there is more to it.
I completely agree with Mark Monmonier's Skills of the Educated Person. This involves four characteristics that any educated person should have in their skillset: Literacy, Articulacy, Numeracy, and Graphicacy. I think more people could get acquainted with the latter two (including myself).
I also learned more about what makes a good conjecture: it is made of several components, and these need to be hard to change without making the whole conjecture useless. A flexible conjecture would not do well. A good conjecture typically consists of components that are naturally connected.
I never realized how problematic it could be to read storm forecast prediction maps, from reading the "cone of death" to misinterpreting the upper dotted portion of the cone as rainfall. This was my personal favorite, and I'm guilty of imagining this before self-correcting:
How NOT to read this map.
Picture taken from How Charts Lie.
The cone is an example of this key principle: the success of a chart depends on who designs it and also who reads it; the original purpose of this map was for experts like risk managers and weather forecasters.
Furthermore, it's especially good to view a chart with caution if it's a genre you are interested in; everything is just going to look better if you like that particular topic. It's important to put your bias aside as much as you can to see what data has been presented.
One last tidbit: if your purpose is to show relationships between countries and regions, then the data should be displayed on charts at the country level or regional level. If the purpose is to show relationships between individuals, the chart should show individual-level data, comparing the people within each country or region to each other.
I am currently working on R for Journalists Ch.4 this week.
Wow! Andrew is wearing a suit!
This tutorial finally gets into visualizing data with ggplot. It discusses how R can be used as a hands-on way to test out different visualizations for your specific data set.
I keep seeing stacked charts popping up. I find myself constantly looking at why the figures are stacked the way they are, and, like in the example in this lesson, there isn't any rhyme or reason. I think a pie chart or a bar graph can achieve the same purpose most of the time. To be continued....
This is a very nice description of each component in the ggplot function:
The illustration went on to include the expand_limits() function, which was used to force the x- and y-axes to start at 0:
With the expand_limits() function.
Without the expand_limits() function. X and Y start at 20 instead of 0.
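Here is a minimal sketch of the same idea. I am assuming ggplot2's built-in mpg data set as a stand-in, since the tutorial's data is not reproduced here, so the axis ranges will differ from the screenshots.

```r
library(ggplot2)

# Default axes: they start near the smallest values in the data
base_plot <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# expand_limits() stretches the axes so that they include the given values
with_zero <- base_plot + expand_limits(x = 0, y = 0)

base_plot
with_zero
```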
Back to stacked charts:
ggplot(data=ages, aes(x=actor, fill=Genre)) + geom_bar()
I discovered that putting a category in the "fill" aesthetic results in the creation of a stacked chart.
ggplot(data=ages, aes(x=actor, fill=Genre)) + geom_bar(position="fill")
^this rescales every bar to the same height, so the Y-axis shows proportions (0 to 1) instead of counts.
I'm still not a fan of the stacked chart, but at least the set proportions of the spinogram make comparison easier between the different actors and their genres.
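For anyone without the ages data, a comparable sketch with ggplot2's built-in mpg data set (my substitution, not the tutorial's data) shows the two versions side by side:

```r
library(ggplot2)

# Counts stacked on top of each other: bar heights differ by class
stacked <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# position = "fill" rescales each bar to a total height of 1
proportional <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

stacked
proportional
```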
In general, it seems like a better idea to manipulate the visual of data as opposed to messing with the data itself. I enjoyed exploring within the geom_density() options the most. Alpha is a great option to show overlap between different categories effectively and it is also aesthetically pleasing:
What's nice about alpha in this situation is that anything lower than .5 does not visually interfere with the individual paths of the actors.
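A small sketch of the alpha idea, again assuming the built-in mpg data set in place of the actors data; the value 0.4 is just one choice below .5:

```r
library(ggplot2)

# One density curve per drive type; alpha makes the fills see-through,
# so overlapping regions stay readable
density_plot <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4)

density_plot
```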
Overall this tutorial complemented the first ggplot tutorial very well. This tutorial was a nice refresher on the basics before expanding more on visualization types and also how to add multiple variables as aesthetic types. I also got more clarification and additional examples about facets and layers.
Another important thing to remember: once data is structured correctly, then you can use ggplot2 to slice, group, and facet the visual appearance of the data.
To reorder chart labels, we transform the data via the forcats package, which is part of tidyverse, using the function fct_reorder(factor to reorder, variable to reorder by, .fun = function to reorder by, ..., .desc = FALSE or TRUE).
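A tiny worked example of that reordering, with a made-up data frame (the actors and film counts are hypothetical):

```r
library(forcats)

# Hypothetical data: three actors and their film counts
films <- data.frame(
  actor = c("A", "B", "C"),
  n_films = c(12, 30, 7)
)

# Order the actor levels by film count, smallest first
ascending <- fct_reorder(films$actor, films$n_films)
levels(ascending)   # "C" "A" "B"

# .desc = TRUE flips the order, largest first
descending <- fct_reorder(films$actor, films$n_films, .desc = TRUE)
levels(descending)  # "B" "A" "C"
```

Because ggplot2 draws categories in level order, plotting the reordered factor changes the order of the chart labels.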
I got quite a kick out of the Mr. Rogers graphic. Good Instagram story material.
The "Time is a Flat Circle" installment by John Muyskens had super interesting code to look at for inspiration. However, I do not see how the circular design adds to the purpose of the visualization in any way; the data doesn't come "full circle" or anything like that:
This week I read ch. 5+6 in The Truthful Art by Alberto Cairo, and then I continued with Ch.3+4 in How Charts Lie by Cairo. It's nice knowing that "how will I know if I chose the right graphic form to represent my data?" is on the mind of even the most experienced professionals. The more I learn about graphic visualization, the more I seem to feel overwhelmed by the various choices that I have. However, reminding myself of our natural ability to identify visual patterns and outliers helps me trust my instincts. Here are some tips from Cairo:
Think about the message you wish to convey; plot what you need to plot.
Try different graphic forms.
Arrange the graphic's components so that it is simple to extract meaning.
Test the outcome yourself and with others.
Also, I appreciated the Richard H. Thaler references.... my dad just left Misbehaving on my stack of school books to read this semester.
I enjoyed exploring Severino Ribecca's Data Visualization Catalogue online. It is very user-friendly, has a nice interface and designers of any level can find its descriptions useful. I like that you can search by function or by list view, but ultimately the most useful visual of the essentials was what was derived from Cleveland and McGill's hierarchy; I like the idea of grouping the tasks based on enabling specific or more general estimates since it goes along with tip #3 from above. However, I do not think the most successful chart necessarily is defined by being at the top of the hierarchy; again, it depends on what data is being presented and what we want to stand out to viewers.
Figure 5.7, the diagram of European asylum seeker application decisions, was one of the most complex diagrams I've looked at. I understood the concept by reading the title and labels, but the first and last column involving the origin country confused me. For me, there was too much information overlapping in the same visual for me to decode at once; I would have preferred multiple visuals that had more clarity in each.
I also got a good review of our friends mode, median and mean. Resistant statistics, histograms and the weighted mean were topics I had completely forgotten about from Stat class and I appreciated the refresher. I learned that the skew of a distribution in a histogram is a good place to begin when exploring other possibilities with the data. Speaking of exploration, it's important to note that we should not rely on one statistic, chart, or map when pursuing exploratory work.
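A quick sketch in base R helped me remember why the median is called resistant. The numbers are invented for illustration:

```r
# Hypothetical salaries (in thousands) with one extreme outlier
salaries <- c(30, 32, 35, 38, 400)

mean(salaries)    # 107: pulled far upward by the outlier
median(salaries)  # 35: barely affected, which is what "resistant" means

# Weighted mean: each value counts in proportion to its weight
# (2 * 3 + 4 * 1) / (3 + 1) = 2.5
weighted.mean(c(2, 4), w = c(3, 1))
```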
"Garbage in, garbage out". Faulty data leads to faulty visuals. The first thing to look for when reading a chart is what sources the author has identified. No credited source? Red flag. Furthermore, make sure to identify what is being counted and how it's being counted. I feel like I've already developed a more critical eye for reading charts and checking their sources, and I also know to be mindful about what charts or graphs I'm sharing online with other people.
It was interesting to read about the real reason why porn consumption per capita (provided by PornHub) in Kansas was so much higher - a glitch in the data. If people are using a VPN and location cannot be determined, geographical assignment automatically places them at the center of the contiguous United States, which is Kansas.
I also couldn't believe the similarities between the 2017 chain migration chart and the 1930s chart depicting the result of allowing the "inferior race" to multiply.
An important point to remember: any chart is a simplification of reality.
R for Journalists Ch.2+3
IMPORTING CSV FILES
Something nice about R is that no package is required for importing CSV. The package readr can still be used to assist, though. You can get data through the URL or a local file.
Base R function to import CSV: read.csv("url", stringsAsFactors = FALSE)
The argument stringsAsFactors = FALSE is needed because, in R versions before 4.0.0, read.csv() reads text columns as factors by default, but they need to be treated as strings instead.
The plus side of using the readr package is that it assumes that characters are strings and not factors.
When you are done manipulating the data, save your dataframe as a CSV file with write_csv() from the readr package:
Example: write_csv(df_csv, "transformed_data.csv")
Most likely the exported CSV file will contain NAs, so we need to replace these with blanks, which can be accomplished with the code found below:
Example: write_csv(df_csv, "data/transformed_data.csv", na="") replaces NAs with blanks
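To see what na = "" actually does, here is a small round trip. The data frame is made up, and I write to a temporary file so nothing on disk gets overwritten:

```r
library(readr)

# Hypothetical data frame with one missing value
df_csv <- data.frame(name = c("a", "b"), value = c(1, NA))

out_file <- tempfile(fileext = ".csv")

# na = "" writes missing values as empty cells instead of the text "NA"
write_csv(df_csv, out_file, na = "")
readLines(out_file)

# Reading it back with base R; text columns stay as strings
df_back <- read.csv(out_file, stringsAsFactors = FALSE)
str(df_back)
```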
When importing Excel files, the readxl package needs to be installed in R. Also, the Excel sheet needs to be downloaded locally before importing it into R; it can't just be accessed from the online server.
Example: drug_deaths <- read_excel("local source here", sheet=1, skip = 2)
^It's important to always specify what sheet number you are working with, and, if skipping rows, the number of rows to skip.
There are two ways to get rid of NAs (missing data). subset() is a function of base R. The alternative is filter(), which is a function of the dplyr package.
Base R example: drug_deaths <- subset(drug_deaths, !is.na(Year))
Dplyr example: drug_deaths <- filter(drug_deaths, !is.na(Year))
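Both approaches on a toy table, so the result is easy to check by eye. The numbers are invented, not the tutorial's drug deaths data:

```r
# Hypothetical stand-in for the imported spreadsheet
drug_deaths <- data.frame(
  Year = c(2015, NA, 2017),
  Deaths = c(10, 12, 15)
)

# Base R: keep the rows where Year is not missing
base_clean <- subset(drug_deaths, !is.na(Year))

# dplyr: same result with filter()
dplyr_clean <- dplyr::filter(drug_deaths, !is.na(Year))

nrow(base_clean)
nrow(dplyr_clean)
```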
CSV > EXCEL
When importing data, CSV is preferred to Excel because it is simpler and more compatible. However, a CSV file holds only one sheet at a time and it does not include formatting.
My preliminary ponderings on delimited pipe files were the following:
"read_tsv() vs read_delim()? What the hell is a delimited pipe file"
Upon further research, I found a very good explanation:
"A pipe delimited text file has this character “|” as a separation character between data fields.
John|Smith|100 n main street, apt 1|555–555–5555|City, State|
Do you see why pipe may be better than a comma when separating fields? It’s a character you have to enter in the other character option when importing data into Excel or Access. It is a less commonly used punctuation mark so fairly safe to use as a separator character." -Julie Frey from Quora
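Putting that explanation to work, here is a minimal sketch of reading a pipe-delimited file with read_delim() from readr. The column names and the temporary file are my own invention; the data row borrows from the example above:

```r
library(readr)

# Write a tiny pipe-delimited file to a temporary location
pipe_file <- tempfile(fileext = ".txt")
writeLines(
  c("first|last|address",
    "John|Smith|100 n main street, apt 1"),
  pipe_file
)

# delim = "|" tells readr to split fields on the pipe character,
# so the comma inside the address is left alone
people <- read_delim(
  pipe_file,
  delim = "|",
  col_types = cols(.default = col_character())
)

people$address
```

As far as I can tell, read_tsv() is the same idea with the delimiter fixed to a tab.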
Suuuper straightforward to use JSON data: you just grab the data from online and then use the jsonlite library.
1. Grab URL for data:
2. Save URL data in a variable:
stations <- fromJSON(json_url)
3. Viewing. Notice the difference that a data frame makes with the data appearance below:
Imported data (as is):
Imported data, as-is.
Imported data, with data frame.
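The steps above can be tried without a URL. In this sketch I pass a small made-up JSON string to fromJSON() instead of the tutorial's feed, so the field names are hypothetical:

```r
library(jsonlite)

# Hypothetical JSON: an array of two station records
json_text <- '[{"station": "A", "bikes": 5}, {"station": "B", "bikes": 2}]'

# By default fromJSON() simplifies an array of records into a data frame
stations <- fromJSON(json_text)
stations

# With simplification turned off, the same data arrives as a nested list
stations_raw <- fromJSON(json_text, simplifyVector = FALSE)
str(stations_raw)
```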
SPSS stands for Statistical Package for the Social Sciences and is owned by IBM. It’s also very expensive and usually only large businesses or organizations own licenses. It provides a graphical interface useful for even deeper analysis.
Copying and pasting data: tibble is more forgiving with bringing in information with odd characters. Use it.
https://github.com/andrewbtran/muckrakr = super helpful for combining spreadsheets from the same folder.
str_sub(strings, start, end) extracts and replaces substrings.
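A short sketch of both uses of str_sub() from the stringr package, with made-up date strings:

```r
library(stringr)

dates <- c("2018-09-01", "2018-10-15")

# Extract: characters 1 through 4 of each string
years <- str_sub(dates, 1, 4)
years   # "2018" "2018"

# Replace: assigning to str_sub() overwrites that same span
str_sub(dates, 1, 4) <- "2019"
dates   # "2019-09-01" "2019-10-15"
```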
The only problem I encountered was towards the end of Ch. 3, when I received the notification on R that Package “glue” not available (as a binary package for R version 3.5.1.) in dealing with the dates section of the tutorial. I assume there was an update and the library wasn't compatible with the update?
This week I read the intro and chapters 1 + 2 in The Truthful Art by Alberto Cairo. A big takeaway for me was the idea that infographics and data visualizations should exist to inform people, not merely to entertain them or sell them anything. As Cairo says, the main goal should be "to increase society's collective knowledge". We see so many pointless pictures, stories, posts, statuses, etc. that don't move society forward in any way. What if more people kept this thought in mind while they put themselves on the internet? (Not that this would ever happen, but can you imagine what that would be like?) There were several basic definitions that I was glad to review, such as deceit not necessarily being conscious. Even for the super basic terms of this course, I wouldn't have been able to give a very specific definition. I'll keep them here as a reminder to myself, in Cairo's words:
Visualization: any kind of visual representation of information designed to enable communication, analysis, discovery, exploration, etc.
Chart: Display in which data are encoded with symbols that have different shapes, colors, or proportions.
Infographic: Multi-section visual representation of information intended to communicate one or more specific messages.
Data Visualization: Display of data designed to enable analysis, exploration, and discovery.
A big realization for me was that data visualizations aren't supposed to inform viewers of predetermined conclusions; viewers are left to come to various conclusions on their own. Another term I had never heard of before was a News Application, in which data can be uniquely customized to each individual. A good example of this was the "Health Care Explorer". I was most interested in the multimedia elements of the "Beyond the Border" project. I liked the interactivity of the project, and the different layers really helped the viewer understand the information being presented to them. The design is clean, simple, and does not distract.
When reading about successful visualizations, I was especially drawn to the section that discusses beauty. Cairo mentions this famous quote by Roger Scruton: "Art moves us because it is beautiful, and it is beautiful because it means something. It can be meaningful without being beautiful; but to be beautiful it must be meaningful." As an undergrad music major, I also think about the application of the statement to the performance of classical music. In both visualizations and music performances, the beauty of the product is created by a widespread audience experiencing it as beautiful.
I also read the intro and chapters 1 + 2 in How Charts Lie by Alberto Cairo. Of course, for me and I'm sure a lot of other people, the first thing that came to mind when reading the title of this draft was politics. A scary reality that Cairo addresses is that as ideological polarization increases, so does the divide in trusting/mistrusting information presented in charts.
I think it was important to point out the flaw in Paul Krugman's chart about the US murder rate: why only post through 2014? Cairo reminds us to "never attribute to malice what could be more easily explained by absent-mindedness, rashness or sloppiness". With a track record like Krugman's, it seems unusual that this was overlooked. The debate then turns into what information is considered relevant. After learning of the actual percentages that were being compared in the tax rate chart produced by Fox News in 2012, I just can't imagine being okay with creating such a gross exaggeration as this. It makes me wonder what the employer's specific instructions were for creating this graph... However, in a way, this is an example of what one person considered to be relevant data.
In general, a big takeaway for me was realizing that all charts can be misleading, even the best ones. I really liked the tricks that were listed to assist in dissecting a chart: looking at scale labels, dispersion of data, thinking about the chart having imaginary quadrants, and drawing a line to mark the apparent relationship between x and y.
I had never heard of a treemap before, and I find them extremely disorienting. I understand that the size of each rectangle is associated with the value it represents, but why is each specific rectangle where it is in the chart? Why isn't Mexico below the US, and Brazil below that? It seems that a pie chart is an easier and more familiar way to understand the same information.
"A picture is worth a thousand words"... if you know how to read it and interpret it!
^Love that addition. It's especially relevant with how much people skim content these days.
I also really appreciated the glimpse at William Playfair's Atlas. What is easily recognizable today as a line chart was the first of its kind in the year 1785. As somebody who is interested in the study of ancient music, I am used to studying oral traditions that appeared far earlier than the written word. I am curious about verbal descriptions that must have happened prior to Playfair that eventually evolved into the visual aspect of charts, and also how the system of the number chart evolved and influenced this.
Here is the basic scatterplot for ggplot(data=mpg). geom_point() is the basic function that adds points to the graph to create the scatterplot.
I spent time playing around with the color aesthetic. This graph is the product of:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Notice how the color is listed outside of the aesthetic values, resulting in the color being applied universally.
Compared to Figure 2, color is placed inside aes() here, which maps it to another variable in the chart:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = drv))
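To convince myself of the difference, a small sketch that builds both plots and inspects the colors ggplot2 actually assigns. layer_data() is a ggplot2 helper that returns the computed data for a layer; the object names are mine:

```r
library(ggplot2)

# color outside aes(): one fixed color for every point
fixed_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# color inside aes(): color is mapped to the drv variable
mapped_color <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv))

unique(layer_data(fixed_color)$colour)          # only "blue"
length(unique(layer_data(mapped_color)$colour)) # one color per drv value
```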
One of the exercises asked for the advantages and disadvantages of a subplot. On one hand, it seems easier for viewers to understand various layers of data. However, if a data set were quite large, it seems like many subplots would be overwhelming for the viewer.
I also learned about geoms and how multiple geoms can be used to represent data in a plot. A geom is the geometrical object that a plot uses to represent data, and the data is typically grouped automatically for discrete variables.
Mappings can be set globally in ggplot() or locally inside a geom function, and local mappings allow different aesthetics in different layers. The first line of code below sets the mapping once in ggplot(), so both geoms share it; the second repeats the mapping inside each geom and draws the same plot:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(x = displ, y = hwy))
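Where local mappings start to matter is when the layers should look different. In this sketch (my own variation on the two lines above), color is mapped to class only in the point layer, so the smooth line stays a single line:

```r
library(ggplot2)

layered <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  # local mapping: only the points are colored by class
  geom_point(mapping = aes(color = class)) +
  # inherits x and y from ggplot(), but not the color mapping
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

layered
```

Moving color = class into ggplot() instead would give one smooth line per class.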
I then tinkered with coord_polar. I was a little confused why x= factor had to have either a length of one or the number of values in the chart. Still researching, I'm sure it will make more sense the more I learn:
ggplot(data = mpg, mapping = aes(x = factor(1), fill = class)) + geom_bar(width = 1) + coord_polar(theta = "y")
Another area I need more clarification on is stats. I understand stat_summary(), but I do not understand why geoms and stats are generally interchangeable, or why stats are used for overriding default stats and transformed variables. Will research this later in the week.
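One small experiment that helped a little, offered as a sketch and not a full explanation: every geom has a default stat and every stat has a default geom, so geom_bar() and stat_count() end up drawing the same bars.

```r
library(ggplot2)

# geom_bar() uses stat_count() behind the scenes to count rows per class
with_geom <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class))

# stat_count() uses geom_bar() behind the scenes to draw the counts
with_stat <- ggplot(data = mpg) +
  stat_count(mapping = aes(x = class))

# Both layers compute the same counts
layer_data(with_geom)$count
layer_data(with_stat)$count
```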
More discoveries: Concatenate
Charts can be created in R with 2 lines of code:
x <- rnorm(100)  # assigns 100 random numbers (drawn from a standard normal distribution) to x
plot(x)  # basic function to display x
The Race column below is recognized as a factor variable with three levels: black, hispanic and white. Factor variables are categorical; R stores each one as integer codes paired with string labels (the levels). I think they are mostly used for statistics, so not sure how much journalists would actually use factor variables.
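A minimal sketch of what a factor looks like under the hood, using a made-up vector in place of the Race column:

```r
# Hypothetical values standing in for the Race column
race <- factor(c("white", "black", "hispanic", "white"))

levels(race)      # "black" "hispanic" "white" (sorted alphabetically)
as.integer(race)  # 3 1 2 3: the integer code stored for each value
table(race)       # counts per level, handy for a quick summary
```

The table() call is one place where factors seem useful even outside of statistics, since it gives a count per category in one line.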
Time/Date in R
In short: it's complicated. How is R supposed to know that 3:00 comes after 2:59? It makes sense that it wouldn't make sense to the software. Base R has strptime(), which parses a vector of strings into date-times. library(lubridate) is the simplest way to deal with this. Helpful to know: ymd_hms() parses strings ordered as year, month, day, hour, minutes, and seconds.
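A short sketch of the difference between times stored as text and times stored as real date-times. The timestamps are invented:

```r
library(lubridate)

# As plain text, times are compared character by character
"10:00" < "9:00"

# Parsed with lubridate, they behave like real points in time
early <- ymd_hms("2018-09-01 02:59:00")
late <- ymd_hms("2018-09-01 03:00:00")
late > early
difftime(late, early, units = "mins")

# The base R route needs the format spelled out
base_time <- strptime("2018-09-01 03:00:00",
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
base_time
```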
addProviderTiles() - Uses the Leaflet Providers plugin
setView() - sets the starting position with a specific zoom level
addLegend() - same as before, but more customizable
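Putting the three functions together in one rough sketch. The tile provider, coordinates, colors, and legend labels are all placeholders I picked, not values from the tutorial:

```r
library(leaflet)

m <- leaflet() %>%
  # third-party basemap from the Leaflet Providers plugin
  addProviderTiles("CartoDB.Positron") %>%
  # starting position and zoom level (placeholder coordinates)
  setView(lng = -98.58, lat = 39.83, zoom = 4) %>%
  # a hand-built legend: colors and labels are supplied directly
  addLegend(
    position = "bottomright",
    colors = c("#1b9e77", "#d95f02"),
    labels = c("Group A", "Group B"),
    title = "Hypothetical legend"
  )

m
```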