This week I continued with the Tufte. I learned about rugplots, and the one in the book made no sense to me, so I looked elsewhere. All of the rugplots I found (besides actual, literal rugs) were more straightforward in demonstrating the marginal distributions of variables along the x- and y-axes.
Something very interesting to me is the concept of stem-and-leaf plots; I never noticed that they construct the distribution of a variable with the numbers themselves. That is so effective…. I also loved reading the shaped poem “Easter Wings” and observing how the length of the lines depicts quantity.
My favorite graph from this chapter was the chart that showed how states once differed in their engineering standards for painting lane stripes on road pavement. I think this would be a fun interactive, visual graph to recreate today.
I also took a stab at Interactive Data Visualization for the Web a bit more, but am still unimpressed. I felt that it didn’t go into much depth in any area, and as a result was hard to apply to my work. I found the D3 online tutorials to be helpful, though.
I took another look at the I’m Not Feeling Well visual essay by Gabriel Gianordoli, as it seemed to be the most similar to my project. I used the simplicity of his graphs as inspiration for mine. He includes trend and seasonal graphs, which have proven to be very difficult areas for me…. something to aspire to, though.
This week I transitioned over from R into D3. I did experiment a bit in adding a trend line in R, shown below:
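For my own notes, here is a minimal sketch of the kind of trend line I experimented with, using base R. The `hits` series below is placeholder data I made up for illustration, not my real Google Trends numbers:

```r
# Placeholder data: a declining monthly series with noise (not my real data).
set.seed(1)
months <- 1:24
hits <- 50 - 0.8 * months + rnorm(24, sd = 3)

fit <- lm(hits ~ months)         # ordinary least-squares trend
plot(months, hits, type = "l")   # draw the raw series
abline(fit, col = "red")         # overlay the fitted trend line

coef(fit)[["months"]]            # slope; negative means a downward trend
```

With real data, the sign and size of that slope coefficient is what I would report alongside the graph.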
I found a great library called Dimple that simplifies D3 into fewer lines of code. It seems like a nice introduction into D3. Here is a link to the library. Here is a simple example of what I was able to produce with the library:
*With missing title and legend* Here is what the graph looks like upon first glance.
Here is the animation that happens when you hover over the point.
Below are the additional graphs I have created so far with the library. I am still working out some bugs with the hover action, but I am pleased with how fast I was able to learn the library.
Here is the basic code I used to create the graph on the right:
I also changed the color scheme and made adjustments to the layout. I think I found a header design:
Made quite a bit of progress this week. And also took several steps backwards. But all in all, a good week of exploration. Website layout updated (read from L to R):
Instead of requesting specific pieces from Google (original request), I instead requested a general search query for “classical music” from 2004-present. This will be the first graph I explore, which will be supported by various statistics and facts from other sources (examples would include ticket sales, orchestra size, concert season length, etc.). My hypothesis is that there will be a statistically significant decrease in the trend. From there, we will explore three of the top household composer names: Bach, Mozart and Beethoven (a nice graphic will go where those ovals are). The graph will explore their overall yearly popularity from 2004-present, and observations/exploration of various spikes in the data will follow. The next section will look at the same data, but more specifically at the daily level. According to my initial Google Trends research, searches for these composers spiked highest (and lowest) during the night, as people search for music to help them wind down for bed and also perhaps when they wake up in the middle of the night.
The last section explores how these same three composers have been programmed as part of orchestral concerts. Key word: explore. I found this great and credible library provided by the NY Philharmonic, found here, which lists every single performance piece since the early 1800s that the orchestra played. My initial exploration (no way does this represent how I would formally present this data):
I chose the two seasons (2004-2005 and 2016-2017) based on the years of complete data I will have from Google for the earlier graphs. I haven’t yet determined any statistical significance, but off the top the only points of interest I have found so far include:
Number of concerts (full orchestra and chamber music): 108 (2004) vs. 136 (2016)
Percentage of Bach, Mozart and Beethoven included in concerts: 30% (2004) vs. 35% (2016)
Within this smaller percentage (2004 vs. 2016):
Beethoven: 46% vs. 46%
Mozart: 17% vs. 34%
Bach: 37% vs. 20%
Percentage of pieces by composers born in 20th or 21st century only: work in progress.
I need to perform analysis on more orchestra seasons, and I was curious about doing one from the 90s next (12 yrs earlier would be 1992). On my to-do list. I’m also analyzing how much “new music” is in each season’s programming (“new music” meaning works by any composer born in the 20th or 21st century).
Also got some good illustrations coming in:
My website is fully coded and responsive, using HTML and CSS grid. Still figuring out how to decompose time series in R, and also reading Interactive Data Visualization for the Web by Scott Murray to learn more about d3.pie.
I arrived at Tufte’s famous duck analogy. I have to say, it is so tempting to initially think of graphical style over the data that’s actually being presented…. I have a great plan for using a music staff to present data someday, or perhaps in the layout of how the instruments are arranged in an orchestra…. but I just have to wait and see what the data tells me someday. A good quote to keep in mind and applies well to graph building: “It is all right to decorate construction but never construct decoration.”
I also had a look at some truly awful 3D graphs, including what Tufte thinks could be “the worst graphic ever to find its way into print”. That made me laugh. Also gotta love the 20th century invention of the vibrant, cross-hatching patterns.
I also read The Pudding’s Where Slang Comes From, and I love the simplicity of this graph-
I think it would be a great display for my monthly/yearly data involving Bach, Beethoven and Mozart. I also learned a lot of new slang. Which got me thinking…. what are people’s takeaways going to be from my project?
This week I made the decision to explore/display my data with R, as opposed to D3, since I 1) want to get more familiar with R and 2) my project’s data requires simple, clean line graph displays. I had started reading Interactive Data Visualization for the Web and worked on the D3 tutorials, but my main focus lately has been decomposing time-series graphs using R.
So far these links have proven to be quite useful:
I understand the theory behind the code, but am still having trouble implementing the ideas in my own code. I am particularly having a difficult time declaring periods:
I need to figure out how to declare my periods… I would have thought it would have been obvious due to my monthly data from 2004-2017, but that is not the case.
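From the documentation, it looks like the period gets declared when the series object is built with `ts()`, via the `frequency` argument, rather than in `decompose()` itself. A sketch with simulated numbers standing in for my monthly 2004-2017 data:

```r
# Monthly data: frequency = 12 declares that one seasonal cycle spans
# twelve observations; start = c(2004, 1) means January 2004.
set.seed(1)
monthly <- ts(rnorm(14 * 12, mean = 50), start = c(2004, 1), frequency = 12)

dec <- decompose(monthly)  # splits into trend, seasonal, and random parts
frequency(monthly)         # 12
```

If the frequency is left at the default of 1, `decompose()` refuses to run, which may be exactly the wall I’ve been hitting.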
I also am close to finishing my basic html/css page for the visual essay; just working on the responsiveness and then it will be finished. I have a friend who is doing the illustrations for me, which I think will really tie the webpage together into something nice. Inspiration I gave her (aka horrible sketch by me):
I’ve settled on a potential layout for the website, which looks at the data more broadly initially before breaking down into yearly, monthly, daily. Currently working on gathering supplementary information and statistics about classical music popularity, orchestra popularity, etc. that I think will pair nicely with what I’m presenting in the graphs.
I continued The Visual Display of Quantitative Information by Edward R. Tufte for my reading this week. I started reading about time-series displays, which is super relevant to what I’ve been researching for my project. Several visuals got me thinking about how they could be “updated,” in a sense, with the technology we have today so that they are more interactive:
This was a great display of air pollution over time. I was surprised at how easy it was to look from graph to graph and keep track of the data. Today, this graph could be drawn as just one landscape, with a slider controlling what time the viewer is observing. This would add an interactive component, but I suppose it also would prevent the viewer from immediately drawing any conclusions since all the data isn’t presented to them at once…. pros and cons.
The end of the chapter had an interesting quote that I’m not sure I agree with- “Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.” I don’t think it’s necessarily about having the highest number of ideas in the least amount of time; it seems to me it should partially revolve around the ability to come up with substantial ideas and conclusions about the data.
The other chapter I read in the book focused on what Tufte calls “Graphical Integrity”. He first talks about John Tukey’s role in the 1960s, and how he made graphs “respectable” by “putting an end to the view that graphics were only for decorating a few numbers”. It was interesting to learn about the changing focus of graphs during the first half of the twentieth century (design vs. number content), and also to get more examples of graphic distortion, especially in examples that use the third dimension as extra depth. One of my favorite word choices of Tufte’s: “Chartjunk heap”. I’m going to steal that. Here are a few more examples of graphs I really liked:
I loved this depiction of the history of the growth of the Italian post office as shown through the number of postal savings books issued. I thought the circular shape was very effective in terms of showing yearly patterns.
A bit comical, but still effective. I also am not sure how mathematically accurate the drawings are, but I like the idea, especially with the people there for scale.
I continued next with the chapter called Sources of Graphical Integrity and Sophistication. It’s the first chapter where I feel Tufte brings up some questionable opinions with not-so-great evidence.
Those who get ahead are those who beautify data, never mind statistical integrity.
Inept graphics flourish because many graphic artists believe that statistics are boring and tedious….and all too often decorated graphics pep up, animate and exaggerate what evidence there is in the data.
As art bureaucracy grows, style replaces content.
These ideas, which Tufte likes to present as fact, overgeneralize the entire industry and were cringe-worthy to read. One point he made that I agree with, though, is that if the stats are boring, then you’ve got the wrong numbers.
I also took time to read the visual essay called Film Dialogue on The Pudding. 2,000 screenplays were broken down by gender and age. They start off by showing the breakdown of Disney movies, which I think was a genius way of drawing people into the story (wow, men have more than 60% of the dialogue). I liked the categories that the essay then branches into- Methodology, and Why we made this.
This week I read Can we talk about the gender pay gap? by Xaquín G.V. in preparation of his presentation on Thursday. I also started reading The Visual Display of Quantitative Information by Edward R. Tufte.
In chapter 1, six graphs are displayed that report the age-adjusted death rate from various types of cancer for all 3,056 counties of the United States. Right away I am able to recognize a few shortcomings of the maps:
Data is attached to geographical regions, and this can be misleading: at first glance, it looks like generally more people die from cancer in the northeast, when in reality this graph just doesn’t take population density into consideration, which is the real reason why more people appear to be dying from cancer in these areas.
Death certificate reports are the sole data source, which is subjective and could lead to differing opinions concerning the origin site of the cancer.
I really enjoyed learning about the history of data maps. An example of early cartographic work was from the eleventh century A.D. in China and exhibited extraordinary attention to detail. The display includes river systems and depictions of the Chinese coast. It is evidence of how much more advanced Chinese geography was for its time compared to Western European technique, which did not produce similar results until around 1550.
Fun fact I picked up: the first economic time-series plot was not created until 1786; measured quantities just didn’t exist in map design prior to that time. There was an early tenth or eleventh century illustration of the paths of the planetary orbits that attempts some kind of graphical means of displaying changing values, but it is difficult to extract concrete meaning from.
Many of these visuals from centuries ago portray qualities that data visualization designers still put into practice. Tufte says about one map, “Notice how quickly and naturally our attention has been directed toward exploring the substantive content of the data rather than toward questions of methodology and technique.” This observation stands firmly also for Dr. John Snow’s famous 1854 dot map. Deaths from cholera in central London were marked on a map with dots, while local water pumps were marked with a cross. The findings were that death by cholera occurred highest near the Broad Street water pump, and this illustration helped put an end to the neighborhood epidemic which had taken more than 500 lives. People of various professions and backgrounds were able to look at a map such as this and draw similar conclusions from its data.
Charles Joseph Minard’s 1864 world map portraying the exports of French wine shows the integration of including quantity. Not only does his illustration show the journey of the French wine on an international scale, but it shows general quantity exported to different locations. Perhaps not possible at the time, but it would have been interesting to see this map include the vineyard origins of the wine; instead of illustrating the wine with a general origin of France, depict the actual vineyard of origin for the wine exports and see how the different vineyards get internationally distributed.
Tufte shows several graphs and accompanying articles that explain the same information depicted in the graphs. In one particular example, it took a 700-word article to convey the same material as the graph.
I absolutely loved Xaquín‘s pay gap for every occupation visualization: it literally took into consideration the pay gap, which was fascinating to look at on an illustration. I learned a lot from the structure of the article as well, as it not only helped clarify the visualizations that were included, but it also expanded into additional studies and information with seamless transitions between everything.
This week I am finishing The Truthful Art and I also read a few visual essays by Matt Daniels in anticipation of his class presentation.
I specifically looked at Newspapers: A Black and White Issue, which displays the racial diversity of the American newsroom. I am trying to read as many visual essays as I can as I get ready to launch my own. With a background also in journalism, I think that visual essays combine the two areas that I currently find of interest- one old (journalism) and one new (visualization). In this article, I paid close attention to the organization of the essay. I assume that the first visualization, of the diversity of the newsrooms that have a staff higher than 25 people, is the main visual. I like that the essay then goes more in-depth with analyzing that display before naturally moving on to exploring other questions and related data. Which leads me to my question- were the later visualizations inspired by the findings in the initial visual, or were they always part of the layout for this essay?
In addition to reading visual essays I continued in The Truthful Art with ch.11 + 12. I got a nice refresher on confidence intervals, distributions, z-scores and standard error. After not taking a math course in five years, it’s really nice going back to numbers and mathematical calculations. I bet if I took calculus, stats, etc. this year instead of earlier, I would have a brand new appreciation and motivation to dive into it all! Anyway…
The best advice you’ll ever get: ASK.
I’m already developing this in my own analysis of charts as I ask where the data came from, is it reliable, is there a source, what is the purpose, etc. I am also putting this into play by reaching out to various organizations for data concerning my class project. I was super interested in a visual representing gender makeup in the top symphonic orchestras in the US. If I wanted data just from this current season, I could literally go onto the websites of each individual orchestra and look at the roster for each instrument section. However, using this method I do not have access to older information. The League of American Orchestras DOES have this information though. 800 orchestras belong to this organization and submit data on everything from finances and fundraising to gender and race. The league occasionally releases reports that analyze various aspects of the orchestras, but what they release is selective, and I had a hard time finding much released research having to do with the gender makeup data (which I know exists). This organization would be the perfect way to access data from the last ten or twenty years, right? I reached out and explained my purpose and was told that only society members have access to the data. UGH! It would have been so interesting- definitely an exploration for another time when I make a direct connection to orchestra administrators someday. Anyway….
Something that was eye-opening for me in ch.12 was the misconception about how to understand visualizations; it is important to recognize that visualization, as Alberto Cairo says, “is not meant to just be seen… but to be read, like written text.” It also seems that there is an art to writing the explanatory captions that accompany charts.
Comments on specific Visuals:
I loved the visualization “Workers’ Comp Benefits: How Much is a Limb Worth?” It’s something people think about, but we never see hard statistics of it. Seeing this graph might also make a worker be more careful on the job (such as my cousin, a carpenter who has lost three fingers at this point).
The watercolor-esque map of the 2014 midterm elections was especially disorienting for me since it utilized a bivariate scale. After taking time to analyze it, however, using colors to represent two variables was extremely effective.
I really enjoyed Kim Albrecht’s Culturegraphy project. I showed this to a few movie lover friends of mine and they got a kick out of it. The display for this chart is perfect for the content. It makes me wonder what other displays could have worked for this data.
I continued ch.9+10 in The Truthful Art and read a post by Accurat. I got a huge kick out of the Spurious Correlations page of Tyler Vigen’s site, especially the graph depicting how US spending correlates with suicides by hanging. I have a group of people that I’ve been sending tidbits of info from this class, and this page definitely made the list. (One of my friends just got back to me and said “I never thought I’d want to take a class having to do with graphs, but you are making a convincing case with everything you are sending”)
But it brings up a point that seems obvious in hindsight but I feel has never been outright addressed in any of my classes until now- there are lurking variables that we are overlooking. The influence that different variables have on each other is complicated.
We usually talk correlation vs. causation, but ch. 9 brought up an earlier stage to consider- association vs. correlation. Two variables are related when changes in one are accompanied by variations in the other. We start talking about correlation when the relationship between two variables is linear.
The NY Times chart by Hannah Fairfield and Graham Roberts was a stunning visual in its simplicity. For me, it was a great reminder that less can indeed be more if you just let the graph (in this case, a scatterplot) do what it was made to do. The Death of a Terrorist graphic was also stunning- it shows me how graphs can effectively display emotions, which is not something I would traditionally pair together in the same sentence. I love that the NYTimes decided to chart reader response in this manner as opposed to just summing up everything in a few sentences of text. Visuals like these speak for themselves and create such a lasting impression on the viewer.
Ch. 10 provided me with a great list of projection suggestions and brought to light that the Mercator is typically the default projection that online tools use. When picking maps, don’t think “good” vs. “bad” but instead “appropriate” and “inappropriate”.
I also really enjoyed playing around with Jason Davies’ Voronoi map. I wasn’t expecting to also be able to zoom in on the dots, which was a nice surprise.
In anticipation of our session with Giorgia Lupi coming up soon, I decided to check out several Accurat projects, most specifically their “Data ITEMS: A Fashion Landscape” MoMA project from 2017. While supplementary to the museum’s exhibit on the most influential clothing and accessories of the 20th and 21st centuries, the visualization took on its own investigation into the qualitative and quantitative characteristics of these items of clothing and their place in evolving society. All 111 items were represented, and the visual team showcased 8 specific items within the larger collection. I wonder what would have happened if the visualization research had happened at the same time as the original exhibition research- would the visualization team have influenced the overall content of the original collection or even the direction of the exhibition itself? For instance, would the exhibit have highlighted those same 8 pieces that the visual did? I admire the more artistic approach of this visual- something to ask Giorgia: if you have an idea that seems a little “out there”, are there any particular strategies that you implement in your pitch to get other people onboard?
I found myself rereading this quote from the end of the article several times and wanted to store it here for my own sake-
“It has been an exercise in exploring and depicting the richness and depth of soft data. Many of our most recent projects try to explore how we can adopt a broader definition of data, encompassing also the more subjective, intimate, personal and therefore imperfect aspects of the information systems that we live in. When working with data, so often we tend to focus only on the hard numbers that are readily available to us, without realizing they can actually become much more meaningful if we are able to unearth a more nuanced and expressive type of data along with them. Enriching hard data by combining it with an additional layer of “softer” and more qualitative information is in many situations a successful strategy to provide a very much needed context, that helps decoding and interpreting the most complex and multifaceted scenarios.”
This week I read the conclusion to How Charts Lie and continued ch. 7+8 in The Truthful Art. I really enjoyed learning about Florence Nightingale and her mortality rates chart. It was useful to see the same data in the stacked bar graph because it helped me better understand the data in “the Wedges”. I’m curious as to why the later time period 1855-1856 is on the left, instead of the right- I would have thought the time would have flowed logically from left to right. Her work is eye-catching and unusual, but effectively presents information.
A nice little refresher on a few important principles:
“For a chart to be trustworthy, it needs to be based on reliable data.”
“A chart can be a visual argument but it’s rarely sufficient on its own.”
“Data and charts can save lives and change minds.”
Florence’s chart helps push conversation forward and turn words into actions. Besides just answering questions, charts should promote a curiosity. She also reminds us that the purpose behind using a chart is imperative to keep in mind.
I got a nice refresher on statistics in The Truthful Art this week. I also enjoyed learning more about frequency charts, which I feel like I haven’t seen often in the past or I’ve just passed over without giving them much thought….
As I’ve learned more about charts in the past few weeks, the ones I find the most interesting and meaningful to me are the ones that are more interactive, such as the interactive visualization about student performance in the largest cities of Ukraine. Speaking of all this learning about charts- it sure is nice seeing so many kinds of charts in real-life scenarios; I think back to math classes -especially stat class- and we learned about certain charts in theory, such as box-and-whisker, but did not see them applied in relevant or interesting ways. Or maybe that was just my teacher.
I cannot believe that the national public television channel in Spain showed such a flawed chart! Right away I figured that it had something to do with tourism jobs and the ebb and flow of tourists in Spain- February to August was too short of a duration to properly display an unemployment drop of some sort.
Mix Effects: “the fact that aggregate numbers can be affected by changes in the relative size of the subpopulations as well as the relative values within those subpopulations.” -Zan Armstrong and Martin Wattenberg. We had been discussing this topic for several chapters now, and it was nice to put a name to it.
I was super intrigued by the horizon chart…. loved the simplicity and effectiveness it had. This chart seems most appropriate for more broad generalizations.
R for Journalists ch. 5: Spatial analysis
I took a minute to explore suggested works that utilized spatial analysis and really enjoyed The Geographic Divide of Oscar Films.
A problem that has come up multiple times in the past was an error involving %>%. RStudio was not recognizing this as a valid command, so I Stack Overflowed it here: https://stackoverflow.com/questions/30248583/error-could-not-find-function/30248632 and learned that it comes from the magrittr package.
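A tiny example of the pipe once magrittr (or a package that re-exports it, like dplyr) is loaded:

```r
library(magrittr)

# x %>% f() is just f(x), so a pipeline reads left to right:
c(4, 9, 16) %>% sqrt() %>% sum()  # same as sum(sqrt(c(4, 9, 16))), i.e. 9
```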
How to upload html file of your map into web page:
I also learned about the various third party tiles that are available through addProviderTiles() at this link: http://leaflet-extras.github.io/leaflet-providers/preview/index.html
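Putting those two pieces together, here is a sketch of swapping in a third-party tile set and saving the map as an HTML file for embedding; the coordinates and provider name are just examples I picked, not from my project:

```r
library(leaflet)      # leaflet re-exports the %>% pipe
library(htmlwidgets)

m <- leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%   # third-party tiles
  setView(lng = -73.9832, lat = 40.7725, zoom = 13)  # example: Lincoln Center

# saveWidget() writes the map to an HTML file that can be embedded in a
# web page with an <iframe>. (selfcontained = TRUE requires pandoc.)
saveWidget(m, "map.html", selfcontained = FALSE)
```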
This week I read ch. 3+4 in The Truthful Art by Alberto Cairo, and then I continued with Ch.5+6 in How Charts Lie by Cairo. The chapters in The Truthful Art this week really emphasize the reality of a visualization being a model. Our brains process the world around us in a similar way: perception forms a model that results from the mediation between our brain and the world, and it has a certain degree of accuracy. However, it should be noted that models do not oversimplify information; instead, they add clarity, which usually means that more information is brought to light. This also means that there is still hidden information.
A point that has been brought up several times in the reading is the idea that more often than not a faulty model is the result of a well-intentioned designer not paying proper attention to the data. I am guilty of assuming that faulty data is always intentional. I think that it is appropriate to analyze data visualizations with a degree of caution, but it is important to not overanalyze the intentions beyond the data visualization.
What you design is never exactly what your audience ends up interpreting.
^As a music major, this is obvious. When you perform in front of a crowd, you are presenting them with your interpretation of a piece of music, and, beyond that, you have no control over their reaction. I have a difficult time applying this to other areas, such as visualizations and graphic design; I want people to see the great vision that I see because in my head I selfishly think it's the b e s t. With music, it's widely accepted that performers will have different interpretations and one is not necessarily better than the other (Unless maybe your Baroque performance is super wacky and out of character, let's be honest). I find it interesting that this same view hasn't transferred over to visualizations and graphic design yet. I'm still able to take criticism for both my music interpretations and my visuals, but the underlying problem is that I still don't see my digital design work as an interpretation. I'm working on it! I digress...
It's important to consider that human survival is built on instinct, not on discovering any kind of truth. We make a series of quick and intuitive decisions that sometimes fill in information that is not present, and we like cause-and-effect explanations. While these judgements and intuitions are an important part of reasoning, there is more to it.
I completely agree with Mark Monmonier's Skills of the Educated Person. This involves four characteristics that any educated person should have in their skillset: Literacy, Articulacy, Numeracy, and Graphicacy. I think more people could get acquainted with the latter two (including myself).
I also learned more about what consists of a good conjecture: it is made of several components, and these need to be hard to change without making the whole conjecture useless. A flexible conjecture would not do well. A good conjecture typically consists of components that are naturally connected.
I never realized how problematic it could be to read storm forecast prediction maps, from reading the "cone of death" to misinterpreting the upper dotted portion of the cone as rain fall. This was my personal favorite, and I'm guilty of imagining this before self-correcting:
How NOT to read this map.
Picture taken from How Charts Lie.
The cone is an example of this key principle: the success of a chart depends on who designs it and also who reads it; the original purpose of this map was for experts like risk managers and weather forecasters.
Furthermore, it's especially good to view a chart with caution if it's a genre you are interested in; everything is just going to look better if you like that particular topic. It's important to put your bias aside as much as you can to see what data has been presented.
One last tidbit- if your purpose is to show relationships between countries and regions, then the data should be displayed on charts at the country or regional level. If the purpose is to show relationships between individuals, the chart should compare the people represented within each country or region to each other.
I am currently working on R for Journalists Ch.4 this week.
Wow! Andrew is wearing a suit!
This tutorial finally gets into visualizing data with ggplot. It discusses how R can be used as a hands-on way to test out different visualizations for your specific data set.
I keep seeing stacked charts popping up. I find myself constantly looking at why the figures are stacked the way they are, and, like in the example in this lesson, there isn't any rhyme or reason. I think a pie chart or a bar graph can achieve the same purpose most of the time. To be continued....
This is a very nice description of each component in the ggplot function:
The illustration went on to include the expand_limits() function, which forces the x- and y-axes to start at 0:
With the expand_limits() parameter.
Without the expand_limits() parameter. X and Y start at 20 instead of 0.
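Reconstructing that comparison for my notes: the call just gets added onto the plot. The `df` data frame below is a made-up stand-in for the tutorial’s dataset:

```r
library(ggplot2)

# Stand-in data: values that start well above zero.
df <- data.frame(actor = LETTERS[1:5], age = c(25, 32, 28, 41, 37))

p      <- ggplot(df, aes(actor, age)) + geom_point()  # axes hug the data
p_zero <- p + expand_limits(y = 0)                    # force y to include 0
```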
Back to stacked charts:
ggplot(data=ages, aes(x=actor, fill=Genre)) + geom_bar()
I discovered that putting a category in the "fill" aesthetic results in the creation of a stacked chart.
ggplot(data=ages, aes(x=actor, fill=Genre)) + geom_bar(position="fill")
^this changes the Y-axis to be percentage-based.
I'm still not a fan of the stacked chart, but at least the fixed proportions of the spinogram make comparisons between the different actors and their genres easier.
In general, it seems like a better idea to manipulate the visual of data as opposed to messing with the data itself. I enjoyed exploring within the geom_density() options the most. Alpha is a great option to show overlap between different categories effectively and it is also aesthetically pleasing:
What's nice about alpha in this situation is that anything lower than .5 does not visually interfere with the individual paths of the actors.
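Since the tutorial's ages data isn't reproduced here, this is a small sketch with invented data illustrating the alpha setting in geom_density():

```r
library(ggplot2)

# Toy data standing in for the tutorial's ages data set (column names assumed)
ages <- data.frame(
  actor = rep(c("Actor A", "Actor B"), each = 100),
  age   = c(rnorm(100, mean = 30, sd = 5), rnorm(100, mean = 45, sd = 8))
)

# alpha below .5 keeps each actor's density curve visible through the overlap
p <- ggplot(ages, aes(x = age, fill = actor)) +
  geom_density(alpha = 0.4)
```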
Overall this tutorial complemented the first ggplot tutorial very well. This tutorial was a nice refresher on the basics before expanding more on visualization types and also how to add multiple variables as aesthetic types. I also got more clarification and additional examples about facets and layers.
Another important thing to remember: once data is structured correctly, then you can use ggplot2 to slice, group, and facet the visual appearance of the data.
To reorder chart labels, we transform the data via the forcats package, which is part of tidyverse, using the function fct_reorder(factor to reorder, variable to reorder by, .fun=function to reorder by, ..., .desc=FALSE or TRUE).
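A minimal sketch of fct_reorder() with made-up data (the column names here are invented, not the tutorial's):

```r
library(forcats)

df <- data.frame(
  actor   = c("A", "B", "C"),
  avg_age = c(52, 34, 41)
)

# Reorder the actor factor by average age, largest first
df$actor <- fct_reorder(df$actor, df$avg_age, .desc = TRUE)

levels(df$actor)  # "A" "C" "B"
```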
I got quite a kick out of the Mr. Rogers graphic. Good Instagram story material.
The "Time is a Flat Circle" installment by John Muyskens had super interesting code to look at for inspiration. However, I do not see how the circular design adds to the purpose of the visualization in any way; the data doesn't come "full circle" or anything like that:
This week I read ch. 5+6 in The Truthful Art by Alberto Cairo, and then I continued with Ch.3+4 in How Charts Lie by Cairo. It's nice knowing that "how will I know if I chose the right graphic form to represent my data?" is on the mind of even the most experienced professionals. The more I learn about graphic visualization, the more I seem to feel overwhelmed by the various choices that I have. However, reminding myself of our natural ability to identify visual patterns and outliers helps me trust my instincts. Here are some tips from Cairo:
Think about the message you wish to convey; plot what you need to plot.
Try different graphic forms.
Arrange the graphic's components so that it is simple to extract meaning.
Test the outcome yourself and with others.
Also, I appreciated the Richard H. Thaler references.... my dad just left Misbehaving on my stack of school books to read this semester.
I enjoyed exploring Severino Ribecca's Data Visualization Catalogue online. It is very user-friendly, has a nice interface and designers of any level can find its descriptions useful. I like that you can search by function or by list view, but ultimately the most useful visual of the essentials was what was derived from Cleveland and McGill's hierarchy; I like the idea of grouping the tasks based on enabling specific or more general estimates since it goes along with tip #3 from above. However, I do not think the most successful chart necessarily is defined by being at the top of the hierarchy; again, it depends on what data is being presented and what we want to stand out to viewers.
Figure 5.7, the diagram of European asylum seeker application decisions, was one of the most complex diagrams I've looked at. I understood the concept by reading the title and labels, but the first and last columns involving the origin country confused me. There was too much information overlapping in a single visual for me to decode at once; I would have preferred multiple visuals, each with more clarity.
I also got a good review of our friends mode, median and mean. Resistant statistics, histograms and the weighted mean were topics I had completely forgotten about from Stat class, and I appreciated the refresher. I learned that the skew of a distribution in a histogram is a good place to begin when exploring other possibilities with the data. Speaking of exploration, it's important to note that we should not rely on one statistic, chart, or map when pursuing exploratory work.
"Garbage in, garbage out". Faulty data leads to faulty visuals. The first thing to look for when reading a chart is what sources the author has identified. No credited source? Red flag. Furthermore, make sure to identify what is being counted and how it's being counted. I feel like I've already developed a more critical eye for reading charts and checking their sources, and I also know to be mindful about what charts or graphs I'm sharing online with other people.
It was interesting to read about the real reason why porn consumption per capita (provided by PornHub) in Kansas was so much higher - a glitch in the data. If people are using a VPN and location cannot be determined, geographical assignment automatically places them at the center of the contiguous United States, which is Kansas.
I also couldn't believe the similarities between the 2017 chain migration chart and the 1930s chart depicting the result of allowing the "inferior race" to multiply.
An important point to remember: any chart is a simplification of reality.
R for Journalists Ch.2+3
IMPORTING CSV FILES
Something nice about R is that no package is required for importing CSV files. The readr package can still be used to assist, though. You can read the data from a URL or a local file.
Base R function to import CSV: read.csv("url", stringsAsFactors=F)
The stringsAsFactors=FALSE argument is needed because character columns are read in as factors by default, but usually need to be treated as strings instead.
The plus side of the readr package is that it treats characters as strings, not factors, by default.
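To see the difference without fetching anything from a URL, here's a self-contained sketch that writes a tiny CSV to a temp file and reads it back both ways:

```r
library(readr)

# Write a tiny CSV to a temp file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ana,90", "Ben,85"), tmp)

# Base R: without stringsAsFactors = FALSE, character columns
# could come in as factors (the default before R 4.0)
df_base <- read.csv(tmp, stringsAsFactors = FALSE)

# readr: characters stay characters by default, no extra argument
df_readr <- read_csv(tmp)
```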
When you are done manipulating the data, save your dataframe as a CSV file with write_csv() from the readr package:
Example: write_csv(df_csv, "transformed_data.csv")
Most likely the exported CSV file will contain NAs, so we need to replace these with blanks, which can be accomplished with the code found below:
Example: write_csv(df_csv, "data/transformed_data.csv", na="") oreplaces NAs as blanks
When importing Excel files, the readxl package needs to be installed in R. Also, the Excel sheet needs to be downloaded locally before it can be imported into R; it can't just be accessed from the online server.
Example: drug_deaths <- read_excel("local source here", sheet = 1, skip = 2)
^It's important to always specify what sheet number you are working with, and, if skipping rows, the number of rows to skip.
There are two ways to get rid of NAs (missing data): subset() is a base R function, and the alternative is filter(), which comes from the dplyr library.
Base R example: drug_deaths <- subset(drug_deaths, !is.na(Year))
dplyr example: drug_deaths <- filter(drug_deaths, !is.na(Year))
CSV > EXCEL
When importing data, CSV is preferred to Excel: it's simpler and more compatible. The trade-off is that a CSV file can hold only one sheet at a time and carries no formatting.
My preliminary ponderings on delimited pipe files were the following:
"read_tsv() vs read_delim()? What the hell is a delimited pipe file"
Upon further research, I found a very good explanation:
"A pipe delimited text file has this character “|” as a separation character between data fields.
John|Smith|100 n main street, apt 1|555–555–5555|City, State|
Do you see why pipe may be better than a comma when separating fields? It’s a character you have to enter in the other character option when importing data into Excel or Access. It is a less commonly used punctuation mark so fairly safe to use as a separator character." -Julie Frey from Quora
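A self-contained sketch of reading a pipe-delimited file with readr (the file contents here are invented):

```r
library(readr)

# A pipe-delimited file, written to a temp path for the example
tmp <- tempfile(fileext = ".txt")
writeLines(c("first|last|city", "John|Smith|Springfield"), tmp)

# read_delim() takes the separator explicitly; "|" handles pipe files
people <- read_delim(tmp, delim = "|")
```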
Suuuper straightforward to use JSON data: you just grab the data from online and then use the jsonlite library.
1. Grab URL for data:
2. Save URL data in a variable:
stations <- fromJSON(json_url)
3. Viewing. Notice the difference that a data frame makes with the data appearance below:
Imported data, as-is.
Imported data, with data frame.
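Since the station feed URL isn't reproduced here, this sketch feeds fromJSON() a small JSON string instead; an array of objects comes back as a data frame by default:

```r
library(jsonlite)

# A tiny JSON string stands in for the station feed used in the tutorial
json_text <- '[{"name":"Station 1","bikes":5},{"name":"Station 2","bikes":0}]'

# fromJSON() simplifies an array of objects into a data frame
stations <- fromJSON(json_text)
```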
SPSS stands for Statistical Package for the Social Sciences and is owned by IBM. It’s also very expensive and usually only large businesses or organizations own licenses. It provides a graphical interface useful for even deeper analysis.
Copying and pasting data: tibble is more forgiving when bringing in information with odd characters. Use it.
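On the copy-and-paste theme, one handy related trick (not from the tutorial itself) is tibble::tribble(), which lets you paste a small table straight into code, odd characters and all:

```r
library(tibble)

# tribble() reads a small table typed or pasted row by row
films <- tribble(
  ~title,    ~year,
  "Amélie",   2001,   # non-ASCII characters come through fine
  "Léon",     1994
)
```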
https://github.com/andrewbtran/muckrakr = super helpful for combining spreadsheets from the same folder.
str_sub(strings, start, end) extracts and replaces substrings.
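A quick sketch of both uses of str_sub(), extraction and replacement (the date string is just an example value):

```r
library(stringr)

x <- "2019-03-15"

# Extract a substring by position
str_sub(x, 1, 4)         # "2019"

# Replace a substring in place
str_sub(x, 1, 4) <- "2020"
x                        # "2020-03-15"
```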
The only problem I encountered was toward the end of Ch. 3, when R notified me that the package "glue" was not available (as a binary package for R version 3.5.1) while working through the dates section of the tutorial. I assume there was an update and the library wasn't compatible with it yet?
This week I read the intro and chapters 1 + 2 in The Truthful Art by Alberto Cairo. A big takeaway for me was the idea that infographics and data visualizations should exist to inform people, not merely to entertain them or sell them anything. As Cairo says, the main goal should be "to increase society's collective knowledge". We see so many pointless pictures, stories, posts, statuses, etc. that don't move society forward in any way. What if more people kept this thought in mind while they put themselves on the internet? (Not that this would ever happen, but can you imagine what that would be like?) There were several basic definitions that I was glad to review, such as deceit not necessarily being conscious. Even the super basic terms of this course- I wouldn't have been able to put a very specific definition to them. I'll keep them here as a reminder to myself, in Cairo's words:
Visualization: any kind of visual representation of information designed to enable communication, analysis, discovery, exploration, etc.
Chart: Display in which data are encoded with symbols that have different shapes, colors, or proportions.
Infographic: Multi-section visual representation of information intended to communicate one or more specific messages.
Data Visualization: Display of data designed to enable analysis, explorations, and discovery.
A big realization for me was that data visualizations aren't supposed to inform viewers of predetermined conclusions; viewers are left to come to various conclusions on their own. Another term I had never heard of before was a News Application, in which data can be uniquely customized to each individual. A good example of this was the "Health Care Explorer". I was most interested in the multimedia elements of the "Beyond the Border" project. I liked the interactiveness of the project, and the different layers really helped the viewer understand the information being presented to them. The design is clean, simple, and does not distract.
When reading about successful visualizations, I was especially drawn to the section that discusses beauty. Cairo mentions this famous quote by Roger Scruton: "Art moves us because it is beautiful, and it is beautiful because it means something. It can be meaningful without being beautiful; but to be beautiful it must be meaningful." As an undergrad music major, I also think about the application of the statement to the performance of classical music. In both visualizations and music performances, the beauty of the product is created by a widespread audience experiencing it as beautiful.
I also read the intro and chapters 1 + 2 in How Charts Lie by Alberto Cairo. Of course, for me and I'm sure a lot of other people, the first thing that came to mind when reading the title of this draft was politics. A scary reality that Cairo addresses is that as ideological polarization increases, so does the divide in trusting or mistrusting information presented in charts.
I think it was important to point out the flaw in Paul Krugman's chart about the US murder rate- why only post through 2014? Cairo points out to "never attribute to malice what could be more easily explained by absent-mindedness, rashness or sloppiness". With a track record like Krugman's, it seems unusual that this was overlooked. The debate then turns into what information is considered relevant. After learning of the actual percentages that were being compared in the tax rate chart produced by Fox News in 2012, I just can't imagine being okay with creating such a gross exaggeration as this. It makes me wonder what the employer's specific instructions were for creating this graph... However, in a way, this is an example of what one person considered to be relevant data.
In general, a big takeaway for me was realizing that all charts can be misleading, even the best ones. I really liked the tricks that were listed to assist in dissecting a chart: looking at scale labels, dispersion of data, thinking about the chart having imaginary quadrants, and drawing a line to mark the apparent relationship between x and y.
I had never heard of a treemap before, and I find them extremely disorienting. I understand that each rectangle's area is proportional to the value it represents, but why is each specific rectangle placed where it is in the chart? Why isn't Mexico below the US, and Brazil below that? It seems that a pie chart is an easier and more familiar way to understand the same information.
"A picture is worth a thousand words"... if you know how to read it and interpret it!
^Love that addition. It's especially relevant with how much people skim content these days.
I also really appreciated the glimpse at William Playfair's Atlas. What is easily recognizable today as a line chart was the first of its kind in the year 1785. As somebody who is interested in the study of ancient music, I am used to studying oral traditions that appeared far earlier than the written word. I am curious about verbal descriptions that must have happened prior to Playfair that eventually evolved into the visual aspect of charts, and also how the system of the number chart evolved and influenced this.
Here is the basic scatterplot for ggplot(data=mpg). geom_point() is the function that adds points to the graph to create the scatterplot.
I spent time playing around with the color aesthetic. This graph is the product of:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Notice how the color is listed outside of the aesthetic values, resulting in the color being applied universally.
Compare this to Figure 2, where the color value is placed inside the aes() parentheses, which maps color to another variable in the chart:
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = drv))
One of the exercises asked for the advantages and disadvantages of a subplot. On one hand, it seems easier for viewers to understand various layers of data. However, if a data set were quite large, many subplots would likely overwhelm the viewer.
I also learned about geoms and how multiple geoms can be used to represent data in a plot. They are represented by geometrical objects, and the data is typically grouped automatically for discrete variables.
Mappings can be set globally or per layer within a geom function, which allows different aesthetics in different layers. The first set of code passes the mapping to ggplot(), so both geoms inherit it; the second repeats the same mapping in each geom:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(x = displ, y = hwy))
I then tinkered with coord_polar(). I was a little confused about why x = factor(1) had to have either a length of one or the number of values in the chart. From what I can tell, factor(1) creates a single dummy category so that all the bars stack into one column, which coord_polar(theta = "y") then wraps into a pie. Still researching; I'm sure it will make more sense the more I learn:
ggplot(data = mpg, mapping = aes(x = factor(1), fill = class)) + geom_bar(width = 1) + coord_polar(theta = "y")
Another area I need more clarification on is stats. I understand stat_summary(), but I do not understand why geoms and stats are generally interchangeable, or when you would override a default stat or plot transformed variables. Will research this later in the week.
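From what I can tell so far, geoms and stats are interchangeable because every geom has a default stat and every stat has a default geom. A sketch using the mpg data set that ships with ggplot2:

```r
library(ggplot2)

# geom_bar() uses stat = "count" by default...
p1 <- ggplot(mpg, aes(x = class)) + geom_bar()

# ...so calling the stat directly produces the same chart,
# because stat_count() defaults to geom = "bar"
p2 <- ggplot(mpg, aes(x = class)) + stat_count()
```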
More discoveries: Concatenate
Charts can be created in R with 2 lines of code:
x <- rnorm(100)  # assigns 100 random numbers to x
plot(x)  # basic function to display x
The Race column below is recognized as a factor variable with three levels: black, hispanic and white. Factor variables are categorical; under the hood they store integer codes with string labels. I think they are mostly used for statistics, so I'm not sure how much journalists would actually use factor variables.
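A quick base-R sketch of how a factor stores categories (the values here are invented, not the tutorial's actual column):

```r
# A character vector becomes a factor with three levels
race <- factor(c("black", "hispanic", "white", "hispanic", "black"))

levels(race)       # "black" "hispanic" "white"
as.integer(race)   # the underlying integer codes: 1 2 3 2 1
```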
Time/Date in R
In short: it's complicated. How is R supposed to know that 3:00 comes after 2:59? It makes sense that it wouldn't make sense to the software. You can use strptime() on a vector of strings, but library(lubridate) is the simplest way to deal with this. Helpful to know: ymd_hms() converts year, month, day and hour, minute, second.
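A small sketch of the lubridate fix for the 2:59 vs. 3:00 problem described above:

```r
library(lubridate)

# ymd_hms() parses "year-month-day hour:minute:second" strings into date-times
t1 <- ymd_hms("2019-02-28 02:59:00")
t2 <- ymd_hms("2019-02-28 03:00:00")

# As date-times, R knows 3:00 is exactly one minute after 2:59
difftime(t2, t1, units = "mins")
```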