Visualizing data breaches between 2004 and 2017
Looking for a user-friendly visualization on data breaches
I created this visualization of data breaches (image left). The idea was to show the biggest data breaches (more than 30K records) in an intuitive, user-friendly way. It covers public and private organizations between 2004 and 2017 and focuses on the method of the leak, the sector, and the year.
Each incident also includes a hyperlink to a source of information that explains it (official announcements, news and press releases, specialized blogs, etc.).
The visualization is based on a data set that I downloaded at the end of 2017 from informationIsBeautiful.net. My main goal was to visually improve some elements (such as comparisons) that were not easy to see in the original visualization (image below), following visualization best practices and principles. In other words, the goal was a more user-friendly visual answer to the main question: What quantities of records were compromised by important data breaches, in organizations and sectors, between 2004 and 2017, and what was the reason?
Preliminary data set improvements
While preparing the data for the visualization, I saw that the original data set first needed some improvements. I made the ones that the visualization required. Here are some examples:
- Levels of variables. I am especially interested in Government data because of my professional background. In the original data set (end of 2017), some levels of this variable were mixed: for instance, “Government”, “Military” and, at the same time, “Government/Military”. I adapted some of them to get more standardized categories, taking into account the original information source.
- Inconsistent values. Some values were inconsistent. For instance, Wendy’s Restaurant had a value of 1025 in the field “Records Lost”. That number is actually the number of restaurants affected; the number of records affected in this case is currently unknown.
- Broken links. Some sources of information were unavailable because of broken links. I fixed them and included links to sources that explain the specific incident in more detail.
For other processes, such as statistical analysis or data mining, the data set would still require more in-depth pre-processing work and a greater investment of time.
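To make the first two clean-up steps above more concrete, here is a minimal Python sketch. It is not the procedure I actually ran. The rows, the field names (organization, sector, records_lost) and the category mapping are illustrative assumptions. Only the three sector labels and the Wendy's value of 1025 come from the original data set.

```python
# Illustrative rows only: field names, organizations and sectors are
# assumptions, except the sector labels and the Wendy's value of 1025.
rows = [
    {"organization": "Example Agency", "sector": "Government/Military", "records_lost": 50000},
    {"organization": "Example Unit", "sector": "Military", "records_lost": 75000},
    {"organization": "Wendy's", "sector": "Retail", "records_lost": 1025},
]

# Assumed mapping towards one standardized category. The real decision
# should be taken case by case, checking the original information source.
SECTOR_MAP = {
    "Government": "Government",
    "Military": "Government",
    "Government/Military": "Government",
}

# Organizations whose "Records Lost" value is known not to be a record count.
UNKNOWN_RECORD_COUNTS = {"Wendy's"}


def clean_row(row):
    """Return a cleaned copy of one incident row (the input is not modified)."""
    cleaned = dict(row)
    sector = cleaned["sector"].strip()
    # Labels that are not in the mapping are kept unchanged.
    cleaned["sector"] = SECTOR_MAP.get(sector, sector)
    if cleaned["organization"] in UNKNOWN_RECORD_COUNTS:
        # Mark the value as unknown instead of keeping a misleading number.
        cleaned["records_lost"] = None
    return cleaned


cleaned_rows = [clean_row(row) for row in rows]
```

Using None for an unknown count keeps the incident in the data set without letting the number of restaurants be read as a number of records.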
Visualization building process
I followed several stages to build the visualization. The goal was to present the information in the original data set more clearly. As I said, the first step was to analyze the data and prepare it as source data for the visualization.
Then I used Tableau to extract the information and build each sheet separately, aiming for a simpler and more intuitive dashboard. The high number of labels and descriptions required some decisions in order to respect data visualization principles: first a clear overview, and then filtering by clicking on one or more elements. I avoided showing every label in the first view; instead, labels appear in a pop-up description when the mouse is over the specific area.
The visualization is very simple. If you want to explore it, you can access the English version here (a Spanish version is also available). I suggest first selecting the variables or categories you are interested in, in order to filter and compare the data. Then, by clicking on a specific organization, you will get a short explanation at the bottom of the dashboard. If you want to know more, click on the short explanation and a pop-up description will offer a link to a source where you can read more about it.
After updating it to include the Aadhaar case, following this source (which calls it unauthorized access), there are important differences between 2017 and previous years. The numbers are enormous, especially the number of records affected in 2017. In total, more than 8,000,000,000 records were affected between 2004 and 2017. Hacking accounts for an important share, but other causes also matter, and they call for proper personal data protection measures. Some well-known organizations have had repeated incidents over the years (some of them because of poor security).
The data used in this visualization cover 269 incidents, all of them data breaches of more than 30K records (you can download the datasets here in Kaggle). If we also included breaches of fewer than 30K records, the total number of incidents would be approximately 7,900, and the number of records affected would increase to approximately 11,000,000,000.
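The selection and the totals mentioned above can be sketched in a few lines of Python. This is only an illustration: the incidents, the field names (organization, year, method, records_lost) and the helper functions are hypothetical, and the numbers are not taken from the real data set. The only thing that comes from the visualization is the threshold of 30K records.

```python
from collections import defaultdict

# Threshold used in the visualization: breaches greater than 30K records.
THRESHOLD = 30_000

# Hypothetical incidents, only to show the mechanics.
incidents = [
    {"organization": "Org A", "year": 2016, "method": "hacked", "records_lost": 1_000_000},
    {"organization": "Org B", "year": 2017, "method": "poor security", "records_lost": 45_000},
    {"organization": "Org C", "year": 2017, "method": "hacked", "records_lost": 12_000},
    {"organization": "Org D", "year": 2017, "method": "lost device", "records_lost": None},
]


def large_breaches(rows, threshold=THRESHOLD):
    """Keep incidents with a known record count strictly above the threshold."""
    return [
        row for row in rows
        if row["records_lost"] is not None and row["records_lost"] > threshold
    ]


def total_by(rows, key):
    """Sum the records affected, grouped by one field (for example year or method)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["records_lost"]
    return dict(totals)


selected = large_breaches(incidents)
records_by_year = total_by(selected, "year")
records_by_method = total_by(selected, "method")
```

Incidents with an unknown record count are left out of the totals here; keeping them as zero would be another reasonable choice, depending on what the comparison is for.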
If you are interested in a full data set of incidents (thousands), you can download the data in CSV format from the Privacyrights.org site.