November 20, 2012, by Brigitte Nerlich

Big Data: Challenges and opportunities

With increasing frequency one can read announcements welcoming us to the age of Big Data (put the phrase “welcome to the age of big data” into Google and you’ll get over 488,000 results). Reading about two recent events in particular sharpened my awareness of this new era, namely a big data event at the British Academy and an STS seminar at UCL entitled ‘Big Data: Big Deal’. The British Academy talked about Big Data as a tsunami, the STS seminar as a deluge. As people who follow my blog know, metaphors are moi, so to speak, but Big Data are not. I normally deal only with Small Data! However, when following the trickle of water metaphors, which soon became a bit of a stream, I realised that Big Data are perhaps more moi (indeed us) than I would have expected, and I wanted to know more. So, what follows is a rather limited and largely derivative exploration of what Big Data may mean to me and to us, and, more indirectly perhaps, what making data public may mean for making science public.

Water, ice and rock metaphors

When I began to scrutinise Big Data, a host of water-related words and phrases (and images) appeared on my computer screen: (data) stream, flow, flood (nice visualisation here), torrent, tsunami, deluge, being swamped by, swimming in, drowning in, channelling and so on, alongside some ice and snow metaphors (data avalanche, data iceberg), some related but different rock(ish) metaphors (data mining, data gold mine, data explosion, and so on), as well as repositories, troves and so on. And of course there are ‘clouds’ (and crowds). But what are we being swept along by, what is exploding? What are the data? What is big? And what are the problems, if any?

What’s big about Big Data?

What data are we talking about when we talk about Big Data? What is big? Are we talking megabytes, gigabytes, terabytes or petabytes? More the latter than the former, it seems. Every day, Google alone processes about 24 petabytes (or 24,000 terabytes) of data. And that’s only the tip of the data iceberg. According to Wikipedia, “[w]hat is considered ‘big data’ varies depending on the capabilities of the organization managing the set. ‘For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.’” Steve Lohr wrote recently in the New York Times: “There is a lot more data, all the time, growing at 50 percent a year, or more than doubling every two years, estimates IDC, a technology research firm.” However, Big Data are not defined by volume alone; complexity and speed matter too.
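
For the numerically minded, here is a back-of-the-envelope sketch (in Python, purely illustrative and not drawn from any source quoted above) of what those figures amount to, including how IDC’s 50-per-cent annual growth rate translates into a doubling time:

```python
# Back-of-the-envelope arithmetic for the figures above (decimal/SI units assumed).
import math

PB_IN_TB = 1_000  # 1 petabyte = 1,000 terabytes

google_daily_pb = 24
print(f"Google: ~{google_daily_pb} PB/day = {google_daily_pb * PB_IN_TB:,} TB/day")

# IDC's estimate: data volumes growing at 50 per cent a year.
annual_growth = 0.50
doubling_time = math.log(2) / math.log(1 + annual_growth)
print(f"At {annual_growth:.0%} a year, data doubles every {doubling_time:.1f} years")
# ~1.7 years, hence Lohr's "more than doubling every two years"
```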

Where do Big Data come from?

Big Data are the backbone of the big sciences, such as physics, astronomy (e.g. the Sloan Digital Sky Survey), climate science, and so on. These and other sciences accumulate and work with datasets that one may call ‘big science data’. As an article entitled ‘Taming Big Data’ pointed out: “At the Large Hadron Collider in Europe (the world’s largest particle accelerator), just one of the several detectors (called Atlas) generates 23 petabytes of data each second of operation. That is 23 million gigabytes per second (for comparison, entry-level iPads and iPhones have 16 gigabyte capacities)” (although other sources talk about just one petabyte per second, or one terabyte per second for Atlas and CMS together).
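
To see why the quoted comparison is so striking, here is a quick sanity check of the arithmetic (a hypothetical illustration of my own; the 16 GB figure is simply the entry-level device capacity mentioned in the quote):

```python
# Sanity-checking the 'Taming Big Data' comparison (decimal/SI units assumed).
atlas_pb_per_second = 23          # figure quoted above; other sources say ~1 PB/s
gb_per_pb = 1_000_000             # 1 PB = 1,000,000 GB

gb_per_second = atlas_pb_per_second * gb_per_pb
print(f"{atlas_pb_per_second} PB/s = {gb_per_second:,} GB/s")

ipad_capacity_gb = 16             # entry-level iPad/iPhone capacity in 2012
print(f"Enough to fill ~{gb_per_second // ipad_capacity_gb:,} such devices per second")
# ~1.4 million 16 GB devices filled every second
```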

The term Big Data refers more frequently to what one may call ‘big social data’, that is, “increasingly accessible sets of public data on anything from education to road safety”. Big data are not just ‘out there’ (in the cosmos, atmosphere or genomes); they are, in a way, ‘us’. They are the ever-increasing traces of our digital selves and our digital lives. As IBM tells us: “Every day, we create 2.5 quintillion bytes of data”. These data may be used for scientific or policy endeavours, such as medical science or transport policy. Some even see them as “keys to solving huge societal problems”, making better predictions (about weather, earthquakes, epidemics and so on), saving lives and coping with disasters. More commonly, though, such public data have come to be seen as a new commercial currency, a new kind of gold that can be mined for profit. We ourselves, every one of us, are involved in the production of this new gold (public or social data). We contribute to it willingly, through crowd-sourcing for example, or unwittingly through everyday digital activities. We basically give these data away.
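
IBM’s figure is hard to picture, so here is one hedged way of scaling it down to a single person (the world-population figure is my rough assumption, not part of IBM’s claim):

```python
# Scaling IBM's "2.5 quintillion bytes a day" to a per-person share (illustrative).
daily_bytes = 2.5e18            # 2.5 quintillion bytes = 2.5 exabytes per day
world_population = 7e9          # rough 2012 estimate, assumed here

print(f"{daily_bytes / 1e18:.1f} exabytes of data per day")
print(f"~{daily_bytes / world_population / 1e6:.0f} MB per person per day")
# roughly 360 MB of digital traces per person, every day
```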

Making sense of Big Data

Big Data pose (technical) questions about how to store, manage, archive and analyse the data, but also, and more importantly, about how to interpret them and make sense of them, what one may call a hermeneutical question. And here the trouble starts: to make sense of Big Data, you need complex data mining programmes, machine learning and algorithms. You may also need sophisticated data visualisation programmes that churn out ever more enticing images, which are supposed to (aesthetically) summarise the data for you and make them amenable to human understanding. But (and there are a lot of buts in this blog) who are the people who can really interpret these (seemingly intuitive) data visualisations (and some are more intuitive than others) and make sense of them? Are these people like you and me, or are they a new kind of expert (data scientists, data miners, so to speak, working to extract gold from the data mines and/or painting pretty pictures with it)? Or, indeed, do Big Data signal the end of expertise as we know it?
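
For readers who have never seen one of these algorithms up close, here is a minimal, self-contained sketch of the kind of pattern-finding the paragraph gestures at: k-means clustering on synthetic data (my own toy example using scikit-learn; nothing here comes from the events or articles discussed above). The algorithm finds the clusters; saying what they mean remains a human, hermeneutical task.

```python
# A toy example of algorithmic pattern-finding: k-means clustering (scikit-learn).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 'data traces': 300 points drawn from three hidden groups.
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])

# The machine recovers three cluster centres; interpreting them is left to us.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(np.round(model.cluster_centers_, 2))
```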

Big Data, policy and decision-making

What is the impact of Big Data on policy and decision-making? Again, some argue that Big Data will lead to better decisions, in the sense of more reliable decisions, unencumbered by the vagaries of the human mind. But can we really ‘compute ourselves to better decisions’ or will ‘we’ be computed out of decision-making? Are we drifting “toward data-driven discovery and decision making”? Will politics and pundits be replaced by Big Data? As one document that surfaced from the flood of online articles on big data pointed out: it is important for “researchers and developers [to] balance automated decision-making against ‘human in the loop’”. How can we keep people in the loop? And amongst the people, I also count us researchers and scientists.

Big data and research

So, what are the implications of Big Data for research, scientific scrutiny and public engagement with science? Will scientific research be dominated by data-intensive research methods that drive a new type of scientific discovery process? What tools are there (apart from increasingly Big Technology) that would allow ‘us’ (whether scientists or people or both) to engage with Big Data, to “interpret and critique the increasing amount of data now available”? Or would that be done for us by ever more sophisticated algorithms instead? Who will be ‘doing’ the research of the future? And how can the wider public engage with such research? As John Naughton pointed out in The Observer on the day I was finishing this blog (18 November): “Whereas once we dreamed up theories and then looked for data to corroborate or refute it, we will increasingly use computer analysis to spot patterns and connections that may have theoretical significance.” By the way, a new academic (open access) journal called ‘Big Data’ will be launching in 2013….

And I haven’t even mentioned Big Data and Big Brother, Big Data and Big Corporations, Big Data and Nate Silver…. But as this is a blog within the ‘Making Science Public’ series, the final set of questions should be about Big Data, open data, open access, public access etc… (and don’t expect any answers!)

Big Data, public data, open data, open access…

A big part of Big Data comes from ‘the public’ and is out there ‘in public’ to be gathered and used, but is it ‘public’ in the sense of ‘Open to the knowledge or judgment of all’? In principle yes; in practice perhaps not quite yet. But things are changing fast. First of all, there is (perhaps) a difference between big science data and big social data. Big science data (see the human genome) are increasingly open data, seen as a public good and made available through open access. Big social data tend to be collected and stored by big corporations and governments (they are, in effect, more private data), so access and use have been more restricted, despite the fact that these data are us, so to speak. However, things are changing here too, as some governments have responded to the challenges posed by big data and open access by saying: “throw open the floodgates”! And even businesses seem to be getting behind open (big) data and data transparency, for a variety of reasons (e.g. ‘monetizing data assets’), which need further scrutiny. Transparency, open access and open data are not necessarily a universal panacea.

Big Data are here to stay; they offer challenges and opportunities to science, society, politics and publics. Let’s not get swept away by them. We have to keep an eye on Big Data because Big Data are already keeping an eye on us.

Added January 2013: Good reflections on the limits of big data here (NYT). Added 11 May 2013: Good literature ‘bundle’ collected by @DALupton here.

Added 2 June 2013: Good discussion of ‘myths’ around big data and data fundamentalism by @katecrawford

Added 1 February 2014: More sociological reflections on big data metaphors by Deborah Lupton here.

And for a critical dissection of the commercialisation of social science and the displacement of critique, read John Holmwood here.

Image: Floodgates, floating drum design. Faraday diversion dam, Clackamas river, Oregon, US. Wikimedia Commons


Posted in big data, open access, visualisation