Originally published on Gigaom (May 2012)

“The field of [big data] draws on findings from statistics, databases, and artificial intelligence to construct tools that let users gain insight from massive data sets.”

– Michael Pazzani, Intelligent Systems

Pazzani’s quote was originally published in February 1999 and did not reference big data; it referenced knowledge discovery and data mining (KDD). In the ’90s, there was considerable excitement around KDD, a new field of analysis that seemed poised to revolutionize the way we make decisions. Fast forward 15 years or so, and KDD is firmly entrenched within the plateau of productivity; it has profoundly changed the way that organizations and institutions answer questions.

Like KDD, big data’s potential resides within its ability to help us make better decisions and, like KDD, big data is part technological innovation and part cultural evolution. Whether big data is something completely new, an iteration of KDD or simply a fad is a matter of some debate.

To help shed some light on the debate, we’ll take a look at some of the unique attributes of big data.

The social media tidal wave

For data to exist, something must happen, be observed and accounted for. In other words, if a tree falls in a forest, it can only become data if someone observes the tree falling and makes a note of it.

Ten years ago, if you were to have a good meal at your favorite restaurant, you might tell your friends and family about it the next day. Data would have been created and would have summarily disappeared. Today, when you arrive at the restaurant, you check in on Foursquare. While you’re eating, you might take a photo of your meal and post it to Facebook. After the meal, you might share a review on Yelp. All of these actions set off a chain reaction among your friends and family and, all the while, massive amounts of data are being generated.

Social media adoption has surged over the past five years and has virtually become a societal standard. Companies such as Facebook and Twitter have provided the world with a social platform and, in the process, created a living, breathing archive of human activity filled with potential knowledge. This unprecedented tidal wave of user-generated data is a major contributor to big data. However, big data is not only about incrementally new data; it is also about increasing access to existing data.

The migration of data into the cloud

The increased flow of data is not only being driven by innovations in technology and social practices; it is also being driven by broader cultural and institutional shifts. The cloud animates formerly lifeless artifacts, transforming them into social, data-generating objects.

In a pre-social media environment, people would host digital assets such as photos, documents and contacts on local machines, where these objects would sit lifeless and dormant. Today, photos are hosted on Pinterest, where they can be rated and shared; documents are hosted on Google Docs, where they become points of collaboration; and contacts are hosted on LinkedIn, where they become associated with actual people.

A new age of openness and transparency

In August 2000, the U.S. Securities and Exchange Commission issued Regulation Fair Disclosure, which mandated that all publicly traded companies would have to share material information in a non-exclusive manner via a publicly available database. This development was part of a broader societal shift toward greater openness and transparency.

Some institutional shifts (such as within the financial sector) have been motivated by public policy, some have been based on ideology, and others have been motivated by the growing realization that the benefits of openness outweigh its risks. Industry has shown us frequent examples of how openness can lead to valuable market feedback, brand loyalty, strategic partnerships and a host of other tangible benefits.

But making more data available is just one half of the big data equation. The other half is how we make sense of that data. For data to become information (and ultimately knowledge), it must be interpreted.

It’s all semantics

The concept of adopting a common format for how data is structured on the web is not new; the history of XML can be traced to the late 1980s, before the Internet explosion. What is relatively new is that institutions are finally beginning to adopt common data formats. Consider the following examples:

Over the past decade, NewsML has become a standard within the news industry, allowing news producers and publishers to easily share and organize content.
The financial industry has begun to adopt XBRL as a standard for structuring business information.
Local and hyperlocal organizations have begun geotagging content to provide users with a more regionally relevant experience.

Recently, Google has dropped hints that it will be placing increased value on semantic information as it evolves from a service that provides search results by analyzing content into one that answers questions by analyzing data. As Google turns, so does the marketing industry as a whole, and we can expect the next few years to see even more adoption of common data formats.

The past decade has been witness to some remarkable changes in technology and society. It remains to be seen whether “big data” is the flavor of the month or something more profound. What is indisputable is that more data is available than ever before and we now have the tools to help us make sense of that data. The next steps are on us.