The stacks at New York Public Library
What powers machine learning?
What is the oil that machine learning engines run on?
Knowledge. Machine learning engines run on knowledge.
History. Machine learning engines run on history.
The crass word for “knowledge” and “history” is … data.
Machine learning runs on data.
The more data, the better the data, the more innovation. Great data beats great algorithms.
Great innovations will come from machine learning that mines great data. But who will bring about these innovations?
That’s a very important question.
There are two potential futures.
One future is a world where all the important data — consumer behavior, health outcomes, economic data, and the like — resides with a very small number of giant companies. Companies like Google, Facebook, Tencent, and Amazon. These companies jealously guard their data because it’s what gives them their competitive edge.
And because they have a monopoly on the data, they monopolize innovation. Yes, some innovation will still happen within start-ups. But the bulk of the world’s innovation could be concentrated in the hands of the few companies that control the knowledge. And the monetary gains from that innovation will go to an increasingly narrow group of people.
This is not necessarily an evil world. But monopolies slow down technological progress. Monopolies promote scarcity despite a reality of abundance. Monopolies concentrate and stifle economic activity instead of distributing and promoting it.
Is that a world that you want to live in?
There is a second version of the future, though. This is a world where access to data — to knowledge and history — is made available to all potential innovators.
In version 2.0 of the world, data is democratized. And innovators everywhere can build on that knowledge.
Open access to new technologies and core infrastructure drives innovation. The same is true for data.
“It doesn’t matter, we have all the data”
One recent catalyst in machine learning has been the invention of TensorFlow, a machine learning library that makes it especially easy to build deep learning models. In my opinion, it is one of the top ten most important innovations in the last decade. It has already helped create very large gains in deep learning. It is amazing.
TensorFlow is an open-source ML framework by Google
Google put many of its most talented engineers on the team that built TensorFlow. Jeff Dean, arguably the most famous (and highly compensated) engineer at Google, led the team. The company likely spent over $1 billion developing TensorFlow. And then Google did the unimaginable…they open-sourced it.
Not long after Google announced the project, I caught up with a core engineer on TensorFlow. I asked him why Google would open-source such an amazing innovation. Isn’t machine learning the key to Google’s future? He calmly replied “it doesn’t matter, we have all the data.”
Back to our two futures.
In the first future, a small number of companies have all the world’s data. In that world, even the most advanced models and machine learning frameworks don’t matter. Here, a few companies will always win by brute force because ground-truth data matters more for model performance than fancy algorithms do. In this future, innovation slows down due to lack of competition, and a very small number of companies command most of the innovation rents.
But in the second future — the one where all innovators can get access to the high-quality data that they need — innovation becomes the competitive advantage. Because all innovators have access to truth about the world, the barrier is not access to the past, but imagination for the future. Here we value creativity. Here we value real innovation. Here we focus on the future rather than the past.
The second future is not such a far-off world. We already have democratized access to compute power. Open access to compute power (via AWS, Microsoft Azure, Google Compute, and more) has massively accelerated innovation. And no, it’s not free. But it is available to anyone who wants to pay for it. More and more innovation is happening around cloud compute, like the advent of Docker containers, so the price of compute power keeps declining.
The good news is that there are a bunch of companies and nonprofit initiatives working on democratizing access to different types of data: SafeGraph (my company), Data.World, data.gov, Siftery, OpenStreetMap, weather services (like AccuWeather and The Weather Company), Dun & Bradstreet, Second Measure, and many more. There are also data marketplaces and initiatives within non-data companies to make data more accessible, like AWS public datasets, Oracle Data Cloud, Salesforce.com data marketplace, LiveRamp (my old company) data store, Quandl, Esri data marketplace, and more. But there is still a long way to go to challenge the data monopolies currently accumulating inside the walls of a few giants.
Of course, we need to open up data responsibly. Unlike compute power, data can reveal private matters in a person’s life. The promise of progress (like coming up with the best cancer treatment for each individual) does not need to come with the cost of all of us giving up our privacy.
Data can be very personal, and people should have the ultimate say in whether or not their data is used for analysis. And for those who allow their data to be used, companies need to take extreme care to ensure data privacy. This involves implementing strict safeguards: ultimately building tools that allow companies to analyze datasets and train models without actually seeing the underlying data, and ensuring that any data that companies do see is k-anonymous for some large k. (A dataset is k-anonymous when every person in it is indistinguishable from at least k-1 others on the attributes that could identify them.)
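To make the k-anonymity idea concrete, here is a minimal sketch in Python. It assumes we already know which columns are "quasi-identifiers" (attributes that could be combined to single someone out). The function names, column names, and records are all hypothetical and made up for illustration; real privacy tooling does far more than this.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping, Sequence


def group_sizes(
    rows: Iterable[Mapping[str, Hashable]],
    quasi_identifiers: Sequence[str],
) -> Counter:
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)


def is_k_anonymous(
    rows: Iterable[Mapping[str, Hashable]],
    quasi_identifiers: Sequence[str],
    k: int,
) -> bool:
    """True if every quasi-identifier combination present appears in >= k rows."""
    return all(size >= k for size in group_sizes(rows, quasi_identifiers).values())


# Hypothetical, made-up records; the column names are illustrative only.
records = [
    {"zip3": "941", "age_band": "30-39", "visits": 4},
    {"zip3": "941", "age_band": "30-39", "visits": 7},
    {"zip3": "941", "age_band": "30-39", "visits": 2},
    {"zip3": "100", "age_band": "40-49", "visits": 5},
    {"zip3": "100", "age_band": "40-49", "visits": 1},
]

if __name__ == "__main__":
    # Assumption: zip3 and age_band are the quasi-identifiers for this data.
    print(is_k_anonymous(records, ["zip3", "age_band"], k=2))
    print(is_k_anonymous(records, ["zip3", "age_band"], k=3))
```

In this toy dataset the smallest group has two people, so the check passes for k=2 and fails for k=3. A larger k means each person hides in a bigger crowd, which is why the text asks for "some large k".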
We need open information to power innovation.
Data should be an open platform, not a trade secret. Information should not be hoarded so that only a few can innovate. We need as many organizations as possible working to solve the challenges facing humanity, and that requires that everyone have access to powerful datasets.
When we open the world’s most powerful data — when small startups, new initiatives inside larger companies, and massive companies can all access the same datasets — then competition will increase, innovation will proliferate, and societal growth will accelerate exponentially. More people will benefit from technology, faster. This is the future that I want to live in.
Special thank you to Ryan Fox Squire and Noah Yonack for helping draft this piece.