By Tiernan Ray
Michael Stonebraker is a living legend in the database world.
Not a year goes by that I don’t hear this or that tech startup or industry executive refer with awe and admiration to his work of the past 40 years.
Stonebraker refined the relational techniques that today form the heart of billions of dollars in annual software sales by Oracle (ORCL), International Business Machines (IBM), and Microsoft (MSFT).
Friday afternoon, I had the honor of speaking with Stonebraker by telephone, to congratulate him on being this year’s recipient of the Association for Computing Machinery’s “Turing Award” for his groundbreaking contributions to computer science. More on the award can be found on the ACM Web site.
Stonebraker, a professor at MIT’s Computer Science and Artificial Intelligence Lab, said he was honored, remarking that “This is the Nobel Prize of computer science.”
(Stonebraker’s full bio can be found on the CSAIL Web site.)
Stonebraker not only continues to do fundamental research into database theory, but also has deep and informed views on the state of the industry that are of value to any tech investor.
Among the provocative views he shared with me is that current database technology from Oracle and others is “obsolete,” and that Facebook (FB) is grappling with “the biggest database problem in the world.”
Stonebraker’s chief claim to fame is having pushed forward the relational technology first conceptualized by E.F. ‘Ted’ Codd in a seminal paper in 1970.
Stonebraker’s companies, Ingres, then Postgres, and many more after that, such as Vertica (acquired by Hewlett-Packard (HPQ)), helped bring Codd’s academic notions to fruition in the commercial marketplace.
When asked what he contributed, Stonebraker is remarkably humble: he deftly switched the conversation to a glowing summation of Codd, whom he refers to as “Ted.”
“What Ted proposed at the time was radical,” he said. “It was a complete change from how things were being done in database.”
Ted saw two important things. At the time, there were databases such as IBM’s IMS, which was structured as a hierarchy, and the CODASYL database, which was structured as a network of connections between objects. Ted realized that what people inherently understand are relations, and so he turned the problem of data management into one of relations. That dramatically simplified things. He saw that things must be kept simple. Ted really followed the KISS principle [Keep It Simple, Stupid].
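To make that contrast concrete, here is a minimal sketch in Python, using the built-in sqlite3 module; the table names and data are invented for illustration. The data lives in flat relations, and the structure that IMS or CODASYL would have hard-wired as pointers is reconstructed on demand by a join.

```python
import sqlite3

# Codd's model: everything is a flat relation (table); relationships
# are recovered by matching values, not by navigating stored
# parent/child pointers as in IMS (hierarchy) or CODASYL (network).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE supplier (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE part (pid INTEGER PRIMARY KEY, name TEXT, sid INTEGER);
""")
db.executemany("INSERT INTO supplier VALUES (?, ?)",
               [(1, "Acme"), (2, "Globex")])
db.executemany("INSERT INTO part VALUES (?, ?, ?)",
               [(10, "valve", 1), (11, "pipe", 1), (12, "faucet", 2)])

# A join reconstructs the hierarchy on demand, purely from values.
for supplier, part in db.execute(
        "SELECT s.name, p.name FROM supplier s "
        "JOIN part p ON p.sid = s.sid ORDER BY s.name, p.name"):
    print(supplier, part)
```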
The other big breakthrough by Codd, says Stonebraker, was moving the actual manipulation of data away from assembly language programming of the time to higher levels of abstraction that would later become structured query language, or SQL.
The conventional wisdom at the time was that you should build for the particulars of how the data is stored. He saw that made no sense. He brought principles of encapsulation and abstraction to programming databases, as with a high-level language in programming. The problem with the assembly approach was, your data lives a very long time. Today, your business might be in plumbing supplies. But then you merge your company with another company, and now you’re in plumbing supplies and beauty supplies. Inevitably, your data structures will change as a result. If you write these assembly language programs, you have to throw them away and recode when things change, whereas if you write at a high level, with data independence, your programs will not be dependent on the structure of the data.
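A minimal sketch of that data-independence point, again with invented data: the declarative query names only the logical fields it needs, so it survives the structural change the merger forces, while record-at-a-time code written against the old layout would have to be recoded.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, product TEXT, qty INTEGER)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)",
               [(1, "valve", 5), (2, "pipe", 3)])

# Written once, at a high level: it names logical fields, not storage.
query = "SELECT product, SUM(qty) FROM orders GROUP BY product"
print(db.execute(query).fetchall())

# The merger happens: the structure changes (new column, new index).
db.execute("ALTER TABLE orders ADD COLUMN line TEXT DEFAULT 'plumbing'")
db.execute("CREATE INDEX orders_line ON orders(line)")
db.execute("INSERT INTO orders VALUES (3, 'lipstick', 7, 'beauty')")

# The same query still runs, untouched: data independence at work.
print(db.execute(query).fetchall())
```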
As for his own contribution, Stonebraker is equally humble. In conversation, he uses “We” instead of “I,” and he notes that the Ingres database “was the contribution of quite a number of people.”
Among them, Jerry Held and Gene Wong joined Stonebraker in receiving the ACM’s “System Software Award” in 1988 for Ingres.
Stonebraker notes that he and his fellow pioneers brought Codd’s lofty relational ideas into the realm of ordinary individuals:
Ted was a mathematician, and he wrote his things in mathematical terms that no mere mortal could handle. We turned it into constructs that could be manipulated by ordinary people. Second, it was argued at the time that RDBMS couldn’t perform, but we showed it could be efficient.
Oracle, Microsoft, IBM have a problem
Turning to the present, Stonebraker tells me Oracle’s database, IBM’s DB2, and Microsoft’s SQL Server are all obsolete, facing a couple of major challenges.
One is that at the time, they were designed for “business data processing.”
“But now there is also scientific data and social media, and web logs, and you name it! The number of people with database problems is now of a much broader scope.”
Second, “We were writing Ingres and System R for machines with a small main memory, so they were disk-based — they were what we call ‘row stores.’”
“You stored data on disk record by record by record. All major database systems of the last 30 years looked like that — Postgres, Ingres, DB2, Oracle DB, SQL Server — they’re all disk-based row stores.”
But, with the fall in the cost of memory chips, “main memory is now cheap enough that OLTP [online transaction processing] is going to be main-memory databases, increasingly, and they don’t look like disk-based row stores at all,” contends Stonebraker. He cites as examples the newer in-memory database HANA from SAP (SAP), and also his own new initiative, VoltDB.
The third of the database market that is legacy Oracle, SQL Server, and DB2 will be replaced by things such as VoltDB; whether Oracle will adapt or die remains to be seen:
Oracle, SQL Server, and DB2 are legacy code at this point. There’s that great book by Clayton Christensen, The Innovator’s Dilemma. All the system software vendors are up against the innovator’s dilemma. They are selling the old technology, and the question is how will they morph without losing their customer base? There’s no question that with Oracle, the customers are dug in pretty deep in the traditional systems, but my point of view is there is a two-orders-of-magnitude performance difference to be had with other technology approaches, and sooner or later that will be significant. It may take a decade or longer for the legacy stuff to actually die away — there’s still a lot of IMS data in production in the real world! — but sooner or later it will get replaced. My point of view is that if you want to do 50 transactions per second, it doesn’t matter what technology you use; you can use whatever you want. But if you want to run 50,000 transactions per second, your current implementation is simply not going to do it. Sooner or later, you are going to be up against a technology wall that will force you to move to new technology, and it will be completely based on return on investment.
Where NoSQL and Hadoop are going
Another third of the market, focused on “data warehousing,” is moving from row stores to “column stores,” which can be far more efficient, he says. “All the data warehouse vendors have converted to column stores or are in the process.”
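A toy illustration of the two layouts, with invented data: a row store keeps each record’s fields together, which suits transaction processing, while a column store keeps each attribute together, so an analytic query touching one attribute scans a single dense array instead of every record.

```python
# Row store: each record's fields stored together -- good for OLTP,
# where you read and write whole records one at a time.
rows = [
    ("valve",  "NE", 120.0),
    ("pipe",   "MD",  80.0),
    ("faucet", "NE", 200.0),
]

# Column store: each attribute stored contiguously -- an analytic
# query touching one column scans only that column.
cols = {
    "product": ["valve", "pipe", "faucet"],
    "region":  ["NE", "MD", "NE"],
    "sales":   [120.0, 80.0, 200.0],
}

# "Total sales": the row layout touches every field of every record;
# the column layout reads one dense, cache-friendly array.
print(sum(r[2] for r in rows))   # row-at-a-time
print(sum(cols["sales"]))        # single column scan
```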
The last third is “everything else,” says Stonebraker.
That includes “NoSQL” databases such as MarkLogic, which I profiled recently; and Hadoop, the open-source data-processing framework modeled on Google’s MapReduce work, now commercialized by startup Cloudera and by Hortonworks (HDP).
There are 100 or more of these NoSQL companies, and Stonebraker thinks they will all eventually end up looking like SQL databases. “It started out, NoSQL meant ‘Not SQL,’ then it became ‘Not only SQL,’ and now I think it means ‘Not-yet-SQL,’” he quips.
“NoSQL proposes low-level languages, and they are betting against the compiler, and that’s an incredibly dangerous thing to do,” he says, just like the assembly-language programming back in the day. He thinks VoltDB and other approaches can fix the problems brought about by legacy RDBMSs, and the “NoSQL guys will drift toward looking at SQL,” he contends. “They will move to higher-level languages, and the only game in town is SQL.”
As for Hadoop, it will take on SQL aspects and merge with data warehousing:
If you look at the major vendors there, Cloudera, Facebook, and Hortonworks, and then at what Cloudera is doing, they released the Impala system a little while ago. If you take a careful look at it, it is a SQL engine. The historical Hadoop stack was Hive on top of MapReduce, on top of HDFS. Look at Impala, and MapReduce is nowhere to be found. I think everyone pretty much agrees the MapReduce interface is not very interesting. None of the data warehouse guys have anything that looks like that. So I think MapReduce will atrophy and be replaced by SQL. Impala is a column store, so it looks like Vertica or Redshift, or any other data warehouse model. So data warehouse and Hadoop are going to completely merge eventually.
And so, “Hadoop will look like the data warehouse market, and NoSQL will look like the SQL market.”
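A small sketch of the distinction, with invented data: the same aggregation written MapReduce-style, where the programmer hand-codes the plan, and declaratively in SQL, where the engine chooses it. Python’s sqlite3 stands in here for a SQL engine; this is not Impala code.

```python
import sqlite3
from collections import defaultdict

# Invented sample data: (department, sale amount).
sales = [("garden", 40.0), ("tools", 25.0), ("garden", 60.0)]

# MapReduce style: the programmer hand-codes the execution plan;
# "map" emits key/value pairs and "reduce" folds them per key.
groups = defaultdict(float)
for dept, amount in sales:   # map + shuffle
    groups[dept] += amount   # reduce
print(dict(groups))

# SQL style: declare the result you want and let the engine pick
# and optimize the plan.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (dept TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", sales)
print(db.execute(
    "SELECT dept, SUM(amount) FROM sales GROUP BY dept").fetchall())
```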
Arrays, graphs and data science take over
More interesting to Stonebraker are areas such as the “social graph” of Facebook, and the emerging area of data science.
He predicts a lot of business analysts who run data warehouses will be replaced in years to come by data scientists, who are trained to work with arrays rather than tables, and with techniques such as regression analysis, Bayesian analysis, and other approaches represented by programs such as the statistical package R:
Another incredibly dominant trend right now is that the data warehouse market is about business intelligence; it’s about business analysts using Business Objects, and Cognos, and products like that as a GUI [graphical user interface] in front of a SQL system. They are running SQL analytics. But what I think is guaranteed to happen is that business analysts will be replaced by data scientists. It will take a while, because we don’t have enough trained data scientists, but the market will get much more sophisticated.

Suppose you are the Wal-Mart guy who has to figure out how to provision Wal-Mart products around major snowstorms. The query you want to run is, in the week before the storm and the week after, what sold by department in the Northeast, and compare that with, say, Maryland — that’s standard business intelligence work. And what comes out is a big table of numbers. An alternative is to get data scientists to build a predictive model to predict sales by department in the winter. You run that model and out comes a bunch of predictions, which is what the business guy actually wants.

Sooner or later, the business intelligence world will move to the data science world, using things like regression analysis and Bayesian analysis — these are lots of big words, but all of these techniques, if you look at them, are array-based, not table-based, calculations. People who do data science now often code in Matlab or R. So, as we transition to data science we are going to transition to array-based calculations. The question is, are those going to be done on an RDBMS, or is there room for a new class of array-based data management? I think the jury is completely out, but it’s going to be a sizable market over time and it’s going to happen, maybe not this year, but over time.

It’s a possible opportunity for array-based data management. We just built something to do that, SciDB. It is a commercial product that is array-based. There are certain kinds of data science applications that are getting a lot of traction. The genomics market is one that will be huge as all of us get [genetically] sequenced. The things those guys want to do are completely array-based. SciDB is focused on genomics for the short term, but will eventually move into other areas.
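To illustrate the table-versus-array distinction, here is a toy least-squares model in Python with numpy. The snowfall and sales figures are invented, and this stands in for, rather than reproduces, any particular R or SciDB workflow: the BI query returns a table of history, while the array computation returns a prediction.

```python
import numpy as np

# Invented toy data: snowfall (inches) in the week before a storm,
# and shovel sales per store in one department.
snowfall = np.array([0.0, 2.0, 5.0, 8.0, 12.0])
sales = np.array([20.0, 35.0, 60.0, 82.0, 118.0])

# BI gives you a table of past numbers; the data scientist fits a
# model instead. Least squares for sales ~ a * snowfall + b is an
# array computation, the kind R and Matlab are built around.
A = np.column_stack([snowfall, np.ones_like(snowfall)])
(a, b), *_ = np.linalg.lstsq(A, sales, rcond=None)

# What the business guy actually wants: a prediction for the
# forecast 10-inch storm, not a table of history.
print(f"predicted sales: {a * 10.0 + b:.1f} units")
```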
Facebook has the biggest problem of anyone
His other point is that Facebook has a big problem: Its problem is a graph problem, a matter of “vertices” and “edges,” in the language of graph theory, but Facebook is entirely based on the database technology MySQL, which means that its underlying infrastructure doesn’t fit the task at hand:
Look at Facebook: It is one giant social graph, with the problem of how to find the average distance from anyone to anyone. You can simulate a graph as an edge matrix or a connectivity matrix in an array-based system, you can model graphs in a table system, or you can build a special-purpose engine to implement the graph directly. All three are being prototyped and commercialized, and the jury is out whether there is room for a new graph engine or whether one of the other technologies will be good enough. I think the answer to graph problems is that they will be done by either an array or a table DBMS.

Facebook has a big transaction processing problem: You “friend” me, and that is an update to the social graph. That’s currently implemented on MySQL, and as of three years ago, they had over 4,000 MySQL instances. It’s probably 10,000 now or more. They would love to get rid of MySQL. They are prototyping everything in sight to explore new approaches. The infrastructure is at odds with the nature of their problem, and at such an extreme scale. I would say they have the hardest database problem on the planet. For Facebook, the question is make versus buy, and like Google and Amazon, they are running at such scale that it tilts them toward make rather than buy.
The upshot of all that is, “Off in the future, there will be a fair number of graph and array problems, and it will be interesting to see how those will be solved over time — that’s the equivalent of saying the database world is alive and well, and will continue to flourish for a while.”
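For the curious, here is a minimal sketch of the “model graphs in a table system” option Stonebraker mentions, with invented names and data: the graph is a plain edge list, a “friend” event is one inserted edge, and the average-distance query he cites reduces to breadth-first search.

```python
from collections import defaultdict, deque
from itertools import combinations

# The social graph as a plain edge table (names invented).
edges = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "cat")]

# A "friend" event is one inserted edge; build adjacency for traversal.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

def distance(src, dst):
    """Hop count between two users, via breadth-first search."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return hops
        for nbr in adjacency[node] - seen:
            seen.add(nbr)
            frontier.append((nbr, hops + 1))
    return None  # unreachable

# The query Stonebraker cites: average distance from anyone to anyone.
pairs = list(combinations(adjacency, 2))
print(sum(distance(a, b) for a, b in pairs) / len(pairs))
```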
GPUs and non-volatile RAM may again change databases
In closing, I asked Stonebraker what hardware innovations, like faster DRAM, would eventually impact databases.
He said “Two things are very significant,” one being GPUs, or graphics processing units, the kind of chips in which Nvidia (NVDA) specializes; the other being non-volatile RAM.
Regarding GPUs and other “co-processors”:
There will be various co-processor approaches. No one will build one [a co-processor] just for the database market, because it’s not big enough, so we will have to piggyback on someone else’s technology, and GPUs are here, and we ask, what can we use them for? That is a very active area of investigation. Take a look at Intel’s “Xeon Phi.” At the Intel Science and Tech Center at MIT, one of the things they are having us look at is what to do with Phi, which has very fast floating point performance. Another thing is what to do with FPGAs, among the things hardware guys developed for some other reason.
The thing I think will be way more important is non-volatile RAM, NVRAM. The various vendors are betting on various things, and it is probably coming this decade, and it’s going to be way faster than flash. Flash is too slow to be really interesting — some people are using it [flash] now, but it’s not mainstream. It [NVRAM] is going to be very significant. Hewlett-Packard’s memristor is one of the technologies; Intel is betting on something, though they won’t tell us what it is.
Correction: A prior version of this post attributed the CODASYL database system to IBM, when in fact it was not from IBM. My apologies for any confusion caused by the error.