So, what exactly is Databricks? And what is it used for?
Databricks processes data
At its core, Databricks reads, writes, transforms and performs calculations on data. You’ll see this variously described as “processing” data, “ETL” or “ELT” (which stand for “extract, transform, load” and “extract, load, transform” respectively). They all basically mean the same thing.
That might not sound like a lot, but it is. Do this well, and you can undertake pretty much any data-related workload.
You see, this processing — these transformations and calculations — can be nearly anything. For example, they could be aggregations (e.g. counting records, or finding the maximum or minimum value), joins between datasets, or even something more complex like training or using a machine learning model.
To tell Databricks what processing to do, you write code. Databricks is very flexible in the language you choose: SQL, Python, Scala, Java and R are all options, and all are common skills among data professionals.
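To give a flavour of what that code looks like, here’s a minimal PySpark sketch of a transform-and-aggregate job. The table names and columns are made up for illustration; in a Databricks notebook, the spark session is already defined for you.

```python
from pyspark.sql import functions as F

# Read two (hypothetical) tables of raw data
orders = spark.read.table("raw.orders")        # order_id, customer_id, amount
customers = spark.read.table("raw.customers")  # customer_id, region

# Transform: join the datasets, then aggregate per region
sales_by_region = (
    orders.join(customers, on="customer_id")
    .groupBy("region")
    .agg(
        F.count("order_id").alias("order_count"),
        F.max("amount").alias("largest_order"),
    )
)

# Load: write the result back out as a table
sales_by_region.write.mode("overwrite").saveAsTable("analytics.sales_by_region")
```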
Databricks uses Apache Spark to process data
Sitting at the heart of Databricks is the engine that does this data processing: an open-source technology called Apache Spark. And this is no surprise. Spark is the dominant data processing tool in the world of big data, and Databricks was founded by the creators of Spark.
So why not just use Spark instead? Well, you can if you really want to. To do the data processing — to run Apache Spark — you’ll need a cluster of computers: multiple machines (called “nodes”) working together, each with its own memory and multiple cores. The data is distributed, and the tasks that form the data processing workload are performed in parallel across the nodes and their cores. This distributed, parallel design is critical for working with large data and for scaling into the future.
But spinning up, configuring, altering and maintaining a cluster is a pain. And installing, configuring, optimising and maintaining Spark is a pain too. It’s easy to spend your time and effort just looking after these, rather than focusing on processing your data, and thereby generating value. (And, yes, that includes using cloud virtual machines or cloud-native, managed Spark services.)
Databricks takes away that pain. Databricks allows you to define what you want in your clusters, and then looks after the rest. Clusters only come into existence when you need them and disappear when you’re not using them. Spark comes pre-installed and configured. Databricks even auto-scales the clusters within your predefined limits, adding or subtracting nodes as the scale of the processing increases or decreases. It all means you can focus on your data processing, and therefore on generating value, rather than managing the supporting infrastructure.
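To make that concrete, here’s a hedged sketch of defining such a cluster through the Databricks Clusters REST API. The runtime version, node type and limits are illustrative placeholders (valid values depend on your cloud and workspace), and in practice you might just use the workspace UI instead.

```python
import requests

# Illustrative values only: check your workspace for valid
# spark_version and node_type_id values.
cluster_spec = {
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",  # an AWS instance type, for example
    "autoscale": {"min_workers": 2, "max_workers": 8},  # Databricks scales within these limits
    "autotermination_minutes": 30,  # the cluster shuts down when idle
}

resp = requests.post(
    "https://<your-workspace>.cloud.databricks.com/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <your-access-token>"},
    json=cluster_spec,
)
print(resp.json())  # returns the new cluster's ID on success
```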
Even better, the Spark that runs on Databricks is heavily optimised, as are the clusters that Databricks uses. This means that Spark runs faster and more efficiently on Databricks than anywhere else. (Remember, the Databricks folks are the very same ones who created Spark.)
Ok, so Databricks is essentially about processing data. It does this using the dominant data processing technology for big data, and it runs that technology better than anywhere else. The real trick, though, is that Databricks builds on this flexible, performant core to extend it into an entire data platform.
How’s Databricks different from a database or data warehouse?
Databases and data warehouses can process data too. But their engines are fundamentally designed to query data with low latency: to be responsive when you ask questions of your data, particularly smaller quantities of it.
Databricks, using Spark, is designed for throughput. It’s a workhorse that’s designed to process data at scale. To perform those transformations and calculations super-efficiently, and to shine as data gets large.
In addition, to improve its query performance, Databricks has introduced another engine called Photon, which is compatible with, and complementary to, Spark. Spark plus Photon is how Databricks covers the full length of the data processing spectrum.
However, when comparing Databricks with databases or data warehouses, there’s another key difference: how and where your data is stored.
Databricks reads and writes data, but you control where and how your data is stored
A database or data warehouse not only processes your data using its own query engine, it also stores your data in its own format. You can only access that data through the database or data warehouse itself. And in some cases, once you put your data in there, you have to pay to read it back out.
Databricks doesn’t store data. (Granted, there are some subtleties here. But this statement, and those that follow, all hold when you implement Databricks using best practices.)
Databricks reads data from storage and writes data to storage, but that storage is your own — depending on your cloud of choice, your data will be in Amazon S3, Azure Data Lake Storage Gen2 or Google Cloud Storage.
And Databricks doesn’t require the use of a proprietary data storage format; it uses open-source formats, although it can read from and write to databases too. The choice is yours.
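For example, reading and writing your own cloud storage in PySpark looks something like this. The bucket and paths are hypothetical, and the URI scheme changes with your cloud (s3:// on AWS, abfss:// on Azure, gs:// on Google Cloud).

```python
# Read Parquet files straight from your own cloud storage
events = spark.read.parquet("s3://my-company-lake/raw/events/")

# Transform as needed
daily_counts = events.groupBy("event_date").count()

# Write the result back to your storage, still in an open format
daily_counts.write.mode("overwrite").parquet("s3://my-company-lake/curated/daily_counts/")
```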
The net result is that you always have full control of your data. You know exactly where it is and how it is stored. You’re not locked in either: if you want to access your data without using Databricks, then you can.
Databricks combines your data lake and data warehouse into the data lakehouse
Basic object storage, like that offered by the cloud providers, is super flexible. It’s how you make a data lake, which is one of the keys to a successful data science and machine learning capability. But data lakes provide few guarantees and little robustness.
So, Databricks has developed and released its own open-source data storage format, called Delta Lake. Delta Lake builds on the open-source Apache Parquet storage format (Spark’s preferred storage format) by adding a “transaction log”: a list of all operations performed on your data. But the data itself remains in the well-known Parquet format, and can be accessed without using Databricks or even Spark.
Using Delta Lake provides “ACID compliance” (atomicity, consistency, isolation and durability) to your stored data. This means you get:
• Guarantees on reading and writing your data that you’d otherwise only get from database-style storage
• The ability to read and write batches of data and streams of real-time data to the same place
• Schema enforcement or modification, like you would with a database
• “Time travel”, which means you can read or revert to older versions of your data (see the sketch after this list)
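Here’s that sketch: a minimal, hypothetical example of writing a Delta table and then time-travelling back to an earlier version. It assumes a Databricks cluster, where Delta Lake support is built in.

```python
# Tiny example DataFrames (in a Databricks notebook, spark is predefined)
df = spark.createDataFrame([(1, "Ana"), (2, "Ben")], ["id", "name"])
new_rows = spark.createDataFrame([(3, "Cho")], ["id", "name"])

path = "s3://my-company-lake/curated/customers_delta/"  # hypothetical path

# Each write is recorded in the transaction log alongside the Parquet files
df.write.format("delta").mode("overwrite").save(path)

# Appending later creates a new version in the transaction log
new_rows.write.format("delta").mode("append").save(path)

# Time travel: read the table as it was at version 0
original = spark.read.format("delta").option("versionAsOf", 0).load(path)
```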
Bottom line: with Delta Lake, Databricks can treat data sitting in a data lake on cloud storage much as if it were in a data warehouse. You get the benefits of both the data lake and the data warehouse. And so, Databricks allows you to combine the concepts of a data lake and a data warehouse into the “data lakehouse”. It’s a very powerful concept and a great way of simplifying your data systems.
If you read material from Databricks, including their website, you’ll see they’re big on the Lakehouse. Now you know why.
As important as Spark and Delta Lake are, Databricks is more than just those
On top of its data processing engine, Spark, and its preferred storage format, Delta Lake, Databricks has a variety of other features that allow you to make the most of your data.
It enables an end-to-end workflow for data science and machine learning projects. Databricks clusters can be spun up with machine learning packages, and even GPUs, for exploring data and training models. Data scientists and machine learning engineers write their code in interactive notebooks, which are similar to (but distinct from) Jupyter Notebooks.
Databricks then enables the whole “MLOps” (DevOps for machine learning) lifecycle with MLflow, another piece of integrated open-source software, plus a slew of machine learning features packaged together under the banner of Databricks Machine Learning.
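As a small, hedged example of what MLflow experiment tracking looks like in a Databricks notebook (the model and data here are placeholders):

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)

# Each run records parameters, metrics and the model itself,
# so experiments are reproducible and comparable
with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X, y)
    mlflow.log_param("n_estimators", 50)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")
```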
For data analysts and business intelligence professionals, Databricks also offers Databricks SQL. This is an interface and engine that looks and feels like the interactive development environment of a database or data warehouse. Analysts can write and execute SQL queries just as they would against more traditional SQL-based systems.
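For instance, a query like the one below runs just as it would on a traditional warehouse. The sales table is hypothetical, and while an analyst would type the SQL directly into the Databricks SQL editor, it’s wrapped in spark.sql() here to keep these examples in one language.

```python
# The same SQL an analyst would write in the Databricks SQL editor
top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
    LIMIT 10
""")
top_regions.show()
```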
From there, it’s even possible to build visuals, reports and dashboards. Or analysts can hook Databricks up to their preferred business intelligence tooling, like Power BI, Tableau or Looker.
There are heaps more features to Databricks that further round out its capabilities as an all-around data platform, and more are continually being added. Conceptually, the goal is to make it the one place that a data team can go to do whatever data-related work it needs to accomplish.
Databricks runs in the cloud
Databricks is available on top of your existing cloud, whether that’s Amazon Web Services (AWS), Microsoft Azure, Google Cloud, or even a multi-cloud combination of those. Databricks does not operate on-premises.
It uses the cloud providers for:
• Compute clusters. In AWS they’re EC2 virtual machines, in Azure they’re Azure VMs, and in Google Cloud the cluster runs in Google Kubernetes Engine.
• Storage. As mentioned earlier, Databricks doesn’t store data itself. Instead, data is stored in native cloud storage: in AWS that’s S3, in Azure it’s Azure Data Lake Storage Gen2, and in Google Cloud it’s Google Cloud Storage.
• Networking and security. This includes integrating with your existing networks, identity and access management, and storing and accessing secrets.
If you want, you can connect Databricks to other cloud-native tools and services. But it plays really well on its own too.
Once deployed and configured, your data team accesses a Databricks workspace through its own browser interface. You don’t need to go through a cloud console or the like. The team can effectively just do its work through Databricks and, in general, doesn’t need to know about the details of the cloud underneath.
Databricks is a single data platform for all your needs
Bringing all of this together, you can see how Databricks is a single, cloud-based platform that can handle all of your data needs. It’s the data lakehouse. It’s the place to do data science and machine learning.
Databricks can therefore be the one-stop shop for your entire data team: their Swiss Army knife for data. A place where they can all collaborate, rather than using a complex mix of technologies.
It can unify and simplify your data systems, mixing all sorts of data that arrives in all sorts of different ways.
Plus, Databricks is fast, cost-effective and inherently scales to very large data. Done well, you can architect it once and then let it scale to meet your needs.