Operate Vault in recovery mode

Challenge

In exceptional circumstances, you might need to troubleshoot a Vault server that has become unavailable for general use, for example after a configuration change.

Recovery from a snapshot is not always viable in such cases, often because the root cause of the issue prevents the Vault server from starting or servicing user requests.

Diagnosing and resolving such exceptional outage states can require that you access the storage at a low level that is impossible with a running Vault cluster.

Solution

Users of Vault version 1.3.0 or higher can operate Vault in recovery mode to troubleshoot and recover from some extreme circumstances when other methods are unavailable.

Recovery mode allows for direct low level interaction with raw portions of the internal storage for any supported storage type.

Vault limits recovery mode operation to list, read, delete and write operations against keys and values contained under the root path /sys/raw/.

While operating in recovery mode, Vault is not available for responding to standard user requests, and instead just provides the minimum functionality required for maintenance and recovery purposes.

You can learn more about operating Vault in recovery mode by following the lab in this tutorial.

Warning

Ensure you have a backup or snapshot of the Vault server data before using any of the information from this tutorial in a live setting.

Prerequisites

To perform the steps in this tutorial, you need Vault. The Community Edition is suitable for this tutorial.

The Vault foundations tutorials are a great starting point if you are not familiar with Vault.

Some examples use jq to format JSON output, but it is not required.

Prepare environment

Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT.

$ mkdir -p /tmp/learn-vault-recovery/data && \
  export LEARN_VAULT=/tmp/learn-vault-recovery

Write the example configuration

You will begin the scenario with the example configuration file, vault-server.hcl.

Write it to the scenario home directory.

$ cat > "${LEARN_VAULT}"/vault-server.hcl << EOF
api_addr                = "http://127.0.0.1:8200"
cluster_addr            = "http://127.0.0.1:8201"
cluster_name            = "learn-recovery-server"
default_lease_ttl       = "10h"
disable_mlock           = true
max_lease_ttl           = "10h"
pid_file                = "$LEARN_VAULT/pidfile"
ui                      = true

listener "tcp" {
  address       = "127.0.0.1:8200"
  tls_disable   = "true"
}

backend "file" {
  path    = "$LEARN_VAULT/data"
  node_id = "learn-recovery-server"
}
EOF
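
Because the heredoc delimiter EOF is unquoted, the shell expands $LEARN_VAULT while writing the file, so the configuration should contain absolute paths. The following is a minimal sketch of a check for that; the function name check_config_paths is hypothetical and not part of Vault.

```shell
# Hypothetical helper: confirm that the path and pid_file settings in a
# config file all start with the expected directory prefix.
check_config_paths() {
  config_file="$1"
  expected_prefix="$2"
  # Pull the quoted values of the path and pid_file settings.
  values=$(sed -n -E 's/^[[:space:]]*(path|pid_file)[[:space:]]*=[[:space:]]*"([^"]*)".*/\2/p' "$config_file")
  [ -n "$values" ] || return 1
  # Word splitting is fine here as long as the paths contain no spaces.
  for value in $values; do
    case "$value" in
      "$expected_prefix"/*) ;;
      *) echo "unexpected path: $value" >&2; return 1 ;;
    esac
  done
  printf '%s\n' "$values"
}

# Example:
# check_config_paths "$LEARN_VAULT/vault-server.hcl" "$LEARN_VAULT"
```

If LEARN_VAULT was unset when you wrote the file, the values would begin with a bare / and the check would fail.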

Insecure operation

The listener stanza disables TLS (tls_disable = "true"). In production, Vault should always use TLS to enable secure communication between clients and the Vault server. It requires a certificate file and key file on each Vault host.

Start Vault server

$ vault server -config $LEARN_VAULT/vault-server.hcl
==> Vault server configuration:

             Api Address: http://127.0.0.1:8200
                     Cgo: disabled
         Cluster Address: https://127.0.0.1:8201
              Go Version: go1.16.5
              Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
               Log Level: info
                   Mlock: supported: false, enabled: false
           Recovery Mode: false
                 Storage: file
                 Version: Vault v1.8.0
             Version Sha: 82a99f14eb6133f99a975e653d4dac21c17505c7

==> Vault server started! Log data will stream in below:

2021-08-12T15:21:47.361-0400 [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""

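In another terminal session, export the VAULT_ADDR environment variable to address the Vault server. Export LEARN_VAULT again as well, because a new terminal session does not inherit it from the first one.

$ export VAULT_ADDR=http://127.0.0.1:8200
$ export LEARN_VAULT=/tmp/learn-vault-recovery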

Initialize Vault, and write initialization output to the file named .vault_init in the temporary scenario directory specified by $LEARN_VAULT.

$ vault operator init \
    -key-shares=1 \
    -key-threshold=1 \
    > $LEARN_VAULT/.vault_init
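
The .vault_init file holds the unseal key and the initial root token in plaintext. As a sketch of one way to limit exposure, the helper below creates the output file with owner-only permissions; the name write_private is hypothetical. Note that a umask affects only newly created files, so it does not tighten a file that already exists.

```shell
# Hypothetical helper: run a command and save its output to a file that
# only the current user can read or write.
write_private() {
  out_file="$1"
  shift
  # The subshell keeps the stricter umask from leaking into the session.
  ( umask 077 && "$@" > "$out_file" )
}

# Example (assumes a running, uninitialized Vault as in this tutorial):
# write_private "$LEARN_VAULT/.vault_init" \
#   vault operator init -key-shares=1 -key-threshold=1
```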

Insecure operation

Do not run an unsealed Vault in production with a single key share and a single key threshold. This approach is just used here to simplify the unsealing process for this demonstration.

Set the environment variable UNSEAL_KEY with the unseal key as its value.

$ UNSEAL_KEY="$(grep 'Unseal Key 1' "$LEARN_VAULT/.vault_init" | awk '{print $NF}')"

Unseal Vault.

$ vault operator unseal "$UNSEAL_KEY"
Key             Value
---             -----
Seal Type       shamir
Initialized     true
Sealed          false
Total Shares    1
Threshold       1
Version         1.8.0
Storage Type    file
Cluster Name    learn-recovery-server
Cluster ID      5820993a-bde7-b8f9-9894-b5fe07378833
HA Enabled      false

Set the environment variable ROOT_TOKEN with the initial root token as its value.

$ ROOT_TOKEN=$(grep 'Initial Root Token' "$LEARN_VAULT/.vault_init" | awk '{print $NF}')
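
The grep and awk pipelines used for UNSEAL_KEY and ROOT_TOKEN both select one labeled line and keep its last field. The sketch below runs the same pipelines against sample text so you can see what they extract; the sample is abbreviated and its values are placeholders, not real keys.

```shell
# Sample text shaped like the relevant lines of `vault operator init`
# output. The values are placeholders.
init_output='Unseal Key 1: EXAMPLE-UNSEAL-KEY

Initial Root Token: EXAMPLE-ROOT-TOKEN'

# grep selects the labeled line; awk prints its last whitespace-separated field.
unseal_key=$(printf '%s\n' "$init_output" | grep 'Unseal Key 1' | awk '{print $NF}')
root_token=$(printf '%s\n' "$init_output" | grep 'Initial Root Token' | awk '{print $NF}')

echo "unseal key: $unseal_key"
echo "root token: $root_token"
```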

Note

For the purpose of this tutorial, you can use the root token to work with Vault. However, you should use root tokens just for initial setup or in emergencies. As a best practice, use tokens with an appropriate set of policies based on your role in the organization.

Authenticate to Vault with the initial root token.

$ vault login -no-print "$ROOT_TOKEN"

Confirm that you've authenticated to Vault with the initial root token by checking that the token has the root policy attached.

$ vault token lookup | grep policies
policies            [root]

You are now prepared to begin the scenario.

Scenario Introduction

To explore a Vault server running in recovery mode, you will perform the following:

Run a Vault server using filesystem storage.
Log in with the initial root token, enable an audit device, and configure resource quotas.
Stop the Vault server.
Start the server again in recovery mode.
Generate a recovery mode token, and use that token to perform some basic examination of the storage items through the /sys/raw endpoint.

Enable file audit device and resource quota

Enable an audit device and some simple configuration in Vault so that you have data to examine later in recovery mode.

Enable a file audit device with output to the file at $LEARN_VAULT/audit.log.

$ vault audit enable file file_path=$LEARN_VAULT/audit.log
Success! Enabled the file audit device at: file/

Configure resource quotas to exempt the path sys/health from rate limiting, and to enable rate limit audit logging and rate limit response headers.

You will examine this information later as an example of configuration that you can change while in recovery mode, for example to unblock the server from an undesired behavior.

$ vault write /sys/quotas/config \
    rate_limit_exempt_paths=sys/health \
    enable_rate_limit_audit_logging=true \
    enable_rate_limit_response_headers=true

Output:

Success! Data written to: sys/quotas/config

Stop Vault server

Return to the terminal session where you started the Vault server.

Press CTRL+C (or CTRL+BREAK on Windows) to stop the Vault server.

Start server in recovery mode

The /sys/raw API endpoint is not enabled by default. You must start the Vault server in recovery mode, then generate a recovery mode operation token to access the /sys/raw endpoint.

You will use that recovery mode operation token for all operations in the rest of this scenario.

Start Vault server in recovery mode.

$ vault server -config $LEARN_VAULT/vault-server.hcl -recovery

Notice from the output that the server is now running in recovery mode.

==> Vault server configuration:

               Seal Type: shamir
         Cluster Address: http://127.0.0.1:8201
              Go Version: go1.16.5
               Log Level: info
           Recovery Mode: true
                 Storage: file
                 Version: Vault v1.8.0
             Version Sha: 82a99f14eb6133f99a975e653d4dac21c17505c7

==> Vault server started! Log data will stream in below:

2021-08-16T11:21:59.106-0400 [INFO]  proxy environment: http_proxy="" https_proxy="" no_proxy=""

This same information would typically be present in the server logs of a production Vault.
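
If you capture the server output to a file, you can check for recovery mode without reading the log by eye. This is a minimal sketch; the function name is_recovery_mode and the log file name in the example are assumptions, not part of Vault.

```shell
# Hypothetical helper: succeed only if captured startup output reports
# that recovery mode is enabled.
is_recovery_mode() {
  grep -Eq 'Recovery Mode:[[:space:]]*true' "$1"
}

# Example (assumes you started the server with its output copied to a file):
# vault server -config "$LEARN_VAULT/vault-server.hcl" -recovery 2>&1 \
#   | tee "$LEARN_VAULT/server.log"
# is_recovery_mode "$LEARN_VAULT/server.log" && echo "recovery mode is on"
```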

Generate recovery mode operation token

All examples of querying the /sys/raw endpoint in this tutorial require a recovery mode operation token. You will generate one here with the vault CLI, using the vault operator generate-root command.

Return to the other terminal session where you first authenticated with Vault, and generate a one-time password (OTP).

$ vault operator generate-root -generate-otp -recovery-token
l5T1Uym6Fz5ogWOYTzSBAUj7cD

Use the OTP value to initialize the token generation process.

$ vault operator generate-root -init \
    -otp=l5T1Uym6Fz5ogWOYTzSBAUj7cD \
    -recovery-token

Example output:

Nonce         efbe7aa1-2029-89e0-09c1-a45bd3822d4c
Started       true
Progress      0/1
Complete      false
OTP Length    26

You must pass in a quorum of unseal or recovery keys as necessary to generate an encoded token. For this scenario, you pass in just the single unseal key value.

Set the environment variable UNSEAL_KEY with the unseal key as its value.

$ UNSEAL_KEY="$(grep 'Unseal Key 1' "$LEARN_VAULT/.vault_init" | awk '{print $NF}')"

Generate the encoded token.

$ vault operator generate-root \
    -nonce efbe7aa1-2029-89e0-09c1-a45bd3822d4c \
    -recovery-token $UNSEAL_KEY

Successful output resembles this example, and includes the encoded token.

Nonce            efbe7aa1-2029-89e0-09c1-a45bd3822d4c
Started          true
Progress         1/1
Complete         true
Encoded Token    HhtmRzA9P0ABFVkGIRQaKQAZBQQ1LQRQMDY
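
Rather than copying the nonce and encoded token by hand, you could pull them out of the two-column output. The sketch below assumes the label and value are separated by two or more spaces, as in the example output above; the function name table_value is hypothetical.

```shell
# Hypothetical helper: read two-column CLI output on stdin and print the
# value for one label. Labels may contain single spaces ("Encoded Token").
table_value() {
  label="$1"
  awk -v label="$label" '
    {
      # Split the line at the first run of two or more spaces.
      if (match($0, /  +/)) {
        key = substr($0, 1, RSTART - 1)
        value = substr($0, RSTART + RLENGTH)
        if (key == label) { print value; exit }
      }
    }'
}

# Example (assumes OTP holds the one-time password generated earlier):
# NONCE=$(vault operator generate-root -init -otp="$OTP" -recovery-token \
#   | table_value Nonce)
```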

Decode the encoded token to generate the recovery mode operation token.

$ vault operator generate-root \
   -decode=HhtmRzA9P0ABFVkGIRQaKQAZBQQ1LQRQMDY \
   -otp=l5T1Uym6Fz5ogWOYTzSBAUj7cD \
   -recovery-token $UNSEAL_KEY

Example output:

r.2veDRvGoliFCUpTcVFtxngSr

Note the r. prefix on the returned token value, which designates a recovery mode operation token.

Use the value of this recovery mode operation token for all examples of listing and reading /sys/raw/... paths throughout the tutorial.
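
The examples in this tutorial set VAULT_TOKEN on each command line. As an alternative sketch, you could check the prefix once and export the token for the session; the function name is_recovery_token is hypothetical, and the token is the example value from above.

```shell
# Hypothetical helper: succeed if the value looks like a recovery mode
# operation token, that is, "r." followed by at least one character.
is_recovery_token() {
  case "$1" in
    r.?*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Example token from this tutorial; substitute your own value.
token="r.2veDRvGoliFCUpTcVFtxngSr"
if is_recovery_token "$token"; then
  # Exporting VAULT_TOKEN avoids repeating it on every command.
  export VAULT_TOKEN="$token"
fi
```

If you export the token this way, remember to run unset VAULT_TOKEN when you finish.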

Examine storage paths

First list the top level sys/raw/ path.

$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault list sys/raw
Keys
----
core/
logical/
sys/

While Vault encrypts all sensitive secret values, configuration information written to Vault without sensitive content gets stored as plaintext or JSON.

For example, you can find audit device information in the core/audit key, which itself holds a single key named value. You can read the key, and pass its value to jq for a prettier version.

$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault read \
    -field=value \
    sys/raw/core/audit | jq

Example output:

{
  "type": "audit",
  "entries": [
    {
      "table": "audit",
      "path": "file/",
      "type": "file",
      "description": "",
      "uuid": "d2e93952-0eb9-61f5-4f84-eb3f5ca5979b",
      "backend_aware_uuid": "",
      "accessor": "audit_file_4ae38500",
      "config": {},
      "options": {
        "file_path": "/tmp/learn-vault-recovery/audit.log"
      },
      "local": false,
      "seal_wrap": false,
      "namespace_id": "root"
    }
  ]
}

This information corresponds precisely to the file based audit device you enabled earlier.
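
If jq is not installed, a rough alternative is to pull the file_path values out with grep and sed. This sketch assumes a grep that supports -o (GNU and BSD grep do) and does not handle escaped quotes inside the path; the function name audit_file_paths is hypothetical.

```shell
# Hypothetical helper: read audit table JSON on stdin and print each
# file_path value, one per line. Works on compact or pretty-printed JSON.
audit_file_paths() {
  # Match each "file_path": "<value>" pair, then keep only the value.
  grep -o '"file_path":[[:space:]]*"[^"]*"' |
    sed -E 's/^"file_path":[[:space:]]*"(.*)"$/\1/'
}

# Example (assumes VAULT_TOKEN holds your recovery mode operation token):
# vault read -field=value sys/raw/core/audit | audit_file_paths
```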

Tip

When troubleshooting production Vault servers with blocked audit devices, listing this information helps you learn the target file, network port, or socket for the purposes of unblocking the device.

Now list the resource quotas path, sys/raw/sys/quotas.

$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault list sys/raw/sys/quotas/
Keys
----
config
default_rate_limit_exempt_paths_toggle

The config key holds the resource quota configuration you wrote earlier. As with the audit device, it contains a single key named value with the JSON configuration.

$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault read \
    -field=value \
     sys/raw/sys/quotas/config | jq

Example output:

{
  "enable_rate_limit_audit_logging": true,
  "enable_rate_limit_response_headers": true,
  "rate_limit_exempt_paths": ["sys/health"]
}

The configuration details match what you wrote earlier in the enable resource quota step before starting Vault in recovery mode.

Most extreme troubleshooting scenarios which require recovery mode typically involve more than listing or reading keys and values. You typically also need to delete particular keys related to the functionality that is blocking operations.

Warning

Exercise extreme caution when using delete or write operations while in recovery mode. Always validate the key name and contents, and have a snapshot from a time before the modifications at hand before performing any operation that writes to the storage. Enterprise users can coordinate with HashiCorp Customer Success for help with this process.
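
In addition to a snapshot, you might keep a reference copy of an individual value before you change or delete it. The following is a sketch only, not a replacement for a snapshot; the function name backup_raw_key and the backup directory in the example are hypothetical, and the function assumes VAULT_TOKEN already holds your recovery mode operation token.

```shell
# Hypothetical helper: save the current value of one sys/raw key to a file
# and print the file name.
backup_raw_key() {
  raw_path="$1"      # for example: sys/raw/sys/quotas/config
  backup_dir="$2"
  # Turn the storage path into a flat file name: slashes become underscores.
  backup_file="$backup_dir/$(printf '%s' "$raw_path" | tr '/' '_').bak"
  mkdir -p "$backup_dir" || return 1
  # Only report success if the read itself succeeded.
  if vault read -field=value "$raw_path" > "$backup_file"; then
    echo "$backup_file"
  else
    rm -f "$backup_file"
    return 1
  fi
}

# Example:
# backup_raw_key sys/raw/sys/quotas/config "$LEARN_VAULT/raw-backups"
```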

Feel free to explore the other keys and values, and when you finish, you can clean up the scenario environment.

Cleanup

You can clean up from this scenario by following these steps.

From the terminal session where the Vault server is running, press CTRL+C (or CTRL+BREAK on Windows) to stop the server.

Remove the data created in the scenario.

$ rm -rf "$LEARN_VAULT"

Unset the environment variables in the terminal session where you ran the vault CLI commands.

$ unset ROOT_TOKEN UNSEAL_KEY VAULT_ADDR

Unset the environment variables in any other terminal session where you set them.

$ unset UNSEAL_KEY VAULT_ADDR

Usage tips

Here are some tips to keep in mind when using recovery mode in production.

Always have a recent snapshot available to restore from if you must revert any changes made in recovery mode.
Review the Recovery Mode documentation, which describes the required -recovery runtime configuration flag. You should refer to that documentation before configuring your Vault server startup script to start Vault in recovery mode.
When using the vault CLI, formatting output as JSON with the flag -format=json can often help with listing items which you need to iterate over.
Be sure to update your Vault server startup script to remove -recovery from the flags so that you can start the server for regular operation when recovery mode operation is complete.
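
As an example of iterating over listed items, the sketch below walks a sys/raw/ prefix recursively and prints every key. To avoid a dependency on jq it parses the plain vault list output shown earlier in this tutorial instead of -format=json. The function name walk_raw is hypothetical, and the function assumes VAULT_TOKEN holds your recovery mode operation token.

```shell
# Hypothetical helper: print every key under a sys/raw/ prefix, one per line.
# The ( ) body runs in a subshell, so each recursion level keeps its own
# variables.
walk_raw() (
  prefix="${1%/}"
  # Plain `vault list` output starts with a "Keys" line and a "----" line;
  # tail skips both and leaves one entry per line. Redirecting stdin keeps
  # nested calls from consuming the outer loop's input.
  vault list "$prefix" </dev/null | tail -n +3 | while IFS= read -r entry; do
    case "$entry" in
      */) walk_raw "$prefix/$entry" ;;        # trailing slash: descend
      *)  printf '%s\n' "$prefix/$entry" ;;   # leaf key: print full path
    esac
  done
)

# Example:
# walk_raw sys/raw/sys
```

This only lists and never writes, so it is safe to run while exploring.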

Summary

You learned how to operate a Vault server in recovery mode, and how to generate and use a recovery mode operation token.

You also learned how to examine information in the low level storage using the recovery mode operation token, with an emphasis on the caution around write operations.

Help and reference

Recover from lost quorum

Audit Vault with Elasticsearch
