Operate Vault in recovery mode
Challenge
In exceptional circumstances, you may need to troubleshoot issues with a Vault server, such as configuration changes that make it unavailable for general use.
Recovery via snapshot is not always a viable solution in such extreme cases, often because the root cause of the issue can prevent a Vault server from starting or servicing user requests.
Diagnosing and resolving such exceptional outage states can require that you access the storage at a low level that is impossible with a running Vault cluster.
Solution
Users of Vault version 1.3.0 or higher can operate Vault in recovery mode to troubleshoot and recover from some extreme circumstances when other methods are unavailable.
Recovery mode allows for direct low level interaction with raw portions of the internal storage for any supported storage type.
Vault limits recovery mode operation to list, read, delete and write operations against keys and values contained under the root path /sys/raw/.
While operating in recovery mode, Vault is not available for responding to standard user requests, and instead just provides the minimum functionality required for maintenance and recovery purposes.
You can learn more about operating Vault in recovery mode by following the lab in this tutorial.
Warning
Ensure you have a backup or snapshot of the Vault server data before using any of the information from this tutorial in a live setting.
Prerequisites
To perform the steps in this tutorial, you need Vault. The Community Edition is suitable for this tutorial.
The Vault foundations tutorials are a great starting point if you are not familiar with Vault.
Some examples use jq to format JSON output, but it is not strictly required.
Prepare environment
Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT.
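As a minimal sketch of this step, the snippet below lets mktemp choose a unique directory name. Using mktemp is an assumption of this example; any empty directory you control works equally well.

```shell
# Create a uniquely named scratch directory and export its path.
# Using mktemp is an assumption of this sketch, not a requirement.
LEARN_VAULT="$(mktemp -d)"
export LEARN_VAULT
echo "Scenario directory: $LEARN_VAULT"
```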
Write the example configuration
You will begin the scenario with the example configuration file, vault-server.hcl. Write it to the scenario home directory.
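The following is a sketch of what such a configuration might contain, not the tutorial's exact file. It assumes filesystem storage under $LEARN_VAULT/data, a listener on 127.0.0.1:8200 with TLS disabled, and disable_mlock for running as a non-root user; adjust these values to your environment.

```shell
# Fall back to a fresh temporary directory if LEARN_VAULT is not set yet.
export LEARN_VAULT="${LEARN_VAULT:-$(mktemp -d)}"

# Write a minimal single-node configuration. The unquoted EOF marker lets
# the shell expand $LEARN_VAULT inside the file.
cat > "$LEARN_VAULT/vault-server.hcl" <<EOF
# Assumed values for a local demo only; not for production use.
api_addr      = "http://127.0.0.1:8200"
disable_mlock = true

storage "file" {
  path = "$LEARN_VAULT/data"
}

listener "tcp" {
  address     = "127.0.0.1:8200"
  tls_disable = "true"
}
EOF
```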
Insecure operation
The listener stanza disables TLS (tls_disable = "true"). In production, Vault should always use TLS to enable secure communication between clients and the Vault server. TLS requires a certificate file and key file on each Vault host.
Start Vault server
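A possible way to start the server with the configuration file from the previous step is sketched below. The function name start_vault_server is hypothetical; the underlying command is vault server with the -config flag.

```shell
# Hypothetical wrapper, defined as a function so that pasting this snippet
# does not start a server by accident. The server runs in the foreground
# and holds the terminal until you stop it.
start_vault_server() {
  vault server -config="${LEARN_VAULT:?set LEARN_VAULT first}/vault-server.hcl"
}
```

Call start_vault_server in a dedicated terminal session and leave it running.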
Initialize, unseal, and login
In another terminal session, export the VAULT_ADDR environment variable to address the Vault server.
Initialize Vault, and write initialization output to the file named .vault_init in the temporary scenario directory specified by $LEARN_VAULT.
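These two steps might look like the sketch below. The address assumes a listener on 127.0.0.1:8200 without TLS, and init_vault is a hypothetical function name. The single key share and threshold are for demonstration only.

```shell
# Address of the local listener; assumes 127.0.0.1:8200 with TLS disabled.
export VAULT_ADDR="http://127.0.0.1:8200"

# Hypothetical wrapper: initialize with one key share and a threshold of
# one (demo only) and save the output, which contains the unseal key and
# the initial root token.
init_vault() {
  vault operator init -key-shares=1 -key-threshold=1 \
    > "${LEARN_VAULT:?set LEARN_VAULT first}/.vault_init"
}
```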
Insecure operation
Do not run an unsealed Vault in production with a single key share and a single key threshold. This approach is just used here to simplify the unsealing process for this demonstration.
Set the environment variable UNSEAL_KEY with the unseal key as its value.
Unseal Vault.
Set the environment variable ROOT_TOKEN to the value of the initial root token.
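One way to fill both variables is to parse the saved initialization output. The helper below is a hypothetical sketch; it assumes the default table output of vault operator init, where the relevant lines look like "Unseal Key 1: ..." and "Initial Root Token: ...".

```shell
# Hypothetical helper: print the last field of the first line in FILE
# that contains LABEL, for example "Unseal Key 1".
# Usage: init_value LABEL FILE
init_value() {
  awk -v label="$1" 'index($0, label) { print $NF; exit }' "$2"
}
```

For example, `export UNSEAL_KEY="$(init_value 'Unseal Key 1' "$LEARN_VAULT/.vault_init")"` sets the unseal key, which you can then pass to `vault operator unseal "$UNSEAL_KEY"`. The same call with the label 'Initial Root Token' yields the value for ROOT_TOKEN.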
Note
For the purpose of this tutorial, you can use the root token to work with Vault. However, you should use root tokens just for initial setup or in emergencies. As a best practice, use tokens with an appropriate set of policies based on your role in the organization.
Authenticate to Vault with the initial root token.
Confirm that you've authenticated to Vault with the initial root token by checking that the token has the root policy attached.
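The login and verification steps could be combined as in this sketch. The function name login_and_check is hypothetical, and it assumes ROOT_TOKEN already holds the initial root token.

```shell
# Hypothetical wrapper: log in without echoing token details, then show
# the policies line of the current token. A root token lists "root".
login_and_check() {
  vault login -no-print "${ROOT_TOKEN:?set ROOT_TOKEN first}" &&
    vault token lookup | grep policies
}
```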
You are now prepared to begin the scenario.
Scenario Introduction
To explore a Vault server running in recovery mode, you will perform the following:
- Run a Vault server using filesystem storage.
- Log in with the initial root token, enable an audit device, and enable resource quotas.
- Stop the Vault server.
- Start the server again in recovery mode.
- Generate a recovery mode token, and use that token to perform some basic examination of the storage items through the /sys/raw endpoint.
Enable file audit device and resource quota
You can enable some simple configuration in Vault and an audit device so that you get a better picture of data in Vault later through the lens of recovery mode.
Enable a file audit device with output to the file at $LEARN_VAULT/audit.log.
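A sketch of the audit device step follows. The function name enable_file_audit is hypothetical; the command itself is vault audit enable with the file device type and a file_path option.

```shell
# Hypothetical wrapper: enable a file audit device that writes to the
# scenario directory.
enable_file_audit() {
  vault audit enable file file_path="${LEARN_VAULT:?set LEARN_VAULT first}/audit.log"
}
```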
Enable a resource quota on the path sys/health to enforce rate limiting of response headers and audit logging.
You will examine this information later as an example of configuration that you can change while in recovery mode, for example, to unblock the server from an undesired behavior.
Output:
Stop Vault server
Return to the terminal session where you started the Vault server.
Press CTRL+C (or CTRL+BREAK on Windows) to stop the Vault server.
Start server in recovery mode
The /sys/raw API endpoint is not enabled by default. You must start the Vault server in recovery mode, then generate a recovery mode operation token to access the /sys/raw endpoint.
You will use that token for all operations in this scenario.
Start Vault server in recovery mode.
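As a sketch, starting in recovery mode means adding the -recovery flag to the same server command used earlier. The function name start_vault_recovery is hypothetical.

```shell
# Hypothetical wrapper: start the server in recovery mode with the same
# configuration file. It runs in the foreground until you stop it.
start_vault_recovery() {
  vault server -recovery -config="${LEARN_VAULT:?set LEARN_VAULT first}/vault-server.hcl"
}
```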
Notice from the output that the server is now running in recovery mode.
This same information would typically be present in the server logs of a production Vault.
Generate recovery mode operation token
All examples of querying the /sys/raw endpoint in this tutorial require a recovery mode operation token. You will generate one here with the vault CLI, using vault operator generate-root.
Return to the other terminal session where you first authenticated with Vault, and generate a one-time password (OTP).
Use the OTP value to initialize the token generation process.
Example output:
You must pass in a quorum of unseal or recovery keys as necessary to generate an encoded token. For this scenario, you pass in just the single unseal key value.
Set the environment variable UNSEAL_KEY with the unseal key as its value.
Generate the encoded token.
Successful output resembles this example, and includes the encoded token.
Decode the encoded token to generate the recovery mode operation token.
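The token generation steps above can be sketched as four small functions. The function names are hypothetical, and the sketch assumes that each vault operator generate-root call carries the -recovery-token flag so that Vault issues a recovery mode operation token rather than a root token. Check the flags against the CLI help for your Vault version.

```shell
# Step 1: print a one-time password (OTP).
recovery_otp() {
  vault operator generate-root -generate-otp -recovery-token
}

# Step 2: start the generation process with that OTP (first argument).
recovery_init() {
  vault operator generate-root -init -otp="$1" -recovery-token
}

# Step 3: supply a key share. With no key argument, the CLI prompts for
# the key and prints the encoded token once the threshold is met.
recovery_provide_key() {
  vault operator generate-root -recovery-token
}

# Step 4: decode the encoded token (first argument) with the same OTP
# (second argument).
recovery_decode() {
  vault operator generate-root -decode="$1" -otp="$2" -recovery-token
}
```

A typical sequence is `OTP="$(recovery_otp)"`, then `recovery_init "$OTP"`, then `recovery_provide_key`, and finally `recovery_decode "<encoded token>" "$OTP"` with the encoded token copied from the previous step.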
Example output:
Note that the returned token value has the prefix r, which designates it as a recovery mode operation token.
Use the value of this recovery mode operation token for all examples of listing and reading /sys/raw/... paths throughout the tutorial.
Examine storage paths
First, list the top-level sys/raw/ path.
While Vault encrypts all sensitive secret values, configuration information written to Vault without sensitive content gets stored as plaintext or JSON.
For example, you can find audit device information in the core/audit key, which itself holds a single key named value. You can read the key, and pass its value to jq for a prettier version.
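The list and read operations described above might be wrapped as in this sketch. The function names and the RECOVERY_TOKEN variable are hypothetical; the sketch assumes RECOVERY_TOKEN holds the decoded recovery mode operation token.

```shell
# Hypothetical helper: list keys under a sys/raw path. With no argument,
# it lists the top level.
raw_list() {
  VAULT_TOKEN="${RECOVERY_TOKEN:?set RECOVERY_TOKEN first}" vault list "sys/raw/${1:-}"
}

# Hypothetical helper: print only the "value" field stored at a sys/raw key.
raw_value() {
  VAULT_TOKEN="${RECOVERY_TOKEN:?set RECOVERY_TOKEN first}" vault read -field=value "sys/raw/$1"
}
```

For example, `raw_list` shows the top-level keys, and `raw_value core/audit | jq` prints the audit device configuration in a readable form if jq is installed.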
Example output:
This information corresponds precisely to the file-based audit device you enabled earlier.
Tip
When troubleshooting production Vault servers with blocked audit devices, listing this information helps you learn the target file, network port, or socket for the purposes of unblocking the device.
Now list the resource quotas path with vault list sys/raw/sys/quotas.
The returned keys contain the resource quota configuration for the quota you enabled earlier. Again, there is a single key named value containing the JSON configuration.
Example output:
The configuration details match what you wrote earlier in the enable resource quota step before starting Vault in recovery mode.
Most extreme troubleshooting scenarios that require recovery mode involve more than listing or reading keys and values. You typically also need to delete particular keys related to the functionality that is blocking operations.
Warning
Exercise extreme caution when using delete or write operations while in recovery mode. Always validate the key name and contents, and have a snapshot from a time before the modifications at hand before performing any operation that writes to the storage. Enterprise users can coordinate with HashiCorp Customer Success for help with this process.
Feel free to explore the other keys and values, and when you finish, you can clean up the scenario environment.
Cleanup
You can clean up from this scenario by following these steps.
- From the terminal session where the Vault server is running, press CTRL+C (or CTRL+BREAK on Windows) to stop the server.
- Remove the data created in the scenario.
- Unset environment variables.
- Unset environment variables in the other terminal.
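The removal and unset steps could look like the sketch below. It assumes the variable names used earlier in this tutorial and should be run in each terminal session that set them.

```shell
# Delete the scenario directory only when the variable is set, so an
# empty value can never expand to an unintended path.
if [ -n "${LEARN_VAULT:-}" ]; then
  rm -rf -- "$LEARN_VAULT"
fi

# Clear the scenario variables; unset is harmless if they are not set.
unset LEARN_VAULT VAULT_ADDR UNSEAL_KEY ROOT_TOKEN
```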
Usage tips
Here are some tips to keep in mind when using recovery mode in production.
- Always have a recent snapshot available to restore from if you must revert any changes made in recovery mode.
- Review the Recovery Mode documentation, which describes the required -recovery runtime configuration flag. You should refer to that documentation before configuring your Vault server startup script to start Vault in recovery mode.
- When using the vault CLI, formatting output as JSON with the flag -format=json can often help with listing items which you need to iterate over.
- Be sure to update your Vault server startup script to remove -recovery from the flags so that you can start the server for regular operation when recovery mode operation is complete.
Summary
You learned how to operate a Vault server in recovery mode, and how to generate and use a recovery mode operation token.
You also learned how to examine information in the low-level storage using the recovery mode operation token, with an emphasis on the caution around write operations.