PostgreSQL 9.0beta1 Documentation
Chapter 25. High Availability, Load Balancing, and Replication
Hot Standby is the term used to describe the ability to connect to the server and run read-only queries while the server is in archive recovery. This is useful for both log shipping replication and for restoring a backup to an exact state with great precision. The term Hot Standby also refers to the ability of the server to move from recovery through to normal operation while users continue running queries and/or keep their connections open.
Running queries in recovery mode is similar to normal query operation, though there are several usage and administrative differences noted below.
Users can connect to the database server while it is in recovery mode and perform read-only queries. Read-only access to system catalogs and views will also occur as normal.
The data on the standby takes some time to arrive from the primary server so there will be a measurable delay between primary and standby. Running the same query nearly simultaneously on both primary and standby might therefore return differing results. We say that data on the standby is eventually consistent with the primary. Queries executed on the standby will be correct with regard to the transactions that had been recovered at the start of the query, or start of first statement in the case of serializable transactions. In comparison with the primary, the standby returns query results that could have been obtained on the primary at some moment in the past.
When a transaction is started in recovery, the parameter transaction_read_only will be forced to be true, regardless of the default_transaction_read_only setting in postgresql.conf. It can't be manually set to false either. As a result, all transactions started during recovery will be limited to read-only actions. In all other ways, connected sessions will appear identical to sessions initiated during normal processing mode. There are no special commands required to initiate a connection so all interfaces work unchanged. After recovery finishes, the session will allow normal read-write transactions at the start of the next transaction, if these are requested.
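As a quick illustration, a session on the standby can verify this forced setting directly. This is a sketch; the exact error wording may vary between releases:

```sql
-- On the standby, while in recovery:
SHOW transaction_read_only;       -- "on", regardless of postgresql.conf
SET transaction_read_only = off;  -- rejected; the setting cannot be cleared in recovery
```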
"Read-only" above means no writes to the permanent or temporary database tables. There are no problems with queries that use transient sort and work files.
The following actions are allowed:
Query access - SELECT, COPY TO including views and SELECT rules
Cursor commands - DECLARE, FETCH, CLOSE
Parameters - SHOW, SET, RESET
Transaction management commands
BEGIN, END, ABORT, START TRANSACTION
SAVEPOINT, RELEASE, ROLLBACK TO SAVEPOINT
EXCEPTION blocks and other internal subtransactions
LOCK TABLE, though only when explicitly in one of these modes: ACCESS SHARE, ROW SHARE or ROW EXCLUSIVE.
Plans and resources - PREPARE, EXECUTE, DEALLOCATE, DISCARD
Plugins and extensions - LOAD
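A read-only session on the standby can exercise several of the allowed commands together. The catalog tables queried below are arbitrary choices, used only because they exist on every installation:

```sql
BEGIN;
DECLARE c CURSOR FOR SELECT relname FROM pg_class;
FETCH 10 FROM c;
CLOSE c;
SAVEPOINT s1;
LOCK TABLE pg_class IN ACCESS SHARE MODE;  -- one of the permitted lock modes
ROLLBACK TO SAVEPOINT s1;
PREPARE q AS SELECT count(*) FROM pg_tables;
EXECUTE q;
DEALLOCATE q;
COMMIT;
```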
These actions produce error messages:
Data Manipulation Language (DML) - INSERT, UPDATE, DELETE, COPY FROM, TRUNCATE. Note that there are no allowed actions that result in a trigger being executed during recovery.
Data Definition Language (DDL) - CREATE, DROP, ALTER, COMMENT. This applies to temporary tables as well, because currently their definition causes writes to catalog tables.
SELECT ... FOR SHARE | UPDATE which cause row locks to be written
Rules on SELECT statements that generate DML commands.
LOCK that explicitly requests a mode higher than ROW EXCLUSIVE MODE.
LOCK in short default form, since it requests ACCESS EXCLUSIVE MODE.
Transaction management commands that explicitly set non-read-only state:
BEGIN READ WRITE, START TRANSACTION READ WRITE
SET TRANSACTION READ WRITE, SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
SET transaction_read_only = off
Two-phase commit commands - PREPARE TRANSACTION, COMMIT PREPARED, ROLLBACK PREPARED because even read-only transactions need to write WAL in the prepare phase (the first phase of two phase commit).
Sequence updates - nextval(), setval()
LISTEN, UNLISTEN, NOTIFY
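For example, attempting any of the disallowed actions on the standby fails immediately. The table name below is hypothetical and the error text is indicative only:

```sql
INSERT INTO accounts VALUES (1);
-- ERROR:  cannot execute INSERT in a read-only transaction

BEGIN READ WRITE;
-- ERROR:  cannot set transaction read-write mode during recovery
```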
Note that the current behavior of read only transactions when not in recovery is to allow the last two actions, so there are small and subtle differences in behavior between read-only transactions run on a standby and run during normal operation. It is possible that LISTEN, UNLISTEN, and temporary tables might be allowed in a future release.
If failover or switchover occurs the database will switch to normal processing mode. Sessions will remain connected while the server changes mode. Current transactions will continue, though will remain read-only. After recovery is complete, it will be possible to initiate read-write transactions.
Users will be able to tell whether their session is read-only by issuing SHOW transaction_read_only. In addition, a set of functions (Table 9-57) allow users to access information about the standby server. These allow you to write programs that are aware of the current state of the database. These can be used to monitor the progress of recovery, or to allow you to write complex programs that restore the database to particular states.
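For example, the recovery-information functions from Table 9-57 can be queried directly on the standby. Note that the receive location is only meaningful when streaming replication is in use:

```sql
SELECT pg_is_in_recovery();              -- true while the server is in recovery
SELECT pg_last_xlog_receive_location(),  -- WAL received from the primary
       pg_last_xlog_replay_location();   -- WAL replayed so far
```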
In recovery, transactions will not be permitted to take any table lock higher than RowExclusiveLock. In addition, transactions may never assign a TransactionId and may never write WAL. Any LOCK TABLE command that runs on the standby and requests a specific lock mode higher than ROW EXCLUSIVE MODE will be rejected.
In general queries will not experience lock conflicts from the database changes made by recovery. This is because recovery follows normal concurrency control mechanisms, known as MVCC. There are some types of change that will cause conflicts, covered in the following section.
The primary and standby nodes are in many ways loosely connected. Actions on the primary will have an effect on the standby. As a result, there is potential for negative interactions or conflicts between them. The easiest conflict to understand is performance: if a huge data load is taking place on the primary then this will generate a similar stream of WAL records on the standby, so standby queries may contend for system resources, such as I/O.
There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard conflicts in the sense that queries might need to be cancelled and, in some cases, sessions disconnected to resolve them. The user is provided with several ways to handle these conflicts, though it is important to first understand the possible causes of conflicts:
Access Exclusive Locks from primary node, including both explicit LOCK commands and various DDL actions
Dropping tablespaces on the primary while standby queries are using those tablespaces for temporary work files (work_mem overflow)
Dropping databases on the primary while users are connected to that database on the standby.
The standby waiting longer than max_standby_delay to acquire a buffer cleanup lock.
Early cleanup of data still visible to the current query's snapshot
Some WAL redo actions will be for DDL execution. These DDL actions are replaying changes that have already committed on the primary node, so they must not fail on the standby node. These DDL locks take priority and will automatically *cancel* any read-only transactions that get in their way, after a grace period. This is similar to the possibility of being canceled by the deadlock detector. But in this case, the standby recovery process always wins, since the replayed actions must not fail. This also ensures that replication does not fall behind while waiting for a query to complete. This prioritization presumes that the standby exists primarily for high availability, and that adjusting the grace period will allow a sufficient guard against unexpected cancellation.
An example of the above would be an administrator on the primary server running DROP TABLE on a table that is currently being queried on the standby server. Clearly the query cannot continue if DROP TABLE proceeds. If this situation occurred on the primary, the DROP TABLE would wait until the query had finished. When DROP TABLE is run on the primary, the primary doesn't have information about which queries are running on the standby, so it cannot wait for any of the standby queries. The WAL change records come through to the standby while the standby query is still running, causing a conflict.
The most common reason for conflict between standby queries and WAL redo is "early cleanup". Normally, PostgreSQL allows cleanup of old row versions when there are no users who need to see them to ensure correct visibility of data (the heart of MVCC). If there is a standby query that has been running for longer than any query on the primary then it is possible for old row versions to be removed by either a vacuum or HOT. This will then generate WAL records that, if applied, would remove data on the standby that might potentially be required by the standby query. In more technical language, the primary's xmin horizon is later than the standby's xmin horizon, allowing dead rows to be removed.
Experienced users should note that both row version cleanup and row version freezing will potentially conflict with recovery queries. Running a manual VACUUM FREEZE is likely to cause conflicts even on tables with no updated or deleted rows.
There are a number of choices for resolving query conflicts. The default is to wait and hope the query finishes. The server will wait automatically until the lag between primary and standby is at most max_standby_delay seconds. Once that grace period expires, one of the following actions is taken:
If the conflict is caused by a lock, the conflicting standby transaction is cancelled immediately. If the transaction is idle-in-transaction, then the session is aborted instead. This behavior might change in the future.
If the conflict is caused by cleanup records, the standby query is informed a conflict has occurred and that it must cancel itself to avoid the risk that it silently fails to read relevant data because that data has been removed. (This is regrettably similar to the much feared and iconic error message "snapshot too old"). Some cleanup records only conflict with older queries, while others can affect all queries.
If cancellation does occur, the query and/or transaction can always be re-executed. The error is dynamic and will not necessarily reoccur if the query is executed again.
max_standby_delay is set in postgresql.conf. The parameter applies to the server as a whole, so if the delay is consumed by a single query then there may be little or no waiting for queries that follow, though they will have benefited equally from the initial waiting period. The server may take time to catch up again before the grace period is available again, though if there is a heavy and constant stream of conflicts it may seldom catch up fully.
Users should be clear that tables that are regularly and heavily updated on the primary server will quickly cause cancellation of longer running queries on the standby. In those cases max_standby_delay can be considered similar to setting statement_timeout.
Other remedial actions exist if the number of cancellations is unacceptable.
The first option is to connect to the primary server and keep a query active for as long as needed to run queries on the standby. This guarantees that a WAL cleanup record is never generated and query conflicts do not occur, as described above. This could be done using contrib/dblink and pg_sleep(), or via other mechanisms. If you do this, note that it will delay cleanup of dead rows on the primary by vacuum or HOT, which some users might find undesirable. However, remember that the primary and standby nodes are linked via the WAL, so the cleanup situation is no different from the case where the query ran on the primary node itself. And you are still getting the benefit of off-loading the execution onto the standby. max_standby_delay should not be used in this case because delayed WAL files might already contain entries that invalidate the current snapshot.
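A minimal sketch of this approach, run from a session on the standby. The connection string and the one-hour hold are placeholder values you would adapt:

```sql
-- Hold a snapshot open on the primary; while this call is blocked, the
-- primary cannot clean up row versions the standby query may still need.
SELECT * FROM dblink('host=primary dbname=mydb',
                     'SELECT 1 FROM pg_sleep(3600)') AS t(x int);
```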
It is also possible to set vacuum_defer_cleanup_age on the primary to defer the cleanup of records by autovacuum, VACUUM and HOT. This might allow more time for queries to execute before they are cancelled on the standby, without the need for setting a high max_standby_delay.
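For example, in postgresql.conf on the primary; the value shown is purely illustrative and is measured in transactions:

```
vacuum_defer_cleanup_age = 10000   # defer cleanup of the most recent 10000 transactions
```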
Three-way deadlocks are possible between AccessExclusiveLocks arriving from the primary, cleanup WAL records that require buffer cleanup locks, and user requests that are waiting behind replayed AccessExclusiveLocks. Deadlocks are resolved immediately, should they occur, though they are thought to be rare in practice.
Dropping tablespaces or databases is discussed in the administrator's section since they are not typical user situations.
If hot_standby is turned on in postgresql.conf and there is a recovery.conf file present, the server will run in Hot Standby mode. However, it may take some time for Hot Standby connections to be allowed, because the server will not accept connections until it has completed sufficient recovery to provide a consistent state against which queries can run. During this period, clients that attempt to connect will be refused with an error message. To confirm the server has come up, either loop trying to connect from the application, or look for these messages in the server logs:
LOG:  entering standby mode
... then some time later ...
LOG:  consistent recovery state reached
LOG:  database system is ready to accept read only connections
Consistency information is recorded once per checkpoint on the primary. It is not possible to enable hot standby when reading WAL written during a period when wal_level was not set to hot_standby on the primary. Reaching a consistent state can also be delayed in the presence of both of these conditions:
A write transaction with more than 64 subtransactions
Very long-lived write transactions
If you are running file-based log shipping ("warm standby"), you might need to wait until the next WAL file arrives, which could be as long as the archive_timeout setting on the primary.
The setting of some parameters on the standby will need reconfiguration if they have been changed on the primary. For these parameters, the value on the standby must be equal to or greater than the value on the primary. If these parameters are not set high enough then the standby will refuse to start. Higher values can then be supplied and the server restarted to begin recovery again. These parameters are:
max_connections
max_prepared_transactions
max_locks_per_transaction
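For example, if the primary runs with the settings shown below, the standby's postgresql.conf must use values at least as large. The numbers themselves are illustrative:

```
max_connections = 100
max_prepared_transactions = 10
max_locks_per_transaction = 64
```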
It is important that the administrator consider the appropriate setting of max_standby_delay, which can be set in postgresql.conf. There is no optimal setting, so it should be set according to business priorities. For example if the server is primarily tasked as a High Availability server, then you may wish to lower max_standby_delay or even set it to zero, though that is a very aggressive setting. If the standby server is tasked as an additional server for decision support queries then it might be acceptable to set this to a value of many hours (in seconds). It is also possible to set max_standby_delay to -1 which means wait forever for queries to complete; this will be useful when performing an archive recovery from a backup.
Transaction status "hint bits" written on the primary are not WAL-logged, so the standby will likely re-write the hints on its own copy of the data. Thus, the standby server will still perform disk writes even though all users are read-only; no changes occur to the data values themselves. Users will still write large sort temporary files and re-generate relcache info files, so no part of the database is truly read-only during hot standby mode. Note also that writes to remote databases will still be possible, even though the transaction is read-only locally.
The following types of administration commands are not accepted during recovery mode:
Data Definition Language (DDL) - e.g. CREATE INDEX
Privilege and Ownership - GRANT, REVOKE, REASSIGN
Maintenance commands - ANALYZE, VACUUM, CLUSTER, REINDEX
Again, note that some of these commands are actually allowed during "read only" mode transactions on the primary.
As a result, you cannot create additional indexes that exist solely on the standby, nor statistics that exist solely on the standby. If these administration commands are needed, they should be executed on the primary, and eventually those changes will propagate to the standby.
pg_cancel_backend() will work on user backends, but not the Startup process, which performs recovery. pg_stat_activity does not show an entry for the Startup process, nor do recovering transactions show as active. As a result, pg_prepared_xacts is always empty during recovery. If you wish to resolve in-doubt prepared transactions, view pg_prepared_xacts on the primary and issue commands to resolve transactions there.
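For example, on the primary; the GID shown is a placeholder:

```sql
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
COMMIT PREPARED 'my_gid';    -- or: ROLLBACK PREPARED 'my_gid'
```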
pg_locks will show locks held by backends, as normal. pg_locks also shows a virtual transaction managed by the Startup process that owns all AccessExclusiveLocks held by transactions being replayed by recovery. Note that the Startup process does not acquire locks to make database changes, and thus locks other than AccessExclusiveLocks do not show in pg_locks for the Startup process; they are just presumed to exist.
The Nagios plugin check_pgsql will work, because the simple information it checks for exists. The check_postgres monitoring script will also work, though some reported values could give different or confusing results. For example, last vacuum time will not be maintained, since no vacuum occurs on the standby. Vacuums running on the primary do still send their changes to the standby.
WAL file control commands will not work during recovery, e.g. pg_start_backup, pg_switch_xlog, etc.
Dynamically loadable modules work, including pg_stat_statements.
Advisory locks work normally in recovery, including deadlock detection. Note that advisory locks are never WAL logged, so it is impossible for an advisory lock on either the primary or the standby to conflict with WAL replay. Nor is it possible to acquire an advisory lock on the primary and have it initiate a similar advisory lock on the standby. Advisory locks relate only to the server on which they are acquired.
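For example, a standby session can take and release an advisory lock in the usual way; the key value is arbitrary and the lock is visible only on the server where it was acquired:

```sql
SELECT pg_advisory_lock(12345);
SELECT pg_advisory_unlock(12345);
```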
Trigger-based replication systems such as Slony, Londiste and Bucardo won't run on the standby at all, though they will run happily on the primary server as long as the changes are not sent to standby servers to be applied. WAL replay is not trigger-based so you cannot relay from the standby to any system that requires additional database writes or relies on the use of triggers.
New OIDs cannot be assigned, though some UUID generators may still work as long as they do not rely on writing new status to the database.
Currently, temporary table creation is not allowed during read only transactions, so in some cases existing scripts will not run correctly. This restriction might be relaxed in a later release. This is both a SQL Standard compliance issue and a technical issue.
DROP TABLESPACE can only succeed if the tablespace is empty. Some standby users may be actively using the tablespace via their temp_tablespaces parameter. If there are temporary files in the tablespace, all active queries are cancelled to ensure that temporary files are removed, so the tablespace can be removed and WAL replay can continue.
Running DROP DATABASE, ALTER DATABASE ... SET TABLESPACE, or ALTER DATABASE ... RENAME on the primary will generate a log message that will cause all users connected to that database on the standby to be forcibly disconnected. This action occurs immediately, whatever the setting of max_standby_delay.
In normal (non-recovery) mode, if you issue DROP USER or DROP ROLE for a role with login capability while that user is still connected then nothing happens to the connected user - they remain connected. The user cannot reconnect however. This behavior applies in recovery also, so a DROP USER on the primary does not disconnect that user on the standby.
The statistics collector is active during recovery. All scans, reads, blocks, index usage, etc., will be recorded normally on the standby. Replayed actions will not duplicate their effects on the primary, so replaying an insert will not increment the Inserts column of pg_stat_user_tables. The stats file is deleted at the start of recovery, so stats from primary and standby will differ; this is considered a feature, not a bug.
Autovacuum is not active during recovery; it will start normally at the end of recovery.
The background writer is active during recovery and will perform restartpoints (similar to checkpoints on the primary) and normal block cleaning activities. This can include updates of the hint bit information stored on the standby server. The CHECKPOINT command is accepted during recovery, though it performs a restartpoint rather than a new checkpoint.
Various parameters have been mentioned above in Section 25.5.3 and Section 25.5.2.
On the primary, parameters wal_level and vacuum_defer_cleanup_age can be used. max_standby_delay has no effect if set on the primary.
On the standby, parameters hot_standby and max_standby_delay can be used. vacuum_defer_cleanup_age has no effect during recovery.
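Putting these together, a minimal configuration sketch; all values other than the parameter names are illustrative:

```
# postgresql.conf on the primary
wal_level = hot_standby
vacuum_defer_cleanup_age = 0     # raise to defer cleanup, if desired

# postgresql.conf on the standby
hot_standby = on
max_standby_delay = 30           # seconds; -1 waits forever, 0 is very aggressive
```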
There are several limitations of Hot Standby. These can and probably will be fixed in future releases:
Operations on hash indexes are not presently WAL-logged, so replay will not update these indexes. Hash indexes will not be used for query plans during recovery.
Full knowledge of running transactions is required before snapshots can be taken. Transactions that use large numbers of subtransactions (currently greater than 64) will delay the start of read only connections until the completion of the longest running write transaction. If this situation occurs, explanatory messages will be sent to the server log.
Valid starting points for standby queries are generated at each checkpoint on the master. If the standby is shut down while the master is in a shutdown state, it might not be possible to re-enter Hot Standby until the primary is started up, so that it generates further starting points in the WAL logs. This situation isn't a problem in the most common situations where it might happen. Generally, if the primary is shut down and not available anymore, that's likely due to a serious failure that requires the standby being converted to operate as the new primary anyway. And in situations where the primary is being intentionally taken down, coordinating to make sure the standby becomes the new primary smoothly is also standard procedure.
At the end of recovery, AccessExclusiveLocks held by prepared transactions will require twice the normal number of lock table entries. If you plan on running either a large number of concurrent prepared transactions that normally take AccessExclusiveLocks, or you plan on having one large transaction that takes many AccessExclusiveLocks, you are advised to select a larger value of max_locks_per_transaction, perhaps as much as twice the value of the parameter on the primary server. You need not consider this at all if your setting of max_prepared_transactions is 0.