WAL documentation, from Oliver Elphick and Vadim Mikheev.
parent
43bac8406a
commit
7b9dc71405
@ -1,5 +1,5 @@
|
|||||||
<!--
|
<!--
|
||||||
$Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.30 2001/01/24 19:42:46 momjian Exp $
|
$Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.31 2001/01/24 23:15:19 petere Exp $
|
||||||
-->
|
-->
|
||||||
|
|
||||||
<book id="admin">
|
<book id="admin">
|
||||||
@ -58,6 +58,7 @@ $Header: /cvsroot/pgsql/doc/src/sgml/Attic/admin.sgml,v 1.30 2001/01/24 19:42:46
|
|||||||
&manage-ag;
|
&manage-ag;
|
||||||
&user-manag;
|
&user-manag;
|
||||||
&backup;
|
&backup;
|
||||||
|
&wal;
|
||||||
&recovery;
|
&recovery;
|
||||||
&regress;
|
&regress;
|
||||||
&release;
|
&release;
|
||||||
|
@ -1,4 +1,4 @@
|
|||||||
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/filelist.sgml,v 1.5 2001/01/22 23:34:32 petere Exp $ -->
|
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/filelist.sgml,v 1.6 2001/01/24 23:15:19 petere Exp $ -->
|
||||||
|
|
||||||
<!entity about SYSTEM "about.sgml">
|
<!entity about SYSTEM "about.sgml">
|
||||||
<!entity history SYSTEM "history.sgml">
|
<!entity history SYSTEM "history.sgml">
|
||||||
@ -54,6 +54,7 @@
|
|||||||
<!entity release SYSTEM "release.sgml">
|
<!entity release SYSTEM "release.sgml">
|
||||||
<!entity runtime SYSTEM "runtime.sgml">
|
<!entity runtime SYSTEM "runtime.sgml">
|
||||||
<!entity user-manag SYSTEM "user-manag.sgml">
|
<!entity user-manag SYSTEM "user-manag.sgml">
|
||||||
|
<!entity wal SYSTEM "wal.sgml">
|
||||||
|
|
||||||
<!-- programmer's guide -->
|
<!-- programmer's guide -->
|
||||||
<!entity arch-pg SYSTEM "arch-pg.sgml">
|
<!entity arch-pg SYSTEM "arch-pg.sgml">
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
<!--
|
<!--
|
||||||
$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.47 2001/01/24 15:19:36 momjian Exp $
|
$Header: /cvsroot/pgsql/doc/src/sgml/runtime.sgml,v 1.48 2001/01/24 23:15:19 petere Exp $
|
||||||
-->
|
-->
|
||||||
|
|
||||||
<Chapter Id="runtime">
|
<Chapter Id="runtime">
|
||||||
@ -1159,6 +1159,57 @@ env PGOPTIONS='-c geqo=off' psql
|
|||||||
</para>
|
</para>
|
||||||
</sect2>
|
</sect2>
|
||||||
|
|
||||||
|
<sect2 id="runtime-config-wal">
|
||||||
|
<title>WAL</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
See also <xref linkend="wal-configuration"> for details on WAL
|
||||||
|
tuning.
|
||||||
|
|
||||||
|
<variablelist>
|
||||||
|
<varlistentry>
|
||||||
|
<term>CHECKPOINT_TIMEOUT (<type>integer</type>)</term>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Maximum interval between automatic WAL checkpoints, in seconds.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>WAL_BUFFERS (<type>integer</type>)</term>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Number of buffers for WAL. This option can only be set at
|
||||||
|
server start.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>WAL_DEBUG (<type>integer</type>)</term>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
If non-zero, turn on WAL-related debugging output on standard
|
||||||
|
error.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
|
||||||
|
<varlistentry>
|
||||||
|
<term>WAL_FILES (<type>integer</type>)</term>
|
||||||
|
<listitem>
|
||||||
|
<para>
|
||||||
|
Number of log files that are created in advance at checkpoint
|
||||||
|
time. This option can only be set at server start.
|
||||||
|
</para>
|
||||||
|
</listitem>
|
||||||
|
</varlistentry>
|
||||||
|
</variablelist>
|
||||||
|
</para>
|
||||||
|
</sect2>
|
||||||
|
|
||||||
|
|
||||||
<sect2 id="runtime-config-short">
|
<sect2 id="runtime-config-short">
|
||||||
<title>Short options</title>
|
<title>Short options</title>
|
||||||
<para>
|
<para>
|
||||||
|
321
doc/src/sgml/wal.sgml
Normal file
@ -0,0 +1,321 @@
|
|||||||
|
<!-- $Header: /cvsroot/pgsql/doc/src/sgml/wal.sgml,v 1.1 2001/01/24 23:15:19 petere Exp $ -->
|
||||||
|
|
||||||
|
<chapter id="wal">
|
||||||
|
<title>Write-Ahead Logging (<acronym>WAL</acronym>)</title>
|
||||||
|
|
||||||
|
<note>
|
||||||
|
<title>Author</title>
|
||||||
|
<para>
|
||||||
|
Vadim Mikheev and Oliver Elphick
|
||||||
|
</para>
|
||||||
|
</note>
|
||||||
|
|
||||||
|
<sect1 id="wal-general">
|
||||||
|
<title>General Description</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
<firstterm>Write-Ahead Logging</firstterm> (<acronym>WAL</acronym>)
|
||||||
|
is a standard approach to transaction logging. Its detailed
|
||||||
|
description may be found in most (if not all) books about
|
||||||
|
transaction processing. Briefly, <acronym>WAL</acronym>'s central
|
||||||
|
concept is that changes to data files (where tables and indices
|
||||||
|
reside) must be written only after those changes have been logged,
|
||||||
|
that is, when log records have been flushed to permanent
|
||||||
|
storage. When we follow this procedure, we do not need to flush
|
||||||
|
data pages to disk on every transaction commit, because we know
|
||||||
|
that in the event of a crash we will be able to recover the
|
||||||
|
database using the log: any changes that have not been applied to
|
||||||
|
the data pages will first be redone from the log records (this is
|
||||||
|
roll-forward recovery, also known as REDO) and then changes made by
|
||||||
|
uncommitted transactions will be removed from the data pages
|
||||||
|
(roll-backward recovery, also known as UNDO).
|
||||||
|
</para>
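<para>
The log-first rule and roll-forward recovery can be illustrated with a
toy model. The sketch below is not <productname>PostgreSQL</productname>
code: the class <literal>ToyWAL</literal>, its JSON record format and
the function <literal>demo</literal> are hypothetical names chosen for
this example. It only shows that once every change has been appended
to the log and synced, the data can be rebuilt from the log alone
after a simulated crash.
</para>
<programlisting>
```python
import json
import os
import tempfile


class ToyWAL:
    """Hypothetical miniature log: append a record and fsync it first."""

    def __init__(self, path):
        self.path = path

    def log(self, key, value):
        # Append the change to the log and force it to permanent storage
        # before any data page would be allowed to change.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def redo(self):
        # Roll forward: replay every logged change in order.
        data = {}
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    record = json.loads(line)
                    data[record["key"]] = record["value"]
        return data


def demo():
    with tempfile.TemporaryDirectory() as d:
        wal = ToyWAL(os.path.join(d, "toy.log"))
        wal.log("a", 1)
        wal.log("b", 2)
        wal.log("a", 3)
        # Simulated crash: the in-memory "data pages" were never written.
        # REDO rebuilds them from the log alone.
        return ToyWAL(wal.path).redo()
```
</programlisting>
<para>
UNDO is deliberately left out of the sketch; it would require the log
to record which transaction made each change and whether that
transaction committed.
</para>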
|
||||||
|
|
||||||
|
<sect2 id="wal-benefits-now">
|
||||||
|
<title>Immediate Benefits of <acronym>WAL</acronym></title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The first obvious benefit of using <acronym>WAL</acronym> is a
|
||||||
|
significantly reduced number of disk writes, since only the log
|
||||||
|
file needs to be flushed to disk at the time of transaction
|
||||||
|
commit; in multi-user environments, commits of many transactions
|
||||||
|
may be accomplished with a single <function>fsync()</function> of
|
||||||
|
the log file. Furthermore, the log file is written sequentially,
|
||||||
|
and so the cost of syncing the log is much less than the cost of
|
||||||
|
flushing the data pages.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The next benefit is consistency of the data pages. The truth is
|
||||||
|
that, before <acronym>WAL</acronym>,
|
||||||
|
<productname>PostgreSQL</productname> was never able to guarantee
|
||||||
|
consistency in the case of a crash. Before
|
||||||
|
<acronym>WAL</acronym>, any crash during writing could result in:
|
||||||
|
|
||||||
|
<orderedlist>
|
||||||
|
<listitem>
|
||||||
|
<simpara>index tuples pointing to non-existent table rows</simpara>
|
||||||
|
</listitem>
|
||||||
|
|
||||||
|
<listitem>
|
||||||
|
<simpara>index tuples lost in split operations</simpara>
|
||||||
|
</listitem>
|
||||||
|
|
||||||
|
<listitem>
|
||||||
|
<simpara>totally corrupted table or index page content, because
|
||||||
|
of partially written data pages</simpara>
|
||||||
|
</listitem>
|
||||||
|
</orderedlist>
|
||||||
|
|
||||||
|
Problems with indices (problems 1 and 2) could possibly have been
|
||||||
|
fixed by additional <function>fsync()</function> calls, but it is
|
||||||
|
not obvious how to handle the last case without
|
||||||
|
<acronym>WAL</acronym>; <acronym>WAL</acronym> saves the entire
|
||||||
|
data page content in the log if that is required to ensure page
|
||||||
|
consistency for after-crash recovery.
|
||||||
|
</para>
|
||||||
|
</sect2>
|
||||||
|
|
||||||
|
<sect2 id="wal-benefits-later">
|
||||||
|
<title>Future Benefits</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
In this first release of <acronym>WAL</acronym>, UNDO operation is
|
||||||
|
not implemented, because of lack of time. This means that changes
|
||||||
|
made by aborted transactions will still occupy disk space and that
|
||||||
|
we still need a permanent <filename>pg_log</filename> file to hold
|
||||||
|
the status of transactions, since we are not able to re-use
|
||||||
|
transaction identifiers. Once UNDO is implemented,
|
||||||
|
<filename>pg_log</filename> will no longer be required to be
|
||||||
|
permanent; it will be possible to remove
|
||||||
|
<filename>pg_log</filename> at shutdown, split it into segments
|
||||||
|
and remove old segments.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
With UNDO, it will also be possible to implement
|
||||||
|
<firstterm>savepoints</firstterm> to allow partial rollback of
|
||||||
|
invalid transaction operations (parser errors caused by mistyping
|
||||||
|
commands, insertion of duplicate primary/unique keys and so on)
|
||||||
|
with the ability to continue or commit valid operations made by
|
||||||
|
the transaction before the error. At present, any error will
|
||||||
|
invalidate the whole transaction and require a transaction abort.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
<acronym>WAL</acronym> offers the opportunity for a new method for
|
||||||
|
database on-line backup and restore (<acronym>BAR</acronym>). To
|
||||||
|
use this method, one would have to make periodic saves of data
|
||||||
|
files to another disk, a tape or another host and also archive the
|
||||||
|
<acronym>WAL</acronym> log files. The database file copy and the
|
||||||
|
archived log files could be used to restore just as if one were
|
||||||
|
restoring after a crash. Each time a new database file copy was
|
||||||
|
made the old log files could be removed. Implementing this
|
||||||
|
facility will require the logging of data file and index creation
|
||||||
|
and deletion; it will also require development of a method for
|
||||||
|
copying the data files (operating system copy commands are not
|
||||||
|
suitable).
|
||||||
|
</para>
|
||||||
|
</sect2>
|
||||||
|
</sect1>
|
||||||
|
|
||||||
|
<sect1 id="wal-implementation">
|
||||||
|
<title>Implementation</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
<acronym>WAL</acronym> is automatically enabled from release 7.1
|
||||||
|
onwards. No action is required from the administrator with the
|
||||||
|
exception of ensuring that the additional disk-space requirements
|
||||||
|
of the <acronym>WAL</acronym> logs are met, and that any necessary
|
||||||
|
tuning is done (see <xref linkend="wal-configuration">).
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
<acronym>WAL</acronym> logs are stored in the directory
|
||||||
|
<Filename><replaceable>$PGDATA</replaceable>/pg_xlog</Filename>, as
|
||||||
|
a set of segment files, each 16 MB in size. Each segment is
|
||||||
|
divided into 8 kB pages. The log record headers are described in
|
||||||
|
<filename>access/xlog.h</filename>; record content is dependent on
|
||||||
|
the type of event that is being logged. Segment files are given
|
||||||
|
sequential numbers as names, starting at
|
||||||
|
<filename>0000000000000000</filename>. The numbers do not wrap, at
|
||||||
|
present, but it should take a very long time to exhaust the
|
||||||
|
available stock of numbers.
|
||||||
|
</para>
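<para>
The sizes above imply a fixed number of pages per segment. The
following arithmetic sketch assumes that 16 MB means 16 * 1024 * 1024
bytes and that 8 kB means 8 * 1024 bytes, and it treats the log as one
flat byte stream, which is a simplification made for this example.
The function names are hypothetical.
</para>
<programlisting>
```python
SEGMENT_BYTES = 16 * 1024 * 1024  # assumed: 16 MB per segment file
PAGE_BYTES = 8 * 1024             # assumed: 8 kB per log page


def pages_per_segment():
    """Number of log pages that fit into one segment file."""
    return SEGMENT_BYTES // PAGE_BYTES


def segment_and_page(offset):
    """Map a byte offset in the log to (segment number, page in segment)."""
    segment, within = divmod(offset, SEGMENT_BYTES)
    return segment, within // PAGE_BYTES
```
</programlisting>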
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The <acronym>WAL</acronym> buffers and control structure are in
|
||||||
|
shared memory, and are handled by the backends; they are protected
|
||||||
|
by spinlocks. The demand on shared memory is dependent on the
|
||||||
|
number of buffers; the default size of the <acronym>WAL</acronym>
|
||||||
|
buffers is 64 kB.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
It is advantageous if the log is located on a different disk from the
|
||||||
|
main database files. This may be achieved by moving the directory,
|
||||||
|
<filename>pg_xlog</filename>, to another location (while the
|
||||||
|
postmaster is shut down, of course) and creating a symbolic link
|
||||||
|
from the original location in <replaceable>$PGDATA</replaceable> to
|
||||||
|
the new location.
|
||||||
|
</para>
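<para>
The relocation described above could be scripted roughly as follows.
This is a minimal sketch, not a supported tool: the function name
<literal>relocate_xlog</literal> and its arguments are hypothetical,
and it must only be run while the postmaster is shut down.
</para>
<programlisting>
```python
import os
import shutil


def relocate_xlog(pgdata, new_parent):
    """Move PGDATA/pg_xlog under new_parent and leave a symbolic link behind.

    Run this only while the postmaster is shut down.
    """
    old = os.path.join(pgdata, "pg_xlog")
    new = os.path.abspath(os.path.join(new_parent, "pg_xlog"))
    if os.path.islink(old):
        raise RuntimeError("pg_xlog is already a symbolic link")
    if os.path.exists(new):
        raise RuntimeError("target location already exists")
    # shutil.move copies and deletes when the target is on another disk.
    shutil.move(old, new)
    # Link from the original location in PGDATA to the new location.
    os.symlink(new, old, target_is_directory=True)
    return new
```
</programlisting>
<para>
The two checks before the move are a design choice for the sketch:
they refuse to run twice and refuse to merge into an existing
directory, since either case would leave the log in an unclear state.
</para>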
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The aim of <acronym>WAL</acronym>, to ensure that the log is
|
||||||
|
written before database records are altered, may be subverted by
|
||||||
|
disk drives that falsely report a successful write to the kernel,
|
||||||
|
when, in fact, they have only cached the data and not yet stored it
|
||||||
|
on the disk. A power failure in such a situation may still lead to
|
||||||
|
irrecoverable data corruption; administrators should try to ensure
|
||||||
|
that disks holding <productname>PostgreSQL</productname>'s data and
|
||||||
|
log files do not make such false reports.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<sect2 id="wal-recovery">
|
||||||
|
<title>Database Recovery with <acronym>WAL</acronym></title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
After a checkpoint has been made and the log flushed, the
|
||||||
|
checkpoint's position is saved in the file
|
||||||
|
<filename>pg_control</filename>. Therefore, when recovery is to be
|
||||||
|
done, the backend first reads <filename>pg_control</filename> and
|
||||||
|
then the checkpoint record; next it reads the redo record, whose
|
||||||
|
position is saved in the checkpoint, and begins the REDO operation.
|
||||||
|
Because the entire content of the pages is saved in the log on the
|
||||||
|
first page modification after a checkpoint, the pages will be first
|
||||||
|
restored to a consistent state.
|
||||||
|
</para>
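<para>
The order of steps can be sketched with plain data structures. The
example below is a toy model only: the record layout, the dictionary
standing in for <filename>pg_control</filename> and the function
<literal>recover</literal> are assumptions made for illustration and
do not reflect the actual on-disk formats.
</para>
<programlisting>
```python
def recover(control, log, pages):
    """Toy REDO pass.

    control -- dict holding the index of the last checkpoint record
    log     -- list of record dicts, oldest first
    pages   -- dict mapping page number to page content (a dict)
    """
    # Step 1: the control data gives the checkpoint's position.
    checkpoint = log[control["checkpoint"]]
    # Step 2: the checkpoint record gives the position of the redo record.
    start = checkpoint["redo"]
    # Step 3: replay everything from the redo record onwards.
    for record in log[start:]:
        if record["type"] == "page_image":
            # A full page image overwrites whatever is on disk,
            # including a partially written page.
            pages[record["page"]] = dict(record["content"])
        elif record["type"] == "update":
            pages[record["page"]][record["key"]] = record["value"]
    return pages
```
</programlisting>
<para>
Records located before the redo record are never read, which is why
the segments that contain only such records can be discarded after a
checkpoint.
</para>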
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Using <filename>pg_control</filename> to get the checkpoint
|
||||||
|
position speeds up the recovery process, but to handle possible
|
||||||
|
corruption of <filename>pg_control</filename>, we should actually
|
||||||
|
implement the reading of existing log segments in reverse order --
|
||||||
|
newest to oldest -- in order to find the last checkpoint. This has
|
||||||
|
not yet been done in release 7.1.
|
||||||
|
</para>
|
||||||
|
</sect2>
|
||||||
|
</sect1>
|
||||||
|
|
||||||
|
<sect1 id="wal-configuration">
|
||||||
|
<title><acronym>WAL</acronym> Configuration</title>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
There are several <acronym>WAL</acronym>-related parameters that
|
||||||
|
affect database performance. This section explains their use.
|
||||||
|
Consult <xref linkend="runtime-config"> for details about setting
|
||||||
|
configuration parameters.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
There are two commonly used <acronym>WAL</acronym> functions:
|
||||||
|
<function>LogInsert</function> and <function>LogFlush</function>.
|
||||||
|
<function>LogInsert</function> is used to place a new record into
|
||||||
|
the <acronym>WAL</acronym> buffers in shared memory. If there is no
|
||||||
|
space for the new record, <function>LogInsert</function> will have
|
||||||
|
to write (move to kernel cache) a few filled <acronym>WAL</acronym>
|
||||||
|
buffers. This is undesirable because <function>LogInsert</function>
|
||||||
|
is used on every low-level database modification (for example,
|
||||||
|
tuple insertion) at a time when an exclusive lock is held on
|
||||||
|
affected data pages and the operation is supposed to be as fast as
|
||||||
|
possible; what is worse, writing <acronym>WAL</acronym> buffers may
|
||||||
|
also cause the creation of a new log segment, which takes even more
|
||||||
|
time. Normally, <acronym>WAL</acronym> buffers should be written
|
||||||
|
and flushed by a <function>LogFlush</function> request, which is
|
||||||
|
made, for the most part, at transaction commit time to ensure that
|
||||||
|
transaction records are flushed to permanent storage. On systems
|
||||||
|
with high log output, <function>LogFlush</function> requests may
|
||||||
|
not occur often enough to prevent <acronym>WAL</acronym> buffers
|
||||||
|
being written by <function>LogInsert</function>. On such systems
|
||||||
|
one should increase the number of <acronym>WAL</acronym> buffers by
|
||||||
|
modifying the <varname>WAL_BUFFERS</varname> parameter. The default
|
||||||
|
number of <acronym>WAL</acronym> buffers is 8. Increasing this
|
||||||
|
value will have an impact on shared memory usage.
|
||||||
|
</para>
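<para>
The effect on shared memory can be estimated with simple arithmetic.
The sketch below assumes that each <acronym>WAL</acronym> buffer holds
one 8 kB log page, which is what the default of 8 buffers and the
default size of 64 kB suggest; it ignores the control structure. The
function name is hypothetical.
</para>
<programlisting>
```python
WAL_PAGE_BYTES = 8 * 1024  # assumption: one WAL buffer holds one 8 kB page


def wal_buffer_bytes(wal_buffers=8):
    """Approximate shared memory used by the WAL buffers, in bytes."""
    return wal_buffers * WAL_PAGE_BYTES
```
</programlisting>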
|
||||||
|
|
||||||
|
<para>
|
||||||
|
<firstterm>Checkpoints</firstterm> are points in the sequence of
|
||||||
|
transactions at which it is guaranteed that the data files have
|
||||||
|
been updated with all information logged before the checkpoint. At
|
||||||
|
checkpoint time, all dirty data pages are flushed to disk and a
|
||||||
|
special checkpoint record is written to the log file. As a result, in
|
||||||
|
the event of a crash, the recovery procedure knows from which record in the
|
||||||
|
log (known as the redo record) it should start the REDO operation,
|
||||||
|
since any changes made to data files before that record are already
|
||||||
|
on disk. After a checkpoint has been made, any log segments written
|
||||||
|
before the redo record are removed, so checkpoints are used to free
|
||||||
|
disk space in the <acronym>WAL</acronym> directory. (When
|
||||||
|
<acronym>WAL</acronym>-based <acronym>BAR</acronym> is implemented,
|
||||||
|
the log segments can be archived instead of just being removed.)
|
||||||
|
The checkpoint maker is also able to create a few log segments for
|
||||||
|
future use, so as to avoid the need for
|
||||||
|
<function>LogInsert</function> or <function>LogFlush</function> to
|
||||||
|
spend time in creating them.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The <acronym>WAL</acronym> log is held on the disk as a set of 16
|
||||||
|
MB files called <firstterm>segments</firstterm>. By default a new
|
||||||
|
segment is created only if more than 75% of the current segment is
|
||||||
|
used. One can instruct the server to create up to 64 log segments
|
||||||
|
at checkpoint time by modifying the <varname>WAL_FILES</varname>
|
||||||
|
configuration parameter.
|
||||||
|
</para>
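<para>
The disk space taken by segments created in advance follows directly
from the segment size. The sketch below assumes that a setting of 0
means that no segments are created in advance; the function name and
the constants are hypothetical.
</para>
<programlisting>
```python
SEGMENT_MB = 16      # size of one log segment, from the text above
MAX_WAL_FILES = 64   # upper limit for WAL_FILES, from the text above


def preallocated_mb(wal_files):
    """Disk space, in MB, used by segments created in advance."""
    if wal_files not in range(MAX_WAL_FILES + 1):
        raise ValueError("WAL_FILES must be between 0 and 64")
    return wal_files * SEGMENT_MB
```
</programlisting>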
|
||||||
|
|
||||||
|
<para>
|
||||||
|
For faster after-crash recovery, it would be better to create
|
||||||
|
checkpoints more often. However, one should balance this against
|
||||||
|
the cost of flushing dirty data pages; in addition, to ensure data
|
||||||
|
page consistency, the first modification of a data page after each
|
||||||
|
checkpoint results in logging the entire page content, thus
|
||||||
|
increasing output to the log and the log's size.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
By default, the postmaster spawns a special backend process to
|
||||||
|
create the next checkpoint 300 seconds after the previous
|
||||||
|
checkpoint's creation. One can change this interval by modifying
|
||||||
|
the <varname>CHECKPOINT_TIMEOUT</varname> parameter. It is also
|
||||||
|
possible to force a checkpoint by using the SQL command
|
||||||
|
<command>CHECKPOINT</command>.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
Setting the <varname>WAL_DEBUG</varname> parameter to any non-zero
|
||||||
|
value will result in each <function>LogInsert</function> and
|
||||||
|
<function>LogFlush</function> <acronym>WAL</acronym> call being
|
||||||
|
logged to standard error. At present, it makes no difference what
|
||||||
|
the non-zero value is. This option may be replaced by a more
|
||||||
|
general mechanism in the future.
|
||||||
|
</para>
|
||||||
|
|
||||||
|
<para>
|
||||||
|
The <varname>COMMIT_DELAY</varname> parameter defines for how long
|
||||||
|
the backend will be forced to sleep after writing a commit record
|
||||||
|
to the log with a <function>LogInsert</function> call but before
|
||||||
|
performing a <function>LogFlush</function>. This delay allows other
|
||||||
|
backends to add their commit records to the log so as to have all
|
||||||
|
of them flushed with a single log sync. Unfortunately, this
|
||||||
|
mechanism is not fully implemented at release 7.1, so there is at
|
||||||
|
present no point in changing this parameter from its default value
|
||||||
|
of 5 microseconds.
|
||||||
|
</para>
|
||||||
|
</sect1>
|
||||||
|
</chapter>
|
||||||
|
|
||||||
|
<!-- Keep this comment at the end of the file
|
||||||
|
Local variables:
|
||||||
|
mode:sgml
|
||||||
|
sgml-omittag:nil
|
||||||
|
sgml-shorttag:t
|
||||||
|
sgml-minimize-attributes:nil
|
||||||
|
sgml-always-quote-attributes:t
|
||||||
|
sgml-indent-step:1
|
||||||
|
sgml-indent-data:t
|
||||||
|
sgml-parent-document:nil
|
||||||
|
sgml-default-dtd-file:"./reference.ced"
|
||||||
|
sgml-exposed-tags:nil
|
||||||
|
sgml-local-catalogs:("/usr/lib/sgml/catalog")
|
||||||
|
sgml-local-ecat-files:nil
|
||||||
|
End:
|
||||||
|
-->
|