postgres/src/backend/utils/adt/lockfuncs.c

/*-------------------------------------------------------------------------
 *
 * lockfuncs.c
 *		Functions for SQL access to various lock-manager capabilities.
 *
 * Copyright (c) 2002-2022, PostgreSQL Global Development Group
 *
 * IDENTIFICATION
 *		src/backend/utils/adt/lockfuncs.c
 *
 *-------------------------------------------------------------------------
 */
#include "postgres.h"
#include "access/htup_details.h"
#include "access/xact.h"
#include "catalog/pg_type.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "storage/predicate_internals.h"
#include "utils/array.h"
#include "utils/builtins.h"

/*
 * This must match enum LockTagType!  Also, be sure to document any changes
 * in the docs for the pg_locks view and for wait event types.
 */
const char *const LockTagTypeNames[] = {
	"relation",
	"extend",
	"frozenid",
	"page",
	"tuple",
	"transactionid",
	"virtualxid",
	"spectoken",
	"object",
	"userlock",
	"advisory"
};

StaticAssertDecl(lengthof(LockTagTypeNames) == (LOCKTAG_ADVISORY + 1),
				 "array length mismatch");

/* This must match enum PredicateLockTargetType (predicate_internals.h) */
static const char *const PredicateLockTagTypeNames[] = {
	"relation",
	"page",
	"tuple"
};

StaticAssertDecl(lengthof(PredicateLockTagTypeNames) == (PREDLOCKTAG_TUPLE + 1),
				 "array length mismatch");

/* Working status for pg_lock_status */
typedef struct
{
	LockData *lockData;			/* state data from lmgr */
	int currIdx;				/* current PROCLOCK index */
	PredicateLockData *predLockData;	/* state data for pred locks */
	int predLockIdx;			/* current index for pred lock */
} PG_Lock_Status;

/* Number of columns in pg_locks output */
#define NUM_LOCK_STATUS_COLUMNS 16

/*
 * VXIDGetDatum - Construct a text representation of a VXID
 *
 * This is currently only used in pg_lock_status, so we put it here.
 */
static Datum
VXIDGetDatum(BackendId bid, LocalTransactionId lxid)
{
	/*
	 * The representation is "<bid>/<lxid>", decimal and unsigned decimal
	 * respectively.  Note that elog.c also knows how to format a vxid.
	 */
	char vxidstr[32];

	snprintf(vxidstr, sizeof(vxidstr), "%d/%u", bid, lxid);

	return CStringGetTextDatum(vxidstr);
}

/*
 * pg_lock_status - produce a view with one row per held or awaited lock mode
 */
Datum
pg_lock_status(PG_FUNCTION_ARGS)
{
	FuncCallContext *funcctx;
	PG_Lock_Status *mystatus;
	LockData *lockData;
	PredicateLockData *predLockData;

	if (SRF_IS_FIRSTCALL())
	{
		TupleDesc tupdesc;
		MemoryContext oldcontext;

		/* create a function context for cross-call persistence */
		funcctx = SRF_FIRSTCALL_INIT();

		/*
		 * switch to memory context appropriate for multiple function calls
		 */
		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

		/* build tupdesc for result tuples */
		/* this had better match function's declaration in pg_proc.h */
		tupdesc = CreateTemplateTupleDesc(NUM_LOCK_STATUS_COLUMNS);
		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "locktype",
						   TEXTOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "database",
						   OIDOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "relation",
						   OIDOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "page",
						   INT4OID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "tuple",
						   INT2OID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "virtualxid",
						   TEXTOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 7, "transactionid",
						   XIDOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 8, "classid",
						   OIDOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 9, "objid",
						   OIDOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 10, "objsubid",
						   INT2OID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 11, "virtualtransaction",
						   TEXTOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 12, "pid",
						   INT4OID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 13, "mode",
						   TEXTOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 14, "granted",
						   BOOLOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 15, "fastpath",
						   BOOLOID, -1, 0);
		TupleDescInitEntry(tupdesc, (AttrNumber) 16, "waitstart",
						   TIMESTAMPTZOID, -1, 0);

		funcctx->tuple_desc = BlessTupleDesc(tupdesc);

		/*
		 * Collect all the locking information that we will format and send
		 * out as a result set.
		 */
		mystatus = (PG_Lock_Status *) palloc(sizeof(PG_Lock_Status));
		funcctx->user_fctx = (void *) mystatus;

		mystatus->lockData = GetLockStatusData();
		mystatus->currIdx = 0;
		mystatus->predLockData = GetPredicateLockStatusData();
		mystatus->predLockIdx = 0;

		MemoryContextSwitchTo(oldcontext);
	}

	funcctx = SRF_PERCALL_SETUP();
	mystatus = (PG_Lock_Status *) funcctx->user_fctx;
	lockData = mystatus->lockData;

	while (mystatus->currIdx < lockData->nelements)
	{
		bool granted;
		LOCKMODE mode = 0;
		const char *locktypename;
		char tnbuf[32];
		Datum values[NUM_LOCK_STATUS_COLUMNS];
		bool nulls[NUM_LOCK_STATUS_COLUMNS];
		HeapTuple tuple;
		Datum result;
		LockInstanceData *instance;

		instance = &(lockData->locks[mystatus->currIdx]);

		/*
		 * Look to see if there are any held lock modes in this PROCLOCK. If
		 * so, report, and destructively modify lockData so we don't report
		 * again.
		 */
		granted = false;
		if (instance->holdMask)
		{
			for (mode = 0; mode < MAX_LOCKMODES; mode++)
			{
				if (instance->holdMask & LOCKBIT_ON(mode))
				{
					granted = true;
					instance->holdMask &= LOCKBIT_OFF(mode);
					break;
				}
			}
		}

		/*
		 * If no (more) held modes to report, see if PROC is waiting for a
		 * lock on this lock.
		 */
		if (!granted)
		{
			if (instance->waitLockMode != NoLock)
			{
				/* Yes, so report it with proper mode */
				mode = instance->waitLockMode;

				/*
				 * We are now done with this PROCLOCK, so advance pointer to
				 * continue with next one on next call.
				 */
				mystatus->currIdx++;
			}
			else
			{
				/*
				 * Okay, we've displayed all the locks associated with this
				 * PROCLOCK, proceed to the next one.
				 */
				mystatus->currIdx++;
				continue;
			}
		}

		/*
		 * Form tuple with appropriate data.
		 */
		MemSet(values, 0, sizeof(values));
		MemSet(nulls, false, sizeof(nulls));

		if (instance->locktag.locktag_type <= LOCKTAG_LAST_TYPE)
			locktypename = LockTagTypeNames[instance->locktag.locktag_type];
		else
		{
			snprintf(tnbuf, sizeof(tnbuf), "unknown %d",
					 (int) instance->locktag.locktag_type);
			locktypename = tnbuf;
		}
		values[0] = CStringGetTextDatum(locktypename);

switch ((LockTagType) instance->locktag.locktag_type)
{
case LOCKTAG_RELATION:
case LOCKTAG_RELATION_EXTEND:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
nulls[3] = true;
nulls[4] = true;
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
nulls[8] = true;
nulls[9] = true;
break;
case LOCKTAG_DATABASE_FROZEN_IDS:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
nulls[2] = true;
nulls[3] = true;
nulls[4] = true;
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
nulls[8] = true;
nulls[9] = true;
break;
case LOCKTAG_PAGE:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
values[3] = UInt32GetDatum(instance->locktag.locktag_field3);
nulls[4] = true;
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
nulls[8] = true;
nulls[9] = true;
break;
case LOCKTAG_TUPLE:
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[2] = ObjectIdGetDatum(instance->locktag.locktag_field2);
values[3] = UInt32GetDatum(instance->locktag.locktag_field3);
values[4] = UInt16GetDatum(instance->locktag.locktag_field4);
nulls[5] = true;
nulls[6] = true;
nulls[7] = true;
nulls[8] = true;
nulls[9] = true;
break;
case LOCKTAG_TRANSACTION:
values[6] =
TransactionIdGetDatum(instance->locktag.locktag_field1);
nulls[1] = true;
nulls[2] = true;
nulls[3] = true;
nulls[4] = true;
nulls[5] = true;
nulls[7] = true;
nulls[8] = true;
nulls[9] = true;
break;
case LOCKTAG_VIRTUALTRANSACTION:
values[5] = VXIDGetDatum(instance->locktag.locktag_field1,
instance->locktag.locktag_field2);
nulls[1] = true;
nulls[2] = true;
nulls[3] = true;
nulls[4] = true;
nulls[6] = true;
nulls[7] = true;
nulls[8] = true;
nulls[9] = true;
break;
case LOCKTAG_OBJECT:
case LOCKTAG_USERLOCK:
case LOCKTAG_ADVISORY:
default: /* treat unknown locktags like OBJECT */
values[1] = ObjectIdGetDatum(instance->locktag.locktag_field1);
values[7] = ObjectIdGetDatum(instance->locktag.locktag_field2);
values[8] = ObjectIdGetDatum(instance->locktag.locktag_field3);
values[9] = Int16GetDatum(instance->locktag.locktag_field4);
nulls[2] = true;
nulls[3] = true;
nulls[4] = true;
nulls[5] = true;
nulls[6] = true;
break;
}
values[10] = VXIDGetDatum(instance->backend, instance->lxid);
if (instance->pid != 0)
values[11] = Int32GetDatum(instance->pid);
else
nulls[11] = true;
values[12] = CStringGetTextDatum(GetLockmodeName(instance->locktag.locktag_lockmethodid, mode));
values[13] = BoolGetDatum(granted);
values[14] = BoolGetDatum(instance->fastpath);
if (!granted && instance->waitStart != 0)
values[15] = TimestampTzGetDatum(instance->waitStart);
else
nulls[15] = true;
tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
result = HeapTupleGetDatum(tuple);
SRF_RETURN_NEXT(funcctx, result);
}
/*
* Have returned all regular locks. Now start on the SIREAD predicate
* locks.
*/
predLockData = mystatus->predLockData;
if (mystatus->predLockIdx < predLockData->nelements)
{
PredicateLockTargetType lockType;
PREDICATELOCKTARGETTAG *predTag = &(predLockData->locktags[mystatus->predLockIdx]);
SERIALIZABLEXACT *xact = &(predLockData->xacts[mystatus->predLockIdx]);
Datum values[NUM_LOCK_STATUS_COLUMNS];
bool nulls[NUM_LOCK_STATUS_COLUMNS];
HeapTuple tuple;
Datum result;
mystatus->predLockIdx++;
/*
* Form tuple with appropriate data.
*/
MemSet(values, 0, sizeof(values));
MemSet(nulls, false, sizeof(nulls));
/* lock type */
lockType = GET_PREDICATELOCKTARGETTAG_TYPE(*predTag);
values[0] = CStringGetTextDatum(PredicateLockTagTypeNames[lockType]);
/* lock target */
values[1] = GET_PREDICATELOCKTARGETTAG_DB(*predTag);
values[2] = GET_PREDICATELOCKTARGETTAG_RELATION(*predTag);
if (lockType == PREDLOCKTAG_TUPLE)
values[4] = GET_PREDICATELOCKTARGETTAG_OFFSET(*predTag);
else
nulls[4] = true;
if ((lockType == PREDLOCKTAG_TUPLE) ||
(lockType == PREDLOCKTAG_PAGE))
values[3] = GET_PREDICATELOCKTARGETTAG_PAGE(*predTag);
else
nulls[3] = true;
/* these fields are targets for other types of locks */
nulls[5] = true; /* virtualxid */
nulls[6] = true; /* transactionid */
nulls[7] = true; /* classid */
nulls[8] = true; /* objid */
nulls[9] = true; /* objsubid */
/* lock holder */
values[10] = VXIDGetDatum(xact->vxid.backendId,
xact->vxid.localTransactionId);
if (xact->pid != 0)
values[11] = Int32GetDatum(xact->pid);
else
nulls[11] = true;
/*
* Lock mode. Currently all predicate locks are SIReadLocks, which are
* always held (never waiting) and have no fast path
*/
values[12] = CStringGetTextDatum("SIReadLock");
values[13] = BoolGetDatum(true);
values[14] = BoolGetDatum(false);
nulls[15] = true;
tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
result = HeapTupleGetDatum(tuple);
SRF_RETURN_NEXT(funcctx, result);
}
SRF_RETURN_DONE(funcctx);
}
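/*
 * Illustration only, not part of lockfuncs.c: a minimal standalone sketch of
 * how the holdMask loop in pg_lock_status reports one held mode per call and
 * clears its bit, so that repeated calls walk through every held mode.  The
 * names demo_next_held_mode and DEMO_MAX_MODES are hypothetical, and the
 * value 10 is only an assumed stand-in for MAX_LOCKMODES.
 */

```c
#define DEMO_MAX_MODES 10		/* assumed stand-in for MAX_LOCKMODES */

/*
 * Return the lowest lock mode whose bit is set in *holdMask and clear that
 * bit, or return -1 when no held modes remain to report.
 */
static int
demo_next_held_mode(int *holdMask)
{
	int			mode;

	for (mode = 0; mode < DEMO_MAX_MODES; mode++)
	{
		if (*holdMask & (1 << mode))
		{
			/* same effect as "holdMask &= LOCKBIT_OFF(mode)" in the loop above */
			*holdMask &= ~(1 << mode);
			return mode;
		}
	}
	return -1;
}
```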
/*
* pg_blocking_pids - produce an array of the PIDs blocking given PID
*
* The reported PIDs are those that hold a lock conflicting with blocked_pid's
* current request (hard block), or are requesting such a lock and are ahead
* of blocked_pid in the lock's wait queue (soft block).
*
* In parallel-query cases, we report all PIDs blocking any member of the
* given PID's lock group, and the reported PIDs are those of the blocking
* PIDs' lock group leaders. This allows callers to compare the result to
* lists of clients' pg_backend_pid() results even during a parallel query.
*
* Parallel query makes it possible for there to be duplicate PIDs in the
* result (either because multiple waiters are blocked by same PID, or
* because multiple blockers have same group leader PID). We do not bother
* to eliminate such duplicates from the result.
*
* We need not consider predicate locks here, since those don't block anything.
*/
Datum
pg_blocking_pids(PG_FUNCTION_ARGS)
{
int blocked_pid = PG_GETARG_INT32(0);
Datum *arrayelems;
int narrayelems;
BlockedProcsData *lockData; /* state data from lmgr */
int i,
j;
/* Collect a snapshot of lock manager state */
lockData = GetBlockerStatusData(blocked_pid);
/* We can't need more output entries than there are reported PROCLOCKs */
arrayelems = (Datum *) palloc(lockData->nlocks * sizeof(Datum));
narrayelems = 0;
/* For each blocked proc in the lock group ... */
for (i = 0; i < lockData->nprocs; i++)
{
BlockedProcData *bproc = &lockData->procs[i];
LockInstanceData *instances = &lockData->locks[bproc->first_lock];
int *preceding_waiters = &lockData->waiter_pids[bproc->first_waiter];
LockInstanceData *blocked_instance;
LockMethod lockMethodTable;
int conflictMask;
/*
* Locate the blocked proc's own entry in the LockInstanceData array.
* There should be exactly one matching entry.
*/
blocked_instance = NULL;
for (j = 0; j < bproc->num_locks; j++)
{
LockInstanceData *instance = &(instances[j]);
if (instance->pid == bproc->pid)
{
Assert(blocked_instance == NULL);
blocked_instance = instance;
}
}
Assert(blocked_instance != NULL);
lockMethodTable = GetLockTagsMethodTable(&(blocked_instance->locktag));
conflictMask = lockMethodTable->conflictTab[blocked_instance->waitLockMode];
/* Now scan the PROCLOCK data for conflicting procs */
for (j = 0; j < bproc->num_locks; j++)
{
LockInstanceData *instance = &(instances[j]);
/* A proc never blocks itself, so ignore that entry */
if (instance == blocked_instance)
continue;
/* Members of same lock group never block each other, either */
if (instance->leaderPid == blocked_instance->leaderPid)
continue;
if (conflictMask & instance->holdMask)
{
/* hard block: blocked by lock already held by this entry */
}
else if (instance->waitLockMode != NoLock &&
(conflictMask & LOCKBIT_ON(instance->waitLockMode)))
{
/* conflict in lock requests; who's in front in wait queue? */
bool ahead = false;
int k;
for (k = 0; k < bproc->num_waiters; k++)
{
if (preceding_waiters[k] == instance->pid)
{
/* soft block: this entry is ahead of blocked proc */
ahead = true;
break;
}
}
if (!ahead)
continue; /* not blocked by this entry */
}
else
{
/* not blocked by this entry */
continue;
}
/* blocked by this entry, so emit a record */
arrayelems[narrayelems++] = Int32GetDatum(instance->leaderPid);
}
}
/* Assert we didn't overrun arrayelems[] */
Assert(narrayelems <= lockData->nlocks);
/* Construct array, using hardwired knowledge about int4 type */
PG_RETURN_ARRAYTYPE_P(construct_array(arrayelems, narrayelems,
INT4OID,
sizeof(int32), true, TYPALIGN_INT));
}
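/*
 * Illustration only, not part of lockfuncs.c: a simplified sketch of the
 * hard-block / soft-block test applied to each entry in pg_blocking_pids.
 * DemoLockInstance, DEMO_NO_LOCK, DEMO_BIT and demo_entry_blocks are
 * hypothetical names; the real code works on LockInstanceData and also skips
 * the blocked proc's own entry and members of its lock group.
 */

```c
#include <stdbool.h>

#define DEMO_NO_LOCK	0
#define DEMO_BIT(mode)	(1 << (mode))

typedef struct DemoLockInstance
{
	int			pid;			/* process owning this entry */
	int			holdMask;		/* bitmask of lock modes already held */
	int			waitLockMode;	/* mode being awaited, or DEMO_NO_LOCK */
} DemoLockInstance;

/*
 * Does "other" block a waiter whose requested mode conflicts with the modes
 * in conflictMask?  "preceding" lists the PIDs queued ahead of the waiter.
 */
static bool
demo_entry_blocks(const DemoLockInstance *other, int conflictMask,
				  const int *preceding, int npreceding)
{
	int			k;

	/* hard block: other already holds a conflicting mode */
	if (conflictMask & other->holdMask)
		return true;

	/* soft block: other requests a conflicting mode and is ahead in the queue */
	if (other->waitLockMode != DEMO_NO_LOCK &&
		(conflictMask & DEMO_BIT(other->waitLockMode)))
	{
		for (k = 0; k < npreceding; k++)
		{
			if (preceding[k] == other->pid)
				return true;
		}
	}

	return false;
}
```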
/*
* pg_safe_snapshot_blocking_pids - produce an array of the PIDs blocking
* given PID from getting a safe snapshot
*
* XXX this does not consider parallel-query cases; not clear how big a
* problem that is in practice
*/
Datum
pg_safe_snapshot_blocking_pids(PG_FUNCTION_ARGS)
{
int blocked_pid = PG_GETARG_INT32(0);
int *blockers;
int num_blockers;
Datum *blocker_datums;
/* A buffer big enough for any possible blocker list without truncation */
blockers = (int *) palloc(MaxBackends * sizeof(int));
/* Collect a snapshot of processes waited for by GetSafeSnapshot */
num_blockers =
GetSafeSnapshotBlockingPids(blocked_pid, blockers, MaxBackends);
/* Convert int array to Datum array */
if (num_blockers > 0)
{
int i;
blocker_datums = (Datum *) palloc(num_blockers * sizeof(Datum));
for (i = 0; i < num_blockers; ++i)
blocker_datums[i] = Int32GetDatum(blockers[i]);
}
else
blocker_datums = NULL;
/* Construct array, using hardwired knowledge about int4 type */
PG_RETURN_ARRAYTYPE_P(construct_array(blocker_datums, num_blockers,
INT4OID,
sizeof(int32), true, TYPALIGN_INT));
}
/*
* pg_isolation_test_session_is_blocked - support function for isolationtester
*
* Check if specified PID is blocked by any of the PIDs listed in the second
* argument. Currently, this looks for blocking caused by waiting for
* heavyweight locks or safe snapshots. We ignore blockage caused by PIDs
* not directly under the isolationtester's control, eg autovacuum.
*
* This is an undocumented function intended for use by the isolation tester,
* and may change in future releases as required for testing purposes.
*/
Datum
pg_isolation_test_session_is_blocked(PG_FUNCTION_ARGS)
{
int blocked_pid = PG_GETARG_INT32(0);
ArrayType *interesting_pids_a = PG_GETARG_ARRAYTYPE_P(1);
ArrayType *blocking_pids_a;
int32 *interesting_pids;
int32 *blocking_pids;
int num_interesting_pids;
int num_blocking_pids;
int dummy;
int i,
j;
/* Validate the passed-in array */
Assert(ARR_ELEMTYPE(interesting_pids_a) == INT4OID);
if (array_contains_nulls(interesting_pids_a))
elog(ERROR, "array must not contain nulls");
interesting_pids = (int32 *) ARR_DATA_PTR(interesting_pids_a);
num_interesting_pids = ArrayGetNItems(ARR_NDIM(interesting_pids_a),
ARR_DIMS(interesting_pids_a));
/*
* Get the PIDs of all sessions blocking the given session's attempt to
* acquire heavyweight locks.
*/
blocking_pids_a =
DatumGetArrayTypeP(DirectFunctionCall1(pg_blocking_pids, blocked_pid));
Assert(ARR_ELEMTYPE(blocking_pids_a) == INT4OID);
Assert(!array_contains_nulls(blocking_pids_a));
blocking_pids = (int32 *) ARR_DATA_PTR(blocking_pids_a);
num_blocking_pids = ArrayGetNItems(ARR_NDIM(blocking_pids_a),
ARR_DIMS(blocking_pids_a));
/*
* Check if any of these are in the list of interesting PIDs, that being
* the sessions that the isolation tester is running. We don't use
* "arrayoverlaps" here, because it would lead to cache lookups and one of
* our goals is to run quickly with debug_discard_caches > 0. We expect
* blocking_pids to be usually empty and otherwise a very small number in
* isolation tester cases, so make that the outer loop of a naive search
* for a match.
*/
for (i = 0; i < num_blocking_pids; i++)
for (j = 0; j < num_interesting_pids; j++)
{
if (blocking_pids[i] == interesting_pids[j])
PG_RETURN_BOOL(true);
}
/*
* Check if blocked_pid is waiting for a safe snapshot. We could in
* theory check the resulting array of blocker PIDs against the
* interesting PIDs list, but since there is no danger of autovacuum
* blocking GetSafeSnapshot there seems to be no point in expending cycles
* on allocating a buffer and searching for overlap; so it's presently
* sufficient for the isolation tester's purposes to use a single element
* buffer and check if the number of safe snapshot blockers is non-zero.
*/
if (GetSafeSnapshotBlockingPids(blocked_pid, &dummy, 1) > 0)
PG_RETURN_BOOL(true);
PG_RETURN_BOOL(false);
}
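/*
 * Illustration only, not part of lockfuncs.c: a standalone sketch of the
 * naive overlap search used above in place of "arrayoverlaps".  The name
 * demo_pids_overlap is hypothetical.  The blocking list is the outer loop
 * because it is expected to be empty or very short, in which case the
 * search ends almost immediately.
 */

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if any PID in "blocking" also appears in "interesting". */
static bool
demo_pids_overlap(const int32_t *blocking, int nblocking,
				  const int32_t *interesting, int ninteresting)
{
	int			i,
				j;

	for (i = 0; i < nblocking; i++)
	{
		for (j = 0; j < ninteresting; j++)
		{
			if (blocking[i] == interesting[j])
				return true;
		}
	}
	return false;
}
```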
/*
* Functions for manipulating advisory locks
*
* We make use of the locktag fields as follows:
*
* field1: MyDatabaseId ... ensures locks are local to each database
* field2: first of 2 int4 keys, or high-order half of an int8 key
* field3: second of 2 int4 keys, or low-order half of an int8 key
* field4: 1 if using an int8 key, 2 if using 2 int4 keys
*/
#define SET_LOCKTAG_INT64(tag, key64) \
	SET_LOCKTAG_ADVISORY(tag, \
						 MyDatabaseId, \
						 (uint32) ((key64) >> 32), \
						 (uint32) (key64), \
						 1)
#define SET_LOCKTAG_INT32(tag, key1, key2) \
	SET_LOCKTAG_ADVISORY(tag, MyDatabaseId, key1, key2, 2)
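/*
 * Illustrative sketch only, not part of lockfuncs.c: how an int8 key maps
 * onto the two 32-bit locktag fields described above. SketchAdvisoryKey,
 * sketch_split_int64_key and sketch_join_int64_key are hypothetical names
 * introduced for this example. Unlike the macro, the sketch casts to an
 * unsigned type before shifting, an assumption made here so the result
 * does not depend on how the compiler shifts negative values.
 */
```c
#include <stdint.h>

/* Hypothetical stand-in for the two key-carrying fields of a lock tag. */
typedef struct SketchAdvisoryKey
{
	uint32_t	field2;			/* high-order half of the int8 key */
	uint32_t	field3;			/* low-order half of the int8 key */
} SketchAdvisoryKey;

/* Split a 64-bit key into its high and low 32-bit halves. */
SketchAdvisoryKey
sketch_split_int64_key(int64_t key64)
{
	SketchAdvisoryKey k;

	k.field2 = (uint32_t) ((uint64_t) key64 >> 32);
	k.field3 = (uint32_t) (uint64_t) key64;
	return k;
}

/* Reassemble the original key from the two halves. */
int64_t
sketch_join_int64_key(SketchAdvisoryKey k)
{
	return (int64_t) (((uint64_t) k.field2 << 32) | (uint64_t) k.field3);
}
```
/*
 * Because field4 differs between the two key styles (1 for an int8 key,
 * 2 for a pair of int4 keys), an int8 key and an (int4, int4) pair with
 * the same bit pattern still produce distinct lock tags.
 */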
/*
* pg_advisory_lock(int8) - acquire exclusive lock on an int8 key
*/
Datum
pg_advisory_lock_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
SET_LOCKTAG_INT64(tag, key);
(void) LockAcquire(&tag, ExclusiveLock, true, false);
PG_RETURN_VOID();
}
/*
* pg_advisory_xact_lock(int8) - acquire transaction-scoped
* exclusive lock on an int8 key
*/
Datum
pg_advisory_xact_lock_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
SET_LOCKTAG_INT64(tag, key);
(void) LockAcquire(&tag, ExclusiveLock, false, false);
PG_RETURN_VOID();
}
/*
* pg_advisory_lock_shared(int8) - acquire share lock on an int8 key
*/
Datum
pg_advisory_lock_shared_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
SET_LOCKTAG_INT64(tag, key);
(void) LockAcquire(&tag, ShareLock, true, false);
PG_RETURN_VOID();
}
/*
* pg_advisory_xact_lock_shared(int8) - acquire transaction-scoped
* share lock on an int8 key
*/
Datum
pg_advisory_xact_lock_shared_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
SET_LOCKTAG_INT64(tag, key);
(void) LockAcquire(&tag, ShareLock, false, false);
PG_RETURN_VOID();
}
/*
* pg_try_advisory_lock(int8) - acquire exclusive lock on an int8 key, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_lock_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT64(tag, key);
res = LockAcquire(&tag, ExclusiveLock, true, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
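/*
 * Illustrative sketch only, not part of lockfuncs.c: how the "no wait"
 * variants turn the lock manager's result into a boolean. The enum below
 * is a hypothetical stand-in assumed to mirror LockAcquireResult; only the
 * LOCKACQUIRE_NOT_AVAIL comparison is taken from this file. Every outcome
 * other than "not available" counts as success, which includes the case
 * where the session already holds the lock.
 */
```c
#include <stdbool.h>

/* Hypothetical stand-in for the lock manager's acquire result codes. */
typedef enum SketchAcquireResult
{
	SKETCH_LOCKACQUIRE_NOT_AVAIL,		/* could not get it without waiting */
	SKETCH_LOCKACQUIRE_OK,				/* lock newly acquired */
	SKETCH_LOCKACQUIRE_ALREADY_HELD,	/* count on an existing lock bumped */
	SKETCH_LOCKACQUIRE_ALREADY_CLEAR	/* already held, nothing more to do */
} SketchAcquireResult;

/* Map an acquire result to the boolean a try-lock function reports. */
bool
sketch_try_lock_succeeded(SketchAcquireResult res)
{
	return res != SKETCH_LOCKACQUIRE_NOT_AVAIL;
}
```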
/*
* pg_try_advisory_xact_lock(int8) - acquire transaction-scoped
* exclusive lock on an int8 key, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_xact_lock_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT64(tag, key);
res = LockAcquire(&tag, ExclusiveLock, false, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
/*
* pg_try_advisory_lock_shared(int8) - acquire share lock on an int8 key, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_lock_shared_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT64(tag, key);
res = LockAcquire(&tag, ShareLock, true, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
/*
* pg_try_advisory_xact_lock_shared(int8) - acquire transaction-scoped
* share lock on an int8 key, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_xact_lock_shared_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT64(tag, key);
res = LockAcquire(&tag, ShareLock, false, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
/*
* pg_advisory_unlock(int8) - release exclusive lock on an int8 key
*
* Returns true if successful, false if lock was not held
*/
Datum
pg_advisory_unlock_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
bool res;
SET_LOCKTAG_INT64(tag, key);
res = LockRelease(&tag, ExclusiveLock, true);
PG_RETURN_BOOL(res);
}
/*
* pg_advisory_unlock_shared(int8) - release share lock on an int8 key
*
* Returns true if successful, false if lock was not held
*/
Datum
pg_advisory_unlock_shared_int8(PG_FUNCTION_ARGS)
{
int64 key = PG_GETARG_INT64(0);
LOCKTAG tag;
bool res;
SET_LOCKTAG_INT64(tag, key);
res = LockRelease(&tag, ShareLock, true);
PG_RETURN_BOOL(res);
}
/*
* pg_advisory_lock(int4, int4) - acquire exclusive lock on 2 int4 keys
*/
Datum
pg_advisory_lock_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
SET_LOCKTAG_INT32(tag, key1, key2);
(void) LockAcquire(&tag, ExclusiveLock, true, false);
PG_RETURN_VOID();
}
/*
* pg_advisory_xact_lock(int4, int4) - acquire transaction-scoped
* exclusive lock on 2 int4 keys
*/
Datum
pg_advisory_xact_lock_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
SET_LOCKTAG_INT32(tag, key1, key2);
(void) LockAcquire(&tag, ExclusiveLock, false, false);
PG_RETURN_VOID();
}
/*
* pg_advisory_lock_shared(int4, int4) - acquire share lock on 2 int4 keys
*/
Datum
pg_advisory_lock_shared_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
SET_LOCKTAG_INT32(tag, key1, key2);
(void) LockAcquire(&tag, ShareLock, true, false);
PG_RETURN_VOID();
}
/*
* pg_advisory_xact_lock_shared(int4, int4) - acquire transaction-scoped
* share lock on 2 int4 keys
*/
Datum
pg_advisory_xact_lock_shared_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
SET_LOCKTAG_INT32(tag, key1, key2);
(void) LockAcquire(&tag, ShareLock, false, false);
PG_RETURN_VOID();
}
/*
* pg_try_advisory_lock(int4, int4) - acquire exclusive lock on 2 int4 keys, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_lock_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT32(tag, key1, key2);
res = LockAcquire(&tag, ExclusiveLock, true, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
/*
* pg_try_advisory_xact_lock(int4, int4) - acquire transaction-scoped
* exclusive lock on 2 int4 keys, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_xact_lock_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT32(tag, key1, key2);
res = LockAcquire(&tag, ExclusiveLock, false, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
/*
* pg_try_advisory_lock_shared(int4, int4) - acquire share lock on 2 int4 keys, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_lock_shared_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT32(tag, key1, key2);
res = LockAcquire(&tag, ShareLock, true, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
/*
* pg_try_advisory_xact_lock_shared(int4, int4) - acquire transaction-scoped
* share lock on 2 int4 keys, no wait
*
* Returns true if successful, false if lock not available
*/
Datum
pg_try_advisory_xact_lock_shared_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
LockAcquireResult res;
SET_LOCKTAG_INT32(tag, key1, key2);
res = LockAcquire(&tag, ShareLock, false, true);
PG_RETURN_BOOL(res != LOCKACQUIRE_NOT_AVAIL);
}
/*
* pg_advisory_unlock(int4, int4) - release exclusive lock on 2 int4 keys
*
* Returns true if successful, false if lock was not held
*/
Datum
pg_advisory_unlock_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
bool res;
SET_LOCKTAG_INT32(tag, key1, key2);
res = LockRelease(&tag, ExclusiveLock, true);
PG_RETURN_BOOL(res);
}
/*
* pg_advisory_unlock_shared(int4, int4) - release share lock on 2 int4 keys
*
* Returns true if successful, false if lock was not held
*/
Datum
pg_advisory_unlock_shared_int4(PG_FUNCTION_ARGS)
{
int32 key1 = PG_GETARG_INT32(0);
int32 key2 = PG_GETARG_INT32(1);
LOCKTAG tag;
bool res;
SET_LOCKTAG_INT32(tag, key1, key2);
res = LockRelease(&tag, ShareLock, true);
PG_RETURN_BOOL(res);
}
/*
* pg_advisory_unlock_all() - release all advisory locks
*/
Datum
pg_advisory_unlock_all(PG_FUNCTION_ARGS)
{
LockReleaseSession(USER_LOCKMETHOD);
PG_RETURN_VOID();
}