The local buffer manager is no longer used for newly-created relations
(unless they are TEMP); a new non-TEMP relation goes through the shared
bufmgr and thus will participate normally in checkpoints. But TEMP relations
use the local buffer manager throughout their lifespan. Also, operations
in TEMP relations are not logged in WAL, thus improving performance.
Since it's no longer necessary to fsync relations as they move out of the
local buffers into shared buffers, quite a lot of smgr.c/md.c/fd.c code
is no longer needed and has been removed: there's no concept of a dirty
relation anymore in md.c/fd.c, and we never fsync anything but WAL.
Still TODO: improve local buffer management algorithms so that it would
be reasonable to increase NLocBuffer.
lines of code into internal routines (drop_relfilenode_buffers,
release_buffer) and by hiding unused routines (PrintBufferDescs,
PrintPinnedBufs) behind #ifdef NOT_USED. Remove AbortBufferIO()
declaration from bufmgr.c (already declared in bufmgr.h)
Manfred Koizar
was in the thread "make BufferGetBlockNumber() a macro". Tom
objected to the original patch, so I prepared a new one which
doesn't change BufferGetBlockNumber() into a macro, it just
cleans up some comments and fixes an assertion. The patch
is attached.
Neil Conway
stub) into the rest of the system. Adopt a cleaner approach to preventing
deadlock in concurrent heap_updates: allow RelationGetBufferForTuple to
select any page of the rel, and put the onus on it to lock both buffers
in a consistent order. Remove no-longer-needed isExtend hack from
API of ReleaseAndReadBuffer.
appropriate pin-count manipulation, and instead use ReleaseAndReadBuffer.
Make use of the fact that the passed-in buffer (if there is one) must
be pinned to avoid grabbing the bufmgr spinlock when we are able to
return this same buffer. Eliminate unnecessary 'previous tuple' and
'next tuple' fields of HeapScanDesc and IndexScanDesc, thereby removing
a whole lot of bookkeeping from heap_getnext() and related routines.
when we need to move to a new page; as long as we can insert the new
tuple on the same page as before, we only need LockBuffer and not the
expensive stuff. Also, twiddle bufmgr interfaces to avoid redundant
lseeks in RelationGetBufferForTuple and BufferAlloc. Successive inserts
now require one lseek per page added, rather than one per tuple with
several additional ones at each page boundary as happened before.
Lock contention when multiple backends are inserting in same table
is also greatly reduced.
to ensure that we have released buffer refcounts and so forth, rather than
putting ad-hoc operations before (some of the calls to) proc_exit. Add
commentary to discourage future hackers from repeating that mistake.
included by everything that includes bufmgr.h --- it's supposed to be
internals, after all, not part of the API! This fixes the conflict
against FreeBSD headers reported by Rosenman, by making it unnecessary
for s_lock.h to be included by plperl.c.
IPC key assignment will now work correctly even when multiple postmasters
are using same logical port number (which is possible given -k switch).
There is only one shared-mem segment per postmaster now, not 3.
Rip out broken code for non-TAS case in bufmgr and xlog, substitute a
complete S_LOCK emulation using semaphores in spin.c. TAS and non-TAS
logic is now exactly the same.
When deadlock is detected, "Deadlock detected" is now the elog(ERROR)
message, rather than a NOTICE that comes out before an unhelpful ERROR.
(WAL logging for this is not done yet, however.) Clean up a number of really
crufty things that are no longer needed now that DROP behaves nicely. Make
temp table mapper do the right things when drop or rename affecting a temp
table is rolled back. Also, remove "relation modified while in use" error
check, in favor of locking tables at first reference and holding that lock
throughout the statement.
as MaxHeapAttributeNumber. Increase MaxAttrSize to something more
reasonable (given what it's used for, namely checking char(n) declarations,
I didn't make it the full 1G that it could theoretically be --- 10Mb
seemed a more reasonable number). Improve calculation of MaxTupleSize.
Hiroshi. ReleaseRelationBuffers now removes rel's buffers from pool,
instead of merely marking them nondirty. The old code would leave valid
buffers for a deleted relation, which didn't cause any known problems
but can't possibly be a good idea. There were several places which called
ReleaseRelationBuffers *and* FlushRelationBuffers, which is now
unnecessary; but there were others that did not. FlushRelationBuffers
no longer emits a warning notice if it finds dirty buffers to flush,
because with the current bufmgr behavior that's not an unexpected
condition. Also, FlushRelationBuffers will flush out all dirty buffers
for the relation regardless of block number. This ensures that
pg_upgrade's expectations are met about tuple on-row status bits being
up-to-date on disk. Lastly, tweak BufTableDelete() to clear the
buffer's tag so that no one can mistake it for being a still-valid
buffer for the page it once held. Formerly, the buffer would not be
found by buffer hashtable searches after BufTableDelete(), but it would
still be thought to belong to its old relation by the routines that
sequentially scan the shared-buffer array. Again I know of no bugs
caused by that, but it still can't be a good idea.
as a shared dirtybit for each shared buffer. The shared dirtybit still
controls writing the buffer, but the local bit controls whether we need
to fsync the buffer's file. This arrangement fixes a bug that allowed
some required fsyncs to be missed, and should improve performance as well.
For more info see my post of same date on pghackers.
In the event of an elog() while the mode was set to immediate write,
there was no way for it to be set back to the normal delayed write.
The mechanism was a waste of space and cycles anyway, since the only user
was varsup.c, which could perfectly well call FlushBuffer directly.
Now it does just that, and the notion of a write mode is gone.
* Buffer refcount cleanup (per my "progress report" to pghackers, 9/22).
* Add links to backend PROC structs to sinval's array of per-backend info,
and use these links for routines that need to check the state of all
backends (rather than the slow, complicated search of the ShmemIndex
hashtable that was used before). Add databaseOID to PROC structs.
* Use this to implement an interlock that prevents DESTROY DATABASE of
a database containing running backends. (It's a little tricky to prevent
a concurrently-starting backend from getting in there, since the new
backend is not able to lock anything at the time it tries to look up
its database in pg_database. My solution is to recheck that the DB is
OK at the end of InitPostgres. It may not be a 100% solution, but it's
a lot better than no interlock at all...)
* In ALTER TABLE RENAME, flush buffers for the relation before doing the
rename of the physical files, to ensure we don't get failures later from
mdblindwrt().
* Update TRUNCATE patch so that it actually compiles against current
sources :-(.
You should do "make clean all" after pulling these changes.
no longer returns buffer pointer, can be gotten from scan;
descriptor; bootstrap can create multi-key indexes;
pg_procname index now is multi-key index; oidint2, oidint4, oidname
are gone (must be removed from regression tests); use System Cache
rather than sequential scan in many places; heap_modifytuple no
longer takes buffer parameter; remove unused buffer parameter in
a few other functions; oid8 is not index-able; remove some use of
single-character variable names; cleanup Buffer variables usage
and scan descriptor looping; cleaned up allocation and freeing of
tuples; 18k lines of diff;
==========================================
What follows is a set of diffs that cleans up the usage of BLCKSZ.
As a side effect, the person compiling the code can change the
value of BLCKSZ _at_their_own_risk_. By that, I mean that I've
tried it here at 4096 and 16384 with no ill-effects. A value
of 4096 _shouldn't_ affect much as far as the kernel/file system
goes, but making it bigger than 8192 can have severe consequences
if you don't know what you're doing. 16394 worked for me, _BUT_
when I went to 32768 and did an initdb, the SCSI driver broke and
the partition that I was running under went to hell in a hand
basket. Had to reboot and do a good bit of fsck'ing to fix things up.
The patch can be safely applied though. Just leave BLCKSZ = 8192
and everything is as before. It basically only cleans up all of the
references to BLCKSZ in the code.
If this patch is applied, a comment in the config.h file though above
the BLCKSZ define with warning about monkeying around with it would
be a good idea.
Darren darrenk@insightdist.com
(Also cleans up some of the #includes in files referencing BLCKSZ.)
==========================================