There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:
1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
However, C holds it, so the assertion fails.
The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run the `repro_join_race.py` from the linked repo.
There are two main parts to this PR:
1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
The event is set by the runtime prior to the thread being cleared (in the same
place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
`PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
`_tstate_lock`s.
[^1]: 441affc9e7/Lib/threading.py (L1201)
[^2]: 441affc9e7/Lib/threading.py (L1115)
[^3]: 8194653279
---------
Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
On Windows, time.monotonic() now uses the QueryPerformanceCounter()
clock to have a resolution better than 1 us, instead of the
gGetTickCount64() clock which has a resolution of 15.6 ms.
There are now at least two bytecodes that may attempt to optimize,
JUMP_BACK, and more recently, COLD_EXIT.
Only the JUMP_BACK was counting the attempt in the stats.
This moves that counter to uop_optimize itself so it should
always happen no matter where it is called from.
This isn't strictly necessary because the implementation of `gc_should_collect`
already checks `gcstate->enabled` in the free-threaded build, but it seems
like a good idea until the common pieces of gc.c and gc_free_threading.c are
refactored out.
This moves `current_fast_clear()` up so that the current thread state is
`NULL` while running `tstate_delete_common()`.
This doesn't fix any bugs, but it means that we are more consistent that
`_PyThreadState_GET() != NULL` means that the thread is "attached".
Return 0 on success. Set an exception and return -1 on error.
Fix os.timerfd_settime(): properly report exceptions on
_PyTime_FromSecondsDouble() failure.
No longer export _PyTime_FromSecondsDouble().
In free-threaded builds, running with `PYTHON_GIL=0` will now disable the
GIL. Follow-up issues track work to re-enable the GIL when loading an
incompatible extension, and to disable the GIL by default.
In order to support re-enabling the GIL at runtime, all GIL-related data
structures are initialized as usual, and disabling the GIL simply sets a flag
that causes `take_gil()` and `drop_gil()` to return early.
In general, when `_PyThreadState_GET()` is non-NULL then the current
thread is "attached", but there is a small window during
`PyThreadState_DeleteCurrent()` where that's not true:
tstate_delete_common() is called when the thread is detached, but before
current_fast_clear().
Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
If a thread blocks while waiting on the `shared->mutex` lock, the array
of QSBR states may be reallocated. The `tstate->qsbr` values before the
lock is acquired may not be the same as the value after the lock is acquired.
This implements the delayed reuse of mimalloc pages that contain Python
objects in the free-threaded build.
Allocations of the same size class are grouped in data structures called
pages. These are different from operating system pages. For thread-safety, we
want to ensure that memory used to store PyObjects remains valid as long as
there may be concurrent lock-free readers; we want to delay using it for
other size classes, in other heaps, or returning it to the operating system.
When a mimalloc page becomes empty, instead of immediately freeing it, we tag
it with a QSBR goal and insert it into a per-thread state linked list of
pages to be freed. When mimalloc needs a fresh page, we process the queue and
free any still empty pages that are now deemed safe to be freed. Pages
waiting to be freed are still available for allocations of the same size
class and allocating from a page prevent it from being freed. There is
additional logic to handle abandoned pages when threads exit.
A previous commit introduced a bug to `interpreter_clear()`: it set
`interp->ceval.instrumentation_version` to 0, without making the corresponding
change to `tstate->eval_breaker` (which holds a thread-local copy of the
version). After this happens, Python code can still run due to object finalizers
during a GC, and the version check in bytecodes.c will see a different result
than the one in instrumentation.c causing an infinite loop.
The fix itself is straightforward: clear `tstate->eval_breaker` when clearing
`interp->ceval.instrumentation_version`.