In this case, the previous implementation counted extra opcodes to
cache, which made matching unstable under memoization.
This patch fixes the problem by not counting the opcodes to cache
inside the parentheses of `(...){0}`.
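For readers unfamiliar with the construct, here is a minimal Ruby sketch of what a `{0}`-quantified group does at the language level. It is not the reproducer from the bug report; the patterns and the group name `digit` are illustrative assumptions.

```ruby
# A group quantified with {0} is part of the pattern but is never
# executed at the place where it appears.
m = /(a){0}b/.match("b")
p m[0]   # the overall match
p m[1]   # the {0} group never participates, so this is nil

# A common use of the idiom: define a named subexpression once and
# invoke it elsewhere with \g<name>.
re = /(?<digit>[0-9]){0}\g<digit>+-\g<digit>+/
p re.match?("12-34")
```

Because the body of `(...){0}` is never run in place, counting its opcodes as cache points would presumably reserve cache slots that the matcher never visits there, which is the over-counting the message describes.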
The `MEMOIZE_LOOKAROUND_MATCH_CACHE_POINT` macro needs an argument;
otherwise we end up with:
```
../regexec.c:3955:2: error: called object type 'void' is not a function or function pointer
3955 | STACK_POS_END(stkp);
| ^~~~~~~~~~~~~~~~~~~
../regexec.c:1680:41: note: expanded from macro 'STACK_POS_END'
1680 | MEMOIZE_LOOKAROUND_MATCH_CACHE_POINT(k);\
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
../regexec.c:3969:7: error: called object type 'void' is not a function or function pointer
3969 | STACK_POP_TIL_POS_NOT;
| ^~~~~~~~~~~~~~~~~~~~~
../regexec.c:1616:41: note: expanded from macro 'STACK_POP_TIL_POS_NOT'
1616 | MEMOIZE_LOOKAROUND_MATCH_CACHE_POINT(stk);\
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
```
The macro definition with the match cache enabled already has the
correct argument. This one is for when the match cache is disabled (I
had disabled it while trying to learn more about how it works).
As of 10574857ce167869524b97ee862b610928f6272f, it's possible to crash
on a double free due to `stk_alloc` AKA `msa->stack_p` being freed
twice, once at the end of match_at and a second time in `FREE_MATCH_ARG`
in the parent caller.
Fixes [Bug #20886]
[Bug #20653]
This commit refactors how Onigmo handles timeouts. Instead of raising a
timeout error itself, onig_search returns ONIGERR_TIMEOUT, so the
caller can free memory and then raise the timeout error.
This fixes a memory leak in String#start_with? when the regexp times out.
For example:
    regex = Regexp.new("^#{"(a*)" * 10_000}x$", timeout: 0.000001)
    str = "a" * 1000000 + "x"
    10.times do
      100.times do
        str.start_with?(regex)
      rescue
      end
      puts `ps -o rss= -p #{$$}`
    end
Before:
    33216
    51936
    71152
    81728
    97152
    103248
    120384
    133392
    133520
    133616
After:
    14912
    15376
    15824
    15824
    16128
    16128
    16144
    16144
    16160
    16160
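From the Ruby side, the timeout still surfaces as a `Regexp::TimeoutError` that the caller can rescue. A minimal sketch follows, assuming a Ruby version that supports the `timeout:` keyword of `Regexp.new`; the helper name `match_or_timeout` is hypothetical.

```ruby
# Hypothetical helper: returns true/false for a completed match attempt,
# or :timeout when the regexp engine gives up.
def match_or_timeout(regex, str)
  str.start_with?(regex)
rescue Regexp::TimeoutError
  :timeout
end

# A cheap pattern with a generous limit completes normally.
fast = Regexp.new("^a+", timeout: 1.0)
p fast.timeout
p match_or_timeout(fast, "aaa")

# The expensive pattern from the example above, with a tiny limit.
slow = Regexp.new("^#{"(a*)" * 10_000}x$", timeout: 0.000001)
p match_or_timeout(slow, "a" * 1000000 + "x")
```

The last call is intended to exercise the timeout path; its result depends on the machine, so no output is claimed for it here.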
[Bug #20650]
The capture group allocates memory that is leaked when the match times out.
For example:
    re = Regexp.new("^#{"(a*)" * 10_000}x$", timeout: 0.000001)
    str = "a" * 1000000 + "x"
    10.times do
      100.times do
        re =~ str
      rescue Regexp::TimeoutError
      end
      puts `ps -o rss= -p #{$$}`
    end
Before:
    34688
    56416
    78288
    100368
    120784
    140704
    161904
    183568
    204320
    224800
After:
    16288
    16288
    16880
    16896
    16912
    16928
    16944
    17184
    17184
    17200
https://bugs.ruby-lang.org/issues/20228 started freeing `stk_base` to
avoid a memory leak. But `stk_base` is sometimes stack allocated (using
`xalloca`), so the free only works if the regex stack has grown enough
to hit `stack_double` (which uses `xmalloc` and `xrealloc`).
To reproduce the problem on master and 3.3.1:
```ruby
Regexp.timeout = 0.001
/^(a*)x$/ =~ "a" * 1000000 + "x"
```
Some details about this potential fix:
`stk_base == stk_alloc` on
[init](dde99215f2/regexec.c (L1153)),
so if `stk_base != stk_alloc` we can be sure we called
[`stack_double`](dde99215f2/regexec.c (L1210))
and it's safe to free. It's also safe to free if we've
[saved](dde99215f2/regexec.c (L1187-L1189))
the stack to `msa->stack_p`, since we do the `stk_base != stk_alloc`
check before saving.
This matches the check we do inside
[`stack_double`](dde99215f2/regexec.c (L1221)).
When matching against an incomplete character, some `enclen` calls are
expected not to exceed the limit, while others are expected to return the
required length, which is then checked against the limit.
Fix [Bug #20207]
Fix [Bug #20212]
The handling of consecutive lookarounds in init_cache_opcodes is buggy and
causes the invalid memory accesses reported in [Bug #20207] and [Bug #20212].
This fixes it by using recursive functions to detect lookaround
nesting correctly.
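To make the terminology concrete, here is a small Ruby sketch of consecutive and nested lookarounds. These patterns are illustrative assumptions and are not the reproducers from the bug reports.

```ruby
# Consecutive lookarounds: both assertions apply at the same position.
consecutive = /(?=\w)(?!\d)\w+/
p consecutive.match("1abc")[0]   # "abc": position 0 is rejected by (?!\d)

# Nested lookarounds: one assertion inside another.
nested = /(?=a(?=b))ab/
p nested.match?("ab")            # true
p nested.match?("ac")            # false: the inner (?=b) fails
```

Each lookaround opens and closes a region in the compiled opcode sequence, so a walk that tracks only one level of "inside a lookaround" can lose its place in shapes like these; a recursive walk handles the nesting naturally.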
Previously the following read and wrote 1 byte out-of-bounds:
    $ valgrind ruby -e 'p /(\W+)[bx]\?/i.match? "aaaaaa aaaaaaaaa aaaa aaaaaaaa aaa aaaaxaaaaaaaaaaa aaaaa aaaaaaaaaaaa a ? aaa aaaa a ?"' 2> >(grep Invalid -A 30)
Because of the `match_cache_point_index + 1` in
memoize_extended_match_cache_point() and
check_extended_match_cache_point(), we need one more byte of space.
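The off-by-one can be seen with a little arithmetic. The sketch below assumes the cache stores one bit per cache point packed into bytes, so the byte index of a point is `index >> 3`; the helper name `bytes_for` and the count of 16 points are illustrative assumptions.

```ruby
# Hypothetical helper: bytes needed to hold `bits` one-bit flags.
def bytes_for(bits)
  (bits + 7) / 8
end

points     = 16             # exactly two full bytes of cache points
last_index = points - 1     # highest regular index: 15
touched    = last_index + 1 # the extended functions also touch index + 1

p bytes_for(points)   # 2 bytes allocated
p last_index >> 3     # 1: the last regular bit lives in byte 1
p touched >> 3        # 2: index + 1 lands in byte 2, a third byte
```

Under that assumption, a buffer sized for exactly `points` bits is one byte short whenever `points` is a multiple of 8, which is consistent with the 1-byte out-of-bounds access reported by valgrind.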
rb_reg_onig_match performs preparation, error handling, and cleanup for
matching a regex against a string. This reduces repetitive code and
removes the need for StringScanner to access internal data of regex.
According to the C99 specification section 7.20.3.2 paragraph 2:
> If ptr is a null pointer, no action occurs.
So we do not need to check whether the pointer is null before calling `free`.
* Refactor Regexp#match cache implementation
  Improved variable and function names
  Fixed [Bug #19537] (Maybe fixed in https://github.com/ruby/ruby/pull/7694)
* Add a glossary comment for "match cache"
* Skip resetting the match cache on null check when there is no cache point
On platforms where unaligned word access is not allowed, and if
`sizeof(val)` and `sizeof(type)` differ:
- `sizeof(val)` > `sizeof(type)`: `val` will contain garbage.
- `sizeof(val)` < `sizeof(type)`: memory outside `val` will be clobbered.
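The two failure modes can be modelled at the byte level. The sketch below uses Ruby's `pack`/`unpack` instead of C, and the buffer layout and values are illustrative assumptions, not the code touched by this change.

```ruby
# Case 1: the destination is wider than the copied type.
# Only 4 of its 8 bytes are written; the rest keep stale data.
dest = "\xEE".b * 8                  # 8-byte destination full of stale bytes
dest[0, 4] = [1].pack("L<")          # copy only 4 bytes (the narrower type)
garbage = dest.unpack1("Q<")         # read back as a 64-bit value
p garbage == 1                       # false: the upper bytes are leftovers

# Case 2: the destination is narrower than the copied type.
slots = [0x11111111, 0x22222222].pack("L<L<")   # two adjacent 4-byte slots
slots[0, 8] = [0xAAAAAAAA_BBBBBBBB].pack("Q<")  # 8-byte store aimed at slot 0
first, second = slots.unpack("L<L<")
p second == 0x22222222               # false: the neighbouring slot was overwritten
```

In C the same effect would come from copying `sizeof(type)` bytes to or from an object whose size is `sizeof(val)`, which is why the two sizes have to agree.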