Vendor things
This commit is contained in:
parent
5deceec006
commit
977e3c17e5
19434 changed files with 10682014 additions and 0 deletions
55
third-party/vendor/regex-automata/src/nfa/mod.rs
vendored
Normal file
@@ -0,0 +1,55 @@
/*!
Provides non-deterministic finite automata (NFA) and regex engines that use
them.

While NFAs and DFAs (deterministic finite automata) have equivalent *theoretical*
power, their usage in practice tends to result in different engineering
trade-offs. While this isn't meant to be a comprehensive treatment of the topic,
here are a few key trade-offs that are, at minimum, true for this crate:

* NFAs tend to be represented sparsely whereas DFAs are represented densely.
Sparse representations use less memory, but are slower to traverse. Conversely,
dense representations use more memory, but are faster to traverse. (Sometimes
these lines are blurred. For example, an `NFA` might choose to represent a
particular state in a dense fashion, and a DFA can be built using a sparse
representation via [`sparse::DFA`](crate::dfa::sparse::DFA).)
* NFAs have epsilon transitions and DFAs don't. In practice, this means that
handling a single byte in a haystack with an NFA at search time may require
visiting multiple NFA states. In a DFA, each byte only requires visiting
a single state. Stated differently, NFAs require a variable number of CPU
instructions to process one byte in a haystack whereas a DFA uses a constant
number of CPU instructions to process one byte.
* NFAs are generally easier to amend with secondary storage. For example, the
[`thompson::pikevm::PikeVM`] uses an NFA to match, but also uses additional
memory beyond the model of a finite state machine to track offsets for matching
capturing groups. Conversely, the most a DFA can do is report the offset (and
pattern ID) at which a match occurred. This is generally why we also compile
DFAs in reverse, so that we can run them after finding the end of a match to
also find the start of a match.
* NFAs take worst case linear time to build, but DFAs take worst case
exponential time to build. The [hybrid NFA/DFA](crate::hybrid) mitigates this
challenge for DFAs in many practical cases.

There are likely other differences, but the bottom line is that NFAs tend to be
more memory efficient and give easier opportunities for increasing expressive
power, whereas DFAs are faster to search with.

# Why only a Thompson NFA?

Currently, the only kind of NFA we support in this crate is a [Thompson
NFA](https://en.wikipedia.org/wiki/Thompson%27s_construction). This refers
to a specific construction algorithm that takes the syntax of a regex
pattern and converts it to an NFA. Specifically, it makes gratuitous use of
epsilon transitions in order to keep its structure simple. In exchange, its
construction time is linear in the size of the regex. A Thompson NFA also makes
the guarantee that, given any state and a character in a haystack, there is at
most one transition defined for it. (Although there may be many epsilon
transitions.)

It is possible that other types of NFAs will be added in the future, such as a
[Glushkov NFA](https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm).
But currently, this crate only provides a Thompson NFA.
*/

#[cfg(feature = "nfa-thompson")]
pub mod thompson;
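The epsilon-transition trade-off described above can be illustrated with a small, self-contained simulation. This is a toy sketch with hand-rolled types, not this crate's API: to consume one byte, the simulation follows byte transitions from every current state and then expands the result to its epsilon closure, so a single haystack byte can require visiting several NFA states.

```rust
// Toy NFA simulation showing why one haystack byte may visit many states.
#[derive(Clone, Copy)]
enum Trans {
    Eps(usize),      // epsilon transition to another state
    Byte(u8, usize), // consume one byte, go to another state
}

struct Nfa {
    states: Vec<Vec<Trans>>,
}

impl Nfa {
    /// Expand `set` to include every state reachable via epsilon transitions.
    fn eps_closure(&self, set: &mut Vec<usize>) {
        let mut stack: Vec<usize> = set.clone();
        while let Some(s) = stack.pop() {
            for &t in &self.states[s] {
                if let Trans::Eps(next) = t {
                    if !set.contains(&next) {
                        set.push(next);
                        stack.push(next);
                    }
                }
            }
        }
    }

    /// One simulation step: follow byte transitions, then epsilon-close.
    fn step(&self, set: &[usize], byte: u8) -> Vec<usize> {
        let mut next = vec![];
        for &s in set {
            for &t in &self.states[s] {
                if let Trans::Byte(b, n) = t {
                    if b == byte && !next.contains(&n) {
                        next.push(n);
                    }
                }
            }
        }
        self.eps_closure(&mut next);
        next
    }
}

fn main() {
    // A Thompson-style NFA for 'a(b|c)': 0 --a--> 1, 1 --eps--> 2,
    // 1 --eps--> 3, 2 --b--> 4, 3 --c--> 4.
    let nfa = Nfa {
        states: vec![
            vec![Trans::Byte(b'a', 1)],
            vec![Trans::Eps(2), Trans::Eps(3)],
            vec![Trans::Byte(b'b', 4)],
            vec![Trans::Byte(b'c', 4)],
            vec![],
        ],
    };
    let mut start = vec![0];
    nfa.eps_closure(&mut start);
    // After the single byte 'a', the closure holds states 1, 2 and 3.
    let after_a = nfa.step(&start, b'a');
    assert_eq!(after_a.len(), 3);
    assert!(nfa.step(&after_a, b'b').contains(&4));
    println!("states after 'a': {:?}", after_a);
}
```

Running this on the NFA for 'a(b|c)' grows the state set to three states after one byte, which is exactly the per-byte overhead a DFA avoids by precomputing such closures.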
1908
third-party/vendor/regex-automata/src/nfa/thompson/backtrack.rs
vendored
Normal file
File diff suppressed because it is too large
1337
third-party/vendor/regex-automata/src/nfa/thompson/builder.rs
vendored
Normal file
File diff suppressed because it is too large
2346
third-party/vendor/regex-automata/src/nfa/thompson/compiler.rs
vendored
Normal file
File diff suppressed because it is too large
185
third-party/vendor/regex-automata/src/nfa/thompson/error.rs
vendored
Normal file
@@ -0,0 +1,185 @@
use crate::util::{
    captures, look,
    primitives::{PatternID, StateID},
};

/// An error that can occur during the construction of a Thompson NFA.
///
/// This error does not provide many introspection capabilities. There are
/// generally only two things you can do with it:
///
/// * Obtain a human-readable message via its `std::fmt::Display` impl.
/// * Access an underlying [`regex_syntax::Error`] type from its `source`
/// method via the `std::error::Error` trait. This error only occurs when using
/// convenience routines for building an NFA directly from a pattern string.
///
/// Otherwise, errors typically occur when a limit has been breached. For
/// example, if the total heap usage of the compiled NFA exceeds the limit
/// set by [`Config::nfa_size_limit`](crate::nfa::thompson::Config), then
/// building the NFA will fail.
#[derive(Clone, Debug)]
pub struct BuildError {
    kind: BuildErrorKind,
}

/// The kind of error that occurred during the construction of a Thompson NFA.
#[derive(Clone, Debug)]
enum BuildErrorKind {
    /// An error that occurred while parsing a regular expression. Note that
    /// this error may be printed over multiple lines, and is generally
    /// intended to be end user readable on its own.
    #[cfg(feature = "syntax")]
    Syntax(regex_syntax::Error),
    /// An error that occurs if the capturing groups provided to an NFA builder
    /// do not satisfy the documented invariants. For example, things like
    /// too many groups, missing groups, having the first (zeroth) group be
    /// named or duplicate group names within the same pattern.
    Captures(captures::GroupInfoError),
    /// An error that occurs when an NFA contains a Unicode word boundary, but
    /// where the crate was compiled without the necessary data for dealing
    /// with Unicode word boundaries.
    Word(look::UnicodeWordBoundaryError),
    /// An error that occurs if too many patterns were given to the NFA
    /// compiler.
    TooManyPatterns {
        /// The number of patterns given, which exceeds the limit.
        given: usize,
        /// The limit on the number of patterns.
        limit: usize,
    },
    /// An error that occurs if too many states are produced while building an
    /// NFA.
    TooManyStates {
        /// The minimum number of states that are desired, which exceeds the
        /// limit.
        given: usize,
        /// The limit on the number of states.
        limit: usize,
    },
    /// An error that occurs when NFA compilation exceeds a configured heap
    /// limit.
    ExceededSizeLimit {
        /// The configured limit, in bytes.
        limit: usize,
    },
    /// An error that occurs when an invalid capture group index is added to
    /// the NFA. An "invalid" index can be one that would otherwise overflow
    /// a `usize` on the current target.
    InvalidCaptureIndex {
        /// The invalid index that was given.
        index: u32,
    },
    /// An error that occurs when one tries to build a reverse NFA with
    /// captures enabled. Currently, this isn't supported, but we probably
    /// should support it at some point.
    #[cfg(feature = "syntax")]
    UnsupportedCaptures,
}

impl BuildError {
    /// If this error occurred because the NFA exceeded the configured size
    /// limit before being built, then this returns the configured size limit.
    ///
    /// The limit returned is what was configured, and corresponds to the
    /// maximum amount of heap usage in bytes.
    pub fn size_limit(&self) -> Option<usize> {
        match self.kind {
            BuildErrorKind::ExceededSizeLimit { limit } => Some(limit),
            _ => None,
        }
    }

    fn kind(&self) -> &BuildErrorKind {
        &self.kind
    }

    #[cfg(feature = "syntax")]
    pub(crate) fn syntax(err: regex_syntax::Error) -> BuildError {
        BuildError { kind: BuildErrorKind::Syntax(err) }
    }

    pub(crate) fn captures(err: captures::GroupInfoError) -> BuildError {
        BuildError { kind: BuildErrorKind::Captures(err) }
    }

    pub(crate) fn word(err: look::UnicodeWordBoundaryError) -> BuildError {
        BuildError { kind: BuildErrorKind::Word(err) }
    }

    pub(crate) fn too_many_patterns(given: usize) -> BuildError {
        let limit = PatternID::LIMIT;
        BuildError { kind: BuildErrorKind::TooManyPatterns { given, limit } }
    }

    pub(crate) fn too_many_states(given: usize) -> BuildError {
        let limit = StateID::LIMIT;
        BuildError { kind: BuildErrorKind::TooManyStates { given, limit } }
    }

    pub(crate) fn exceeded_size_limit(limit: usize) -> BuildError {
        BuildError { kind: BuildErrorKind::ExceededSizeLimit { limit } }
    }

    pub(crate) fn invalid_capture_index(index: u32) -> BuildError {
        BuildError { kind: BuildErrorKind::InvalidCaptureIndex { index } }
    }

    #[cfg(feature = "syntax")]
    pub(crate) fn unsupported_captures() -> BuildError {
        BuildError { kind: BuildErrorKind::UnsupportedCaptures }
    }
}

#[cfg(feature = "std")]
impl std::error::Error for BuildError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self.kind() {
            #[cfg(feature = "syntax")]
            BuildErrorKind::Syntax(ref err) => Some(err),
            BuildErrorKind::Captures(ref err) => Some(err),
            _ => None,
        }
    }
}

impl core::fmt::Display for BuildError {
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        match self.kind() {
            #[cfg(feature = "syntax")]
            BuildErrorKind::Syntax(_) => write!(f, "error parsing regex"),
            BuildErrorKind::Captures(_) => {
                write!(f, "error with capture groups")
            }
            BuildErrorKind::Word(_) => {
                write!(f, "NFA contains Unicode word boundary")
            }
            BuildErrorKind::TooManyPatterns { given, limit } => write!(
                f,
                "attempted to compile {} patterns, \
                 which exceeds the limit of {}",
                given, limit,
            ),
            BuildErrorKind::TooManyStates { given, limit } => write!(
                f,
                "attempted to compile {} NFA states, \
                 which exceeds the limit of {}",
                given, limit,
            ),
            BuildErrorKind::ExceededSizeLimit { limit } => write!(
                f,
                "heap usage during NFA compilation exceeded limit of {}",
                limit,
            ),
            BuildErrorKind::InvalidCaptureIndex { index } => write!(
                f,
                "capture group index {} is invalid (too big or discontinuous)",
                index,
            ),
            #[cfg(feature = "syntax")]
            BuildErrorKind::UnsupportedCaptures => write!(
                f,
                "currently captures must be disabled when compiling \
                 a reverse NFA",
            ),
        }
    }
}
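The error type above follows a common Rust pattern: a public opaque struct wrapping a private kind enum, with targeted accessors like `size_limit()` for the details callers can act on. Here is a condensed, self-contained sketch of that pattern; the names mirror the vendored source, but this is an illustration, not the crate's code (it drops the `cfg`-gated variants and crate-internal types).

```rust
// Opaque error struct wrapping a private kind enum, as in BuildError.
use std::fmt;

#[derive(Clone, Debug)]
pub struct BuildError {
    kind: BuildErrorKind,
}

#[derive(Clone, Debug)]
enum BuildErrorKind {
    ExceededSizeLimit { limit: usize },
    TooManyStates { given: usize, limit: usize },
}

impl BuildError {
    pub(crate) fn exceeded_size_limit(limit: usize) -> BuildError {
        BuildError { kind: BuildErrorKind::ExceededSizeLimit { limit } }
    }

    pub(crate) fn too_many_states(given: usize, limit: usize) -> BuildError {
        BuildError { kind: BuildErrorKind::TooManyStates { given, limit } }
    }

    /// Returns the configured size limit only for the relevant error kind.
    pub fn size_limit(&self) -> Option<usize> {
        match self.kind {
            BuildErrorKind::ExceededSizeLimit { limit } => Some(limit),
            _ => None,
        }
    }
}

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.kind {
            BuildErrorKind::ExceededSizeLimit { limit } => {
                write!(f, "heap usage exceeded limit of {}", limit)
            }
            BuildErrorKind::TooManyStates { given, limit } => {
                write!(f, "{} states exceeds limit of {}", given, limit)
            }
        }
    }
}

impl std::error::Error for BuildError {}

fn main() {
    let err = BuildError::exceeded_size_limit(1 << 20);
    assert_eq!(err.size_limit(), Some(1 << 20));
    assert_eq!(BuildError::too_many_states(10, 5).size_limit(), None);
    println!("{}", err);
}
```

Because the kind enum stays private, new error kinds can be added later without a breaking API change, which is presumably why the vendored code structures `BuildError` this way.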
528
third-party/vendor/regex-automata/src/nfa/thompson/literal_trie.rs
vendored
Normal file
@@ -0,0 +1,528 @@
use core::mem;

use alloc::{vec, vec::Vec};

use crate::{
    nfa::thompson::{self, compiler::ThompsonRef, BuildError, Builder},
    util::primitives::{IteratorIndexExt, StateID},
};

/// A trie that preserves leftmost-first match semantics.
///
/// This is a purpose-built data structure for optimizing 'lit1|lit2|..|litN'
/// patterns. It can *only* handle alternations of literals, which makes it
/// somewhat restricted in its scope, but literal alternations are fairly
/// common.
///
/// At a 5,000 foot level, the main idea of this trie is to make an alternation
/// of literals look more like a DFA than an NFA via epsilon removal.
///
/// More precisely, the main issue is in how alternations are compiled into
/// a Thompson NFA. Namely, each alternation gets a single NFA "union" state
/// with an epsilon transition for every branch of the alternation pointing to
/// an NFA state corresponding to the start of that branch. The main problem
/// with this representation is the cost of computing an epsilon closure. Once
/// you hit the alternation's start state, it acts as a sort of "clog" that
/// requires you to traverse all of the epsilon transitions to compute the full
/// closure.
///
/// Fixing such clogs in the general case is pretty tricky without going
/// to a DFA (or perhaps a Glushkov NFA, but that comes with other problems).
/// But at least in the case of an alternation of literals, we can convert
/// that to a prefix trie without too much cost. In theory, that's all you
/// really need to do: build the trie and then compile it to a Thompson NFA.
/// For example, if you have the pattern 'bar|baz|foo', then using a trie, it
/// is transformed to something like 'b(a(r|z))|f'. This reduces the clog by
/// reducing the number of epsilon transitions out of the alternation's start
/// state from 3 to 2 (it actually gets down to 1 when you use a sparse state,
/// which we do below). It's a small effect here, but when your alternation is
/// huge, the savings is also huge.
///
/// And that is... essentially what a LiteralTrie does. But there is one
/// hiccup. Consider a regex like 'sam|samwise'. How does a prefix trie compile
/// that when leftmost-first semantics are used? If 'sam|samwise' was the
/// entire regex, then you could just drop the 'samwise' branch entirely since
/// it is impossible to match ('sam' will always take priority, and since it
/// is a prefix of 'samwise', 'samwise' will never match). But what about the
/// regex '\b(sam|samwise)\b'? In that case, you can't remove 'samwise' because
/// it might match when 'sam' doesn't fall on a word boundary.
///
/// The main idea is that 'sam|samwise' can be translated to 'sam(?:|wise)',
/// which is a precisely equivalent regex that also gets rid of the clog.
///
/// Another example is 'zapper|z|zap'. That gets translated to
/// 'z(?:apper||ap)'.
///
/// We accomplish this by giving each state in the trie multiple "chunks" of
/// transitions. Each chunk barrier represents a match. The idea is that once
/// you know a match occurs, none of the transitions after the match can be
/// re-ordered and mixed in with the transitions before the match. Otherwise,
/// the match semantics could be changed.
///
/// See the 'State' data type for a bit more detail.
///
/// Future work:
///
/// * In theory, it would be nice to generalize the idea of removing clogs and
/// apply it to the NFA graph itself. Then this could in theory work for
/// case insensitive alternations of literals, or even just alternations where
/// each branch starts with a non-epsilon transition.
/// * Could we instead use the Aho-Corasick algorithm here? The aho-corasick
/// crate deals with leftmost-first matches correctly, but I think this implies
/// encoding failure transitions into a Thompson NFA somehow. Which seems fine,
/// because failure transitions are just unconditional epsilon transitions?
/// * Or perhaps even better, could we use an aho_corasick::AhoCorasick
/// directly? At time of writing, 0.7 is the current version of the
/// aho-corasick crate, and that definitely cannot be used as-is. But if we
/// expose the underlying finite state machine API, then could we use it? That
/// would be super. If we could figure that out, it might also lend itself to
/// more general composition of finite state machines.
#[derive(Clone)]
pub(crate) struct LiteralTrie {
    /// The set of trie states. Each state contains one or more chunks, where
    /// each chunk is a sparse set of transitions to other states. A leaf state
    /// is always a match state that contains only empty chunks (i.e., no
    /// transitions).
    states: Vec<State>,
    /// Whether to add literals in reverse to the trie. Useful when building
    /// a reverse NFA automaton.
    rev: bool,
}
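// The chunk mechanism described above can be sketched in miniature with
// standalone Rust. The toy below (hypothetical names, not this crate's
// implementation) inserts literals, freezes a chunk at every match, and
// renders each chunk boundary as an empty alternation branch, reproducing
// the 'zapper|z|zap' -> 'z(?:apper||ap)' translation from the doc comment.

```rust
// Minimal leftmost-first literal trie with match-frozen chunks.
#[derive(Default)]
struct ToyState {
    transitions: Vec<(u8, usize)>, // sparse byte transitions
    chunks: Vec<usize>,            // end offsets of frozen chunks
}

struct ToyTrie {
    states: Vec<ToyState>,
}

impl ToyTrie {
    fn new() -> ToyTrie {
        ToyTrie { states: vec![ToyState::default()] }
    }

    fn add(&mut self, bytes: &[u8]) {
        let mut cur = 0;
        for &b in bytes {
            // Only the active (unfrozen) chunk may be searched or extended.
            let start = self.states[cur].chunks.last().copied().unwrap_or(0);
            let found = self.states[cur].transitions[start..]
                .iter()
                .position(|&(tb, _)| tb == b);
            cur = match found {
                Some(i) => self.states[cur].transitions[start + i].1,
                None => {
                    let next = self.states.len();
                    self.states.push(ToyState::default());
                    self.states[cur].transitions.push((b, next));
                    next
                }
            };
        }
        // Record the match by freezing everything added so far into a chunk.
        let end = self.states[cur].transitions.len();
        self.states[cur].chunks.push(end);
    }

    // Render a state as a regex-like string; every chunk boundary becomes
    // an empty branch, preserving leftmost-first priority.
    fn render(&self, sid: usize) -> String {
        let st = &self.states[sid];
        if st.transitions.is_empty() {
            return String::new();
        }
        let mut branches: Vec<String> = vec![];
        let mut start = 0;
        for &end in &st.chunks {
            for &(b, next) in &st.transitions[start..end] {
                branches.push(format!("{}{}", b as char, self.render(next)));
            }
            branches.push(String::new()); // a match occurred here
            start = end;
        }
        for &(b, next) in &st.transitions[start..] {
            branches.push(format!("{}{}", b as char, self.render(next)));
        }
        if branches.len() == 1 {
            branches.pop().unwrap()
        } else {
            format!("(?:{})", branches.join("|"))
        }
    }
}

fn main() {
    let mut trie = ToyTrie::new();
    trie.add(b"zapper");
    trie.add(b"z");
    trie.add(b"zap");
    assert_eq!(trie.render(0), "z(?:apper||ap)");

    let mut trie = ToyTrie::new();
    trie.add(b"sam");
    trie.add(b"samwise");
    assert_eq!(trie.render(0), "sam(?:|wise)");
    println!("ok");
}
```

// Note how 'sam' followed by 'samwise' yields 'sam(?:|wise)': the empty
// branch records the 'sam' match ahead of the frozen chunk's successor
// transitions, so the earlier literal keeps its priority.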
impl LiteralTrie {
    /// Create a new literal trie that adds literals in the forward direction.
    pub(crate) fn forward() -> LiteralTrie {
        let root = State::default();
        LiteralTrie { states: vec![root], rev: false }
    }

    /// Create a new literal trie that adds literals in reverse.
    pub(crate) fn reverse() -> LiteralTrie {
        let root = State::default();
        LiteralTrie { states: vec![root], rev: true }
    }

    /// Add the given literal to this trie.
    ///
    /// If the literal could not be added because the `StateID` space was
    /// exhausted, then an error is returned. If an error is returned, the
    /// trie is in an unspecified state.
    pub(crate) fn add(&mut self, bytes: &[u8]) -> Result<(), BuildError> {
        let mut prev = StateID::ZERO;
        let mut it = bytes.iter().copied();
        while let Some(b) = if self.rev { it.next_back() } else { it.next() } {
            prev = self.get_or_add_state(prev, b)?;
        }
        self.states[prev].add_match();
        Ok(())
    }

    /// If the given transition is defined, then return the next state ID.
    /// Otherwise, add the transition to `from` and point it to a new state.
    ///
    /// If a new state ID could not be allocated, then an error is returned.
    fn get_or_add_state(
        &mut self,
        from: StateID,
        byte: u8,
    ) -> Result<StateID, BuildError> {
        let active = self.states[from].active_chunk();
        match active.binary_search_by_key(&byte, |t| t.byte) {
            Ok(i) => Ok(active[i].next),
            Err(i) => {
                // Add a new state and get its ID.
                let next = StateID::new(self.states.len()).map_err(|_| {
                    BuildError::too_many_states(self.states.len())
                })?;
                self.states.push(State::default());
                // Offset our position to account for all transitions and not
                // just the ones in the active chunk.
                let i = self.states[from].active_chunk_start() + i;
                let t = Transition { byte, next };
                self.states[from].transitions.insert(i, t);
                Ok(next)
            }
        }
    }

    /// Compile this literal trie to the NFA builder given.
    ///
    /// This forwards any errors that may occur while using the given builder.
    pub(crate) fn compile(
        &self,
        builder: &mut Builder,
    ) -> Result<ThompsonRef, BuildError> {
        // Compilation proceeds via depth-first traversal of the trie.
        //
        // This is overall pretty brutal. The recursive version of this is
        // deliciously simple. (See 'compile_to_hir' below for what it might
        // look like.) But recursion on a trie means your call stack grows
        // in accordance with the longest literal, which just does not seem
        // appropriate. So we push the call stack to the heap. But as a result,
        // the trie traversal becomes pretty brutal because we essentially
        // have to encode the state of a double for-loop into an explicit call
        // frame. If someone can simplify this without using recursion, that'd
        // be great.

        // 'end' is our match state for this trie, but represented in the
        // NFA. Any time we see a match in the trie, we insert a transition
        // from the current state we're in to 'end'.
        let end = builder.add_empty()?;
        let mut stack = vec![];
        let mut f = Frame::new(&self.states[StateID::ZERO]);
        loop {
            if let Some(t) = f.transitions.next() {
                if self.states[t.next].is_leaf() {
                    f.sparse.push(thompson::Transition {
                        start: t.byte,
                        end: t.byte,
                        next: end,
                    });
                } else {
                    f.sparse.push(thompson::Transition {
                        start: t.byte,
                        end: t.byte,
                        // This is a little funny, but when the frame we create
                        // below completes, it will pop this parent frame off
                        // and modify this transition to point to the correct
                        // state.
                        next: StateID::ZERO,
                    });
                    stack.push(f);
                    f = Frame::new(&self.states[t.next]);
                }
                continue;
            }
            // At this point, we have visited all transitions in f.chunk, so
            // add it as a sparse NFA state. Unless the chunk was empty, in
            // which case, we don't do anything.
            if !f.sparse.is_empty() {
                let chunk_id = if f.sparse.len() == 1 {
                    builder.add_range(f.sparse.pop().unwrap())?
                } else {
                    let sparse = mem::replace(&mut f.sparse, vec![]);
                    builder.add_sparse(sparse)?
                };
                f.union.push(chunk_id);
            }
            // Now we need to look to see if there are other chunks to visit.
            if let Some(chunk) = f.chunks.next() {
                // If we're here, it means we're on the second (or greater)
                // chunk, which implies there is a match at this point. So
                // connect this state to the final end state.
                f.union.push(end);
                // Advance to the next chunk.
                f.transitions = chunk.iter();
                continue;
            }
            // Now that we are out of chunks, we have completely visited
            // this state. So turn our union of chunks into an NFA union
            // state, and add that union state to the parent state's current
            // sparse state. (If there is no parent, we're done.)
            let start = builder.add_union(f.union)?;
            match stack.pop() {
                None => {
                    return Ok(ThompsonRef { start, end });
                }
                Some(mut parent) => {
                    // OK because the only way a frame gets pushed on to the
                    // stack (aside from the root) is when a transition has
                    // been added to 'sparse'.
                    parent.sparse.last_mut().unwrap().next = start;
                    f = parent;
                }
            }
        }
    }

    /// Converts this trie to an equivalent HIR expression.
    ///
    /// We don't actually use this, but it's useful for tests. In particular,
    /// it provides a (somewhat) human readable representation of the trie
    /// itself.
    #[cfg(test)]
    fn compile_to_hir(&self) -> regex_syntax::hir::Hir {
        self.compile_state_to_hir(StateID::ZERO)
    }

    /// The recursive implementation of 'compile_to_hir'.
    ///
    /// Notice how simple this is compared to 'compile' above. 'compile' could
    /// be similarly simple, but we opt to not use recursion in order to avoid
    /// overflowing the stack in the case of a longer literal.
    #[cfg(test)]
    fn compile_state_to_hir(&self, sid: StateID) -> regex_syntax::hir::Hir {
        use regex_syntax::hir::Hir;

        let mut alt = vec![];
        for (i, chunk) in self.states[sid].chunks().enumerate() {
            if i > 0 {
                alt.push(Hir::empty());
            }
            if chunk.is_empty() {
                continue;
            }
            let mut chunk_alt = vec![];
            for t in chunk.iter() {
                chunk_alt.push(Hir::concat(vec![
                    Hir::literal(vec![t.byte]),
                    self.compile_state_to_hir(t.next),
                ]));
            }
            alt.push(Hir::alternation(chunk_alt));
        }
        Hir::alternation(alt)
    }
}

impl core::fmt::Debug for LiteralTrie {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        writeln!(f, "LiteralTrie(")?;
        for (sid, state) in self.states.iter().with_state_ids() {
            writeln!(f, "{:06?}: {:?}", sid.as_usize(), state)?;
        }
        writeln!(f, ")")?;
        Ok(())
    }
}

/// An explicit stack frame used for traversing the trie without using
/// recursion.
///
/// Each frame is tied to the traversal of a single trie state. The frame is
/// dropped once the entire state (and all of its children) have been visited.
/// The "output" of compiling a state is the 'union' vector, which is in turn
/// converted to an NFA union state. Each branch of the union corresponds to a
/// chunk in the trie state.
///
/// 'sparse' corresponds to the set of transitions for a particular chunk in a
/// trie state. It is ultimately converted to an NFA sparse state. The 'sparse'
/// field, after being converted to a sparse NFA state, is reused for any
/// subsequent chunks in the trie state, if any exist.
#[derive(Debug)]
struct Frame<'a> {
    /// The remaining chunks to visit for a trie state.
    chunks: StateChunksIter<'a>,
    /// The transitions of the current chunk that we're iterating over. Since
    /// every trie state has at least one chunk, every frame is initialized
    /// with the first chunk's transitions ready to be consumed.
    transitions: core::slice::Iter<'a, Transition>,
    /// The NFA state IDs pointing to the start of each chunk compiled by
    /// this trie state. This ultimately gets converted to an NFA union once
    /// the entire trie state (and all of its children) have been compiled.
    /// The order of these matters for leftmost-first match semantics, since
    /// earlier matches in the union are preferred over later ones.
    union: Vec<StateID>,
    /// The actual NFA transitions for a single chunk in a trie state. This
    /// gets converted to an NFA sparse state, and its corresponding NFA state
    /// ID should get added to 'union'.
    sparse: Vec<thompson::Transition>,
}

impl<'a> Frame<'a> {
    /// Create a new stack frame for trie traversal. This initializes the
    /// 'transitions' iterator to the transitions for the first chunk, with the
    /// 'chunks' iterator being every chunk after the first one.
    fn new(state: &'a State) -> Frame<'a> {
        let mut chunks = state.chunks();
        // every state has at least 1 chunk
        let chunk = chunks.next().unwrap();
        let transitions = chunk.iter();
        Frame { chunks, transitions, union: vec![], sparse: vec![] }
    }
}

/// A state in a trie.
///
/// This uses a sparse representation. Since we don't use literal tries
/// for searching, and compilation requires visiting every transition anyway,
/// a sparse representation for transitions works well. This means we save on
/// memory, at the expense of 'LiteralTrie::add' being perhaps a bit slower.
///
/// While 'transitions' is pretty standard as far as tries go, the 'chunks'
/// piece here is more unusual. In effect, 'chunks' defines a partitioning
/// of 'transitions', where each chunk corresponds to a distinct set of
/// transitions. The key invariant is that a transition in one chunk cannot
/// be moved to another chunk. This is the secret sauce that preserves
/// leftmost-first match semantics.
///
/// A new chunk is added whenever we mark a state as a match state. Once a
/// new chunk is added, the old active chunk is frozen and is never mutated
/// again. The new chunk becomes the active chunk, which is defined as
/// '&transitions[chunks.last().map_or(0, |c| c.1)..]'. Thus, a state where
/// 'chunks' is empty actually contains one chunk. Thus, every state contains
/// at least one (possibly empty) chunk.
///
/// A "leaf" state is a state that has no outgoing transitions (so
/// 'transitions' is empty). Note that there is no way for a leaf state to be a
/// non-matching state. (Although while building the trie, within 'add', a leaf
/// state may exist while not containing any matches. But this invariant is
/// only broken within 'add'. Once 'add' returns, the invariant is upheld.)
#[derive(Clone, Default)]
struct State {
    transitions: Vec<Transition>,
    chunks: Vec<(usize, usize)>,
}

impl State {
    /// Mark this state as a match state and freeze the active chunk such that
    /// it cannot be further mutated.
    fn add_match(&mut self) {
        // This is not strictly necessary, but there's no point in recording
        // another match by adding another chunk if the state has no
        // transitions. Note though that we only skip this if we already know
        // this is a match state, which is only true if 'chunks' is not empty.
        // Basically, if we didn't do this, nothing semantically would change,
        // but we'd end up pushing another chunk and potentially triggering an
        // alloc.
        if self.transitions.is_empty() && !self.chunks.is_empty() {
            return;
        }
        let chunk_start = self.active_chunk_start();
        let chunk_end = self.transitions.len();
        self.chunks.push((chunk_start, chunk_end));
    }

    /// Returns true if and only if this state is a leaf state. That is, a
    /// state that has no outgoing transitions.
    fn is_leaf(&self) -> bool {
        self.transitions.is_empty()
    }

    /// Returns an iterator over all of the chunks (including the currently
    /// active chunk) in this state. Since the active chunk is included, the
    /// iterator is guaranteed to always yield at least one chunk (although the
    /// chunk may be empty).
    fn chunks(&self) -> StateChunksIter<'_> {
        StateChunksIter {
            transitions: &*self.transitions,
            chunks: self.chunks.iter(),
            active: Some(self.active_chunk()),
        }
    }

    /// Returns the active chunk as a slice of transitions.
    fn active_chunk(&self) -> &[Transition] {
        let start = self.active_chunk_start();
        &self.transitions[start..]
    }

    /// Returns the index into 'transitions' where the active chunk starts.
    fn active_chunk_start(&self) -> usize {
        self.chunks.last().map_or(0, |&(_, end)| end)
    }
}

impl core::fmt::Debug for State {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        let mut spacing = " ";
        for (i, chunk) in self.chunks().enumerate() {
            if i > 0 {
                write!(f, "{}MATCH", spacing)?;
            }
            spacing = "";
            for (j, t) in chunk.iter().enumerate() {
                spacing = " ";
                if j == 0 && i > 0 {
                    write!(f, " ")?;
                } else if j > 0 {
                    write!(f, ", ")?;
                }
                write!(f, "{:?}", t)?;
            }
        }
        Ok(())
    }
}

/// An iterator over all of the chunks in a state, including the active chunk.
///
/// This iterator is created by `State::chunks`. We name this iterator so that
/// we can include it in the `Frame` type for non-recursive trie traversal.
#[derive(Debug)]
struct StateChunksIter<'a> {
    transitions: &'a [Transition],
    chunks: core::slice::Iter<'a, (usize, usize)>,
    active: Option<&'a [Transition]>,
}

impl<'a> Iterator for StateChunksIter<'a> {
    type Item = &'a [Transition];

    fn next(&mut self) -> Option<&'a [Transition]> {
        if let Some(&(start, end)) = self.chunks.next() {
            return Some(&self.transitions[start..end]);
        }
        if let Some(chunk) = self.active.take() {
            return Some(chunk);
        }
        None
    }
}

/// A single transition in a trie to another state.
#[derive(Clone, Copy)]
struct Transition {
    byte: u8,
    next: StateID,
}

impl core::fmt::Debug for Transition {
    fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
        write!(
            f,
            "{:?} => {}",
            crate::util::escape::DebugByte(self.byte),
            self.next.as_usize()
        )
    }
}

#[cfg(test)]
mod tests {
    use bstr::B;
    use regex_syntax::hir::Hir;

    use super::*;

    #[test]
    fn zap() {
        let mut trie = LiteralTrie::forward();
        trie.add(b"zapper").unwrap();
        trie.add(b"z").unwrap();
        trie.add(b"zap").unwrap();

        let got = trie.compile_to_hir();
        let expected = Hir::concat(vec![
            Hir::literal(B("z")),
            Hir::alternation(vec![
                Hir::literal(B("apper")),
                Hir::empty(),
                Hir::literal(B("ap")),
            ]),
        ]);
        assert_eq!(expected, got);
    }

    #[test]
    fn maker() {
        let mut trie = LiteralTrie::forward();
        trie.add(b"make").unwrap();
        trie.add(b"maple").unwrap();
        trie.add(b"maker").unwrap();

        let got = trie.compile_to_hir();
        let expected = Hir::concat(vec![
|
||||
Hir::literal(B("ma")),
|
||||
Hir::alternation(vec![
|
||||
Hir::concat(vec![
|
||||
Hir::literal(B("ke")),
|
||||
Hir::alternation(vec![Hir::empty(), Hir::literal(B("r"))]),
|
||||
]),
|
||||
Hir::literal(B("ple")),
|
||||
]),
|
||||
]);
|
||||
assert_eq!(expected, got);
|
||||
}
|
||||
}
|
||||
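The `zap` and `maker` tests above both hinge on factoring a set of literals by their longest common prefix (`"z"` and `"ma"` respectively) before building the alternation. As a standalone sketch of that prefix computation (a hypothetical helper, not part of this crate), it might look like:

```rust
// Compute the length of the longest prefix shared by every literal in
// `lits`. An empty input set shares the empty prefix.
fn common_prefix_len(lits: &[&[u8]]) -> usize {
    let first = match lits.first() {
        Some(f) => f,
        None => return 0,
    };
    let mut len = first.len();
    for lit in &lits[1..] {
        // Shrink `len` to the longest match between `first` and `lit`.
        let mut i = 0;
        while i < len && i < lit.len() && first[i] == lit[i] {
            i += 1;
        }
        len = i;
    }
    len
}

fn main() {
    let lits: Vec<&[u8]> = vec![b"zapper", b"z", b"zap"];
    // The shared prefix of "zapper", "z" and "zap" is "z", of length 1.
    assert_eq!(common_prefix_len(&lits), 1);
}
```

The trie does more than this (it factors recursively and preserves leftmost-first match priority), but the prefix step is the core of why `z(apper||ap)` comes out of the `zap` test.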
296 third-party/vendor/regex-automata/src/nfa/thompson/map.rs vendored Normal file
@@ -0,0 +1,296 @@
// This module contains a couple of simple, purpose-built hash maps. The key
// trade off they make is that they serve as caches rather than true maps. That
// is, inserting a new entry may cause eviction of another entry. This gives
// us two things. First, there's less overhead associated with inserts and
// lookups. Secondly, it lets us control our memory usage.
//
// These maps are used in some fairly hot code when generating NFA states for
// large Unicode character classes.
//
// Instead of exposing a rich hashmap entry API, we just permit the caller to
// produce a hash of the key directly. The hash can then be reused for both
// lookups and insertions at the cost of leaking abstraction a bit. But these
// are for internal use only, so it's fine.
//
// The Utf8BoundedMap is used for Daciuk's algorithm for constructing a
// (almost) minimal DFA for large Unicode character classes in linear time.
// (Daciuk's algorithm is always used when compiling forward NFAs. For reverse
// NFAs, it's only used when the compiler is configured to 'shrink' the NFA,
// since there's a bit more expense in the reverse direction.)
//
// The Utf8SuffixMap is used when compiling large Unicode character classes for
// reverse NFAs when 'shrink' is disabled. Specifically, it augments the naive
// construction of UTF-8 automata by caching common suffixes. This doesn't
// get the same space savings as Daciuk's algorithm, but it's basically as
// fast as the naive approach and typically winds up using less memory (since
// it generates smaller NFAs) despite the presence of the cache.
//
// These maps effectively represent caching mechanisms for sparse and
// byte-range NFA states, respectively. The former represents a single NFA
// state with many transitions of equivalent priority while the latter
// represents a single NFA state with a single transition. (Neither state ever
// has or is an epsilon transition.) Thus, they have different key types. It's
// likely we could make one generic map, but the machinery didn't seem worth
// it. They are simple enough.

use alloc::{vec, vec::Vec};

use crate::{
    nfa::thompson::Transition,
    util::{
        int::{Usize, U64},
        primitives::StateID,
    },
};

// Basic FNV-1a hash constants as described in:
// https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
const PRIME: u64 = 1099511628211;
const INIT: u64 = 14695981039346656037;

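Both maps below hash with plain FNV-1a using the `PRIME` and `INIT` constants above. As a self-contained sketch (the `fnv1a` helper is hypothetical; the real code feeds transition fields rather than raw bytes), the core loop is just XOR-then-multiply per input unit:

```rust
// FNV-1a offset basis and prime for 64-bit hashes, matching the constants
// used by this module.
const PRIME: u64 = 1099511628211;
const INIT: u64 = 14695981039346656037;

// Hash a byte slice with FNV-1a: start from the offset basis, then for each
// byte XOR it in and multiply by the prime (wrapping on overflow).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h = INIT;
    for &b in bytes {
        h = (h ^ u64::from(b)).wrapping_mul(PRIME);
    }
    h
}

fn main() {
    // The hash of empty input is the offset basis itself.
    assert_eq!(fnv1a(b""), INIT);
    // Multiplying by an odd prime is a bijection mod 2^64, so distinct
    // single bytes always hash differently.
    assert_ne!(fnv1a(b"a"), fnv1a(b"b"));
}
```

`Utf8BoundedMap::hash` follows the same recipe but folds in each transition's `start`, `end`, and `next` fields, then reduces the result modulo the number of slots to get an index.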
/// A bounded hash map where the key is a sequence of NFA transitions and the
/// value is a pre-existing NFA state ID.
///
/// std's hashmap can be used for this, however, this map has two important
/// advantages. Firstly, it has lower overhead. Secondly, it permits us to
/// control our memory usage by limiting the number of slots. In general, the
/// cost here is that this map acts as a cache. That is, inserting a new entry
/// may remove an old entry. We are okay with this, since it does not impact
/// correctness in the cases where it is used. The only effect that dropping
/// states from the cache has is that the resulting NFA generated may be bigger
/// than it otherwise would be.
///
/// This improves benchmarks that compile large Unicode character classes,
/// since it makes the generation of (almost) minimal UTF-8 automata faster.
/// Specifically, one could observe the difference with std's hashmap via
/// something like the following benchmark:
///
///     hyperfine "regex-cli debug thompson -qr --captures none '\w{90} ecurB'"
///
/// But to observe that difference, you'd have to modify the code to use
/// std's hashmap.
///
/// It is quite possible that there is a better way to approach this problem.
/// For example, if there happens to be a very common state that collides with
/// a lot of less frequent states, then we could wind up with very poor caching
/// behavior. Alas, the effectiveness of this cache has not been measured.
/// Instead, ad hoc experiments suggest that it is "good enough." Additional
/// smarts (such as an LRU eviction policy) have to be weighed against the
/// amount of extra time they cost.
#[derive(Clone, Debug)]
pub struct Utf8BoundedMap {
    /// The current version of this map. Only entries with matching versions
    /// are considered during lookups. If an entry is found with a mismatched
    /// version, then the map behaves as if the entry does not exist.
    ///
    /// This makes it possible to clear the map by simply incrementing the
    /// version number instead of actually deallocating any storage.
    version: u16,
    /// The total number of entries this map can store.
    capacity: usize,
    /// The actual entries, keyed by hash. Collisions between different states
    /// result in the old state being dropped.
    map: Vec<Utf8BoundedEntry>,
}

/// An entry in this map.
#[derive(Clone, Debug, Default)]
struct Utf8BoundedEntry {
    /// The version of the map used to produce this entry. If this entry's
    /// version does not match the current version of the map, then the map
    /// should behave as if this entry does not exist.
    version: u16,
    /// The key, which is a sorted sequence of non-overlapping NFA transitions.
    key: Vec<Transition>,
    /// The state ID corresponding to the state containing the transitions in
    /// this entry.
    val: StateID,
}

impl Utf8BoundedMap {
    /// Create a new bounded map with the given capacity. The map will never
    /// grow beyond the given size.
    ///
    /// Note that this does not allocate. Instead, callers must call `clear`
    /// before using this map. `clear` will allocate space if necessary.
    ///
    /// This avoids the need to pay for the allocation of this map when
    /// compiling regexes that lack large Unicode character classes.
    pub fn new(capacity: usize) -> Utf8BoundedMap {
        assert!(capacity > 0);
        Utf8BoundedMap { version: 0, capacity, map: vec![] }
    }

    /// Clear this map of all entries, but permit the reuse of allocation
    /// if possible.
    ///
    /// This must be called before the map can be used.
    pub fn clear(&mut self) {
        if self.map.is_empty() {
            self.map = vec![Utf8BoundedEntry::default(); self.capacity];
        } else {
            self.version = self.version.wrapping_add(1);
            // If we loop back to version 0, then we forcefully clear the
            // entire map. Otherwise, it might be possible to incorrectly
            // match entries used to generate other NFAs.
            if self.version == 0 {
                self.map = vec![Utf8BoundedEntry::default(); self.capacity];
            }
        }
    }

    /// Return a hash of the given transitions.
    pub fn hash(&self, key: &[Transition]) -> usize {
        let mut h = INIT;
        for t in key {
            h = (h ^ u64::from(t.start)).wrapping_mul(PRIME);
            h = (h ^ u64::from(t.end)).wrapping_mul(PRIME);
            h = (h ^ t.next.as_u64()).wrapping_mul(PRIME);
        }
        (h % self.map.len().as_u64()).as_usize()
    }

    /// Retrieve the cached state ID corresponding to the given key. The hash
    /// given must have been computed with `hash` using the same key value.
    ///
    /// If there is no cached state with the given transitions, then None is
    /// returned.
    pub fn get(&mut self, key: &[Transition], hash: usize) -> Option<StateID> {
        let entry = &self.map[hash];
        if entry.version != self.version {
            return None;
        }
        // There may be a hash collision, so we need to confirm real equality.
        if entry.key != key {
            return None;
        }
        Some(entry.val)
    }

    /// Add a cached state to this map with the given key. Callers should
    /// ensure that `state_id` points to a state that contains precisely the
    /// NFA transitions given.
    ///
    /// `hash` must have been computed using the `hash` method with the same
    /// key.
    pub fn set(
        &mut self,
        key: Vec<Transition>,
        hash: usize,
        state_id: StateID,
    ) {
        self.map[hash] =
            Utf8BoundedEntry { version: self.version, key, val: state_id };
    }
}
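The version-bump clearing trick documented on `Utf8BoundedMap` can be isolated into a minimal sketch (hypothetical names; `u64` keys stand in for transition sequences). Clearing is O(1) because stale slots are invalidated by a version mismatch rather than by rewriting every slot:

```rust
// A tiny bounded cache: each slot records the map version it was written
// under, and lookups ignore slots whose version is stale.
struct BoundedCache {
    version: u16,
    slots: Vec<(u16, u64, u64)>, // (version, key, value)
}

impl BoundedCache {
    fn new(capacity: usize) -> BoundedCache {
        BoundedCache { version: 0, slots: vec![(0, 0, 0); capacity] }
    }

    fn clear(&mut self) {
        // Wrapping back to 0 would let pre-wrap entries alias fresh ones,
        // so the real implementation above re-fills the slots in that case.
        // Elided here for brevity.
        self.version = self.version.wrapping_add(1);
    }

    fn set(&mut self, key: u64, value: u64) {
        let i = (key % self.slots.len() as u64) as usize;
        // Collisions simply evict whatever was in the slot.
        self.slots[i] = (self.version, key, value);
    }

    fn get(&self, key: u64) -> Option<u64> {
        let i = (key % self.slots.len() as u64) as usize;
        let (v, k, val) = self.slots[i];
        if v == self.version && k == key { Some(val) } else { None }
    }
}

fn main() {
    let mut cache = BoundedCache::new(8);
    cache.clear(); // must be called before use, as with the maps above
    cache.set(42, 7);
    assert_eq!(cache.get(42), Some(7));
    cache.clear(); // O(1): the entry is evicted by version mismatch
    assert_eq!(cache.get(42), None);
}
```

Note that `get` checks the full key, not just the version: two different keys can land in the same slot, so a hash (or modulus) match alone is never taken as equality.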
/// A cache of suffixes used to modestly compress UTF-8 automata for large
/// Unicode character classes.
#[derive(Clone, Debug)]
pub struct Utf8SuffixMap {
    /// The current version of this map. Only entries with matching versions
    /// are considered during lookups. If an entry is found with a mismatched
    /// version, then the map behaves as if the entry does not exist.
    version: u16,
    /// The total number of entries this map can store.
    capacity: usize,
    /// The actual entries, keyed by hash. Collisions between different states
    /// result in the old state being dropped.
    map: Vec<Utf8SuffixEntry>,
}

/// A key that uniquely identifies an NFA state. It is a triple that represents
/// a transition from one state for a particular byte range.
#[derive(Clone, Debug, Default, Eq, PartialEq)]
pub struct Utf8SuffixKey {
    pub from: StateID,
    pub start: u8,
    pub end: u8,
}

/// An entry in this map.
#[derive(Clone, Debug, Default)]
struct Utf8SuffixEntry {
    /// The version of the map used to produce this entry. If this entry's
    /// version does not match the current version of the map, then the map
    /// should behave as if this entry does not exist.
    version: u16,
    /// The key, which consists of a transition in a particular state.
    key: Utf8SuffixKey,
    /// The identifier that the transition in the key maps to.
    val: StateID,
}

impl Utf8SuffixMap {
    /// Create a new bounded map with the given capacity. The map will never
    /// grow beyond the given size.
    ///
    /// Note that this does not allocate. Instead, callers must call `clear`
    /// before using this map. `clear` will allocate space if necessary.
    ///
    /// This avoids the need to pay for the allocation of this map when
    /// compiling regexes that lack large Unicode character classes.
    pub fn new(capacity: usize) -> Utf8SuffixMap {
        assert!(capacity > 0);
        Utf8SuffixMap { version: 0, capacity, map: vec![] }
    }

    /// Clear this map of all entries, but permit the reuse of allocation
    /// if possible.
    ///
    /// This must be called before the map can be used.
    pub fn clear(&mut self) {
        if self.map.is_empty() {
            self.map = vec![Utf8SuffixEntry::default(); self.capacity];
        } else {
            self.version = self.version.wrapping_add(1);
            if self.version == 0 {
                self.map = vec![Utf8SuffixEntry::default(); self.capacity];
            }
        }
    }

    /// Return a hash of the given transition.
    pub fn hash(&self, key: &Utf8SuffixKey) -> usize {
        // Basic FNV-1a hash as described:
        // https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
        const PRIME: u64 = 1099511628211;
        const INIT: u64 = 14695981039346656037;

        let mut h = INIT;
        h = (h ^ key.from.as_u64()).wrapping_mul(PRIME);
        h = (h ^ u64::from(key.start)).wrapping_mul(PRIME);
        h = (h ^ u64::from(key.end)).wrapping_mul(PRIME);
        (h % self.map.len().as_u64()).as_usize()
    }

    /// Retrieve the cached state ID corresponding to the given key. The hash
    /// given must have been computed with `hash` using the same key value.
    ///
    /// If there is no cached state with the given key, then None is returned.
    pub fn get(
        &mut self,
        key: &Utf8SuffixKey,
        hash: usize,
    ) -> Option<StateID> {
        let entry = &self.map[hash];
        if entry.version != self.version {
            return None;
        }
        if key != &entry.key {
            return None;
        }
        Some(entry.val)
    }

    /// Add a cached state to this map with the given key. Callers should
    /// ensure that `state_id` points to a state that contains precisely the
    /// NFA transition given.
    ///
    /// `hash` must have been computed using the `hash` method with the same
    /// key.
    pub fn set(&mut self, key: Utf8SuffixKey, hash: usize, state_id: StateID) {
        self.map[hash] =
            Utf8SuffixEntry { version: self.version, key, val: state_id };
    }
}
81 third-party/vendor/regex-automata/src/nfa/thompson/mod.rs vendored Normal file
@@ -0,0 +1,81 @@
/*!
Defines a Thompson NFA and provides the [`PikeVM`](pikevm::PikeVM) and
[`BoundedBacktracker`](backtrack::BoundedBacktracker) regex engines.

A Thompson NFA (non-deterministic finite automaton) is arguably _the_ central
data type in this library. It is the result of what is commonly referred to as
"regex compilation." That is, turning a regex pattern from its concrete syntax
string into something that can run a search looks roughly like this:

* A `&str` is parsed into a [`regex-syntax::ast::Ast`](regex_syntax::ast::Ast).
* An `Ast` is translated into a [`regex-syntax::hir::Hir`](regex_syntax::hir::Hir).
* An `Hir` is compiled into a [`NFA`].
* The `NFA` is then used to build one of a few different regex engines:
    * An `NFA` is used directly in the `PikeVM` and `BoundedBacktracker`
    engines.
    * An `NFA` is used by a [hybrid NFA/DFA](crate::hybrid) to build out a
    DFA's transition table at search time.
    * An `NFA`, assuming it is one-pass, is used to build a full
    [one-pass DFA](crate::dfa::onepass) ahead of time.
    * An `NFA` is used to build a [full DFA](crate::dfa) ahead of time.

The [`meta`](crate::meta) regex engine makes all of these choices for you based
on various criteria. However, if you have a lower level use case, _you_ can
build any of the above regex engines and use them directly. But you must start
here by building an `NFA`.

# Details

It is perhaps worth expanding a bit more on what it means to go through the
`&str`->`Ast`->`Hir`->`NFA` process.

* Parsing a string into an `Ast` gives it a structured representation.
Crucially, the size and amount of work done in this step is proportional to the
size of the original string. No optimization or Unicode handling is done at
this point. This means that parsing into an `Ast` has very predictable costs.
Moreover, an `Ast` can be roundtripped back to its original pattern string as
written.
* Translating an `Ast` into an `Hir` is a process by which the structured
representation is simplified down to its most fundamental components.
Translation deals with flags such as case insensitivity by converting things
like `(?i:a)` to `[Aa]`. Translation is also where Unicode tables are consulted
to resolve things like `\p{Emoji}` and `\p{Greek}`. It also flattens each
character class, regardless of how deeply nested it is, into a single sequence
of non-overlapping ranges. All the various literal forms are thrown out in
favor of one common representation. Overall, the `Hir` is small enough to fit
into your head and makes analysis and other tasks much simpler.
* Compiling an `Hir` into an `NFA` formulates the regex into a finite state
machine whose transitions are defined over bytes. For example, an `Hir` might
have a Unicode character class corresponding to a sequence of ranges defined
in terms of `char`. Compilation is then responsible for turning those ranges
into a UTF-8 automaton. That is, an automaton that matches the UTF-8 encoding
of just the codepoints specified by those ranges. Otherwise, the main job of
an `NFA` is to serve as a byte-code of sorts for a virtual machine. It can be
seen as a sequence of instructions for how to match a regex.
*/

#[cfg(feature = "nfa-backtrack")]
pub mod backtrack;
mod builder;
#[cfg(feature = "syntax")]
mod compiler;
mod error;
#[cfg(feature = "syntax")]
mod literal_trie;
#[cfg(feature = "syntax")]
mod map;
mod nfa;
#[cfg(feature = "nfa-pikevm")]
pub mod pikevm;
#[cfg(feature = "syntax")]
mod range_trie;

pub use self::{
    builder::Builder,
    error::BuildError,
    nfa::{
        DenseTransitions, PatternIter, SparseTransitions, State, Transition,
        NFA,
    },
};
#[cfg(feature = "syntax")]
pub use compiler::{Compiler, Config, WhichCaptures};
2099 third-party/vendor/regex-automata/src/nfa/thompson/nfa.rs vendored Normal file (diff suppressed: file too large)
2359 third-party/vendor/regex-automata/src/nfa/thompson/pikevm.rs vendored Normal file (diff suppressed: file too large)
1051 third-party/vendor/regex-automata/src/nfa/thompson/range_trie.rs vendored Normal file (diff suppressed: file too large)