Vendor things

This commit is contained in:
John Doty 2024-03-08 11:03:01 -08:00
parent 5deceec006
commit 977e3c17e5
19434 changed files with 10682014 additions and 0 deletions

@@ -0,0 +1,55 @@
/*!
Provides non-deterministic finite automata (NFA) and regex engines that use
them.
While NFAs and DFAs (deterministic finite automata) have equivalent *theoretical*
power, their usage in practice tends to result in different engineering
trade-offs. While this isn't meant to be a comprehensive treatment of the topic,
here are a few key trade-offs that are, at minimum, true for this crate:
* NFAs tend to be represented sparsely whereas DFAs are represented densely.
Sparse representations use less memory, but are slower to traverse. Conversely,
dense representations use more memory, but are faster to traverse. (Sometimes
these lines are blurred. For example, an `NFA` might choose to represent a
particular state in a dense fashion, and a DFA can be built using a sparse
representation via [`sparse::DFA`](crate::dfa::sparse::DFA).)
* NFAs have epsilon transitions and DFAs don't. In practice, this means that
handling a single byte in a haystack with an NFA at search time may require
visiting multiple NFA states. In a DFA, each byte only requires visiting
a single state. Stated differently, NFAs require a variable number of CPU
instructions to process one byte in a haystack, whereas a DFA uses a constant
number of CPU instructions to process one byte.
* NFAs are generally easier to amend with secondary storage. For example, the
[`thompson::pikevm::PikeVM`] uses an NFA to match, but also uses additional
memory beyond the model of a finite state machine to track offsets for matching
capturing groups. Conversely, the most a DFA can do is report the offset (and
pattern ID) at which a match occurred. This is generally why we also compile
DFAs in reverse, so that we can run them after finding the end of a match to
also find the start of a match.
* NFAs take worst case linear time to build, but DFAs take worst case
exponential time to build. The [hybrid NFA/DFA](crate::hybrid) mitigates this
challenge for DFAs in many practical cases.
There are likely other differences, but the bottom line is that NFAs tend to be
more memory-efficient and give easier opportunities for increasing expressive
power, whereas DFAs are faster to search with.
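The sparse-versus-dense trade-off above can be made concrete with a toy sketch. The types below are purely illustrative (they are not this crate's actual state layouts): a dense state indexes a 256-entry table directly by the input byte, while a sparse state binary-searches a sorted list of byte ranges.

```rust
// A minimal sketch of dense vs. sparse transition representations. These
// types are illustrative only; they do not mirror this crate's internals.

/// Dense: one slot per possible input byte. O(1) lookup, 256 slots of memory.
struct DenseState {
    next: [Option<u32>; 256],
}

/// Sparse: sorted, non-overlapping byte ranges. Less memory for states with
/// few transitions, but each lookup requires a search.
struct SparseState {
    /// (start, end, next) with start <= end, sorted by start.
    ranges: Vec<(u8, u8, u32)>,
}

impl DenseState {
    fn next_state(&self, byte: u8) -> Option<u32> {
        self.next[usize::from(byte)]
    }
}

impl SparseState {
    fn next_state(&self, byte: u8) -> Option<u32> {
        // Binary search over sorted ranges: O(log n) instead of O(1).
        self.ranges
            .binary_search_by(|&(start, end, _)| {
                if byte < start {
                    core::cmp::Ordering::Greater
                } else if byte > end {
                    core::cmp::Ordering::Less
                } else {
                    core::cmp::Ordering::Equal
                }
            })
            .ok()
            .map(|i| self.ranges[i].2)
    }
}

fn main() {
    // A state that maps b'a'..=b'z' to state 1 and b'0'..=b'9' to state 2.
    let sparse =
        SparseState { ranges: vec![(b'0', b'9', 2), (b'a', b'z', 1)] };
    let mut next = [None; 256];
    for b in b'a'..=b'z' {
        next[usize::from(b)] = Some(1);
    }
    for b in b'0'..=b'9' {
        next[usize::from(b)] = Some(2);
    }
    let dense = DenseState { next };
    // Both representations answer every byte lookup identically.
    for b in 0..=255u8 {
        assert_eq!(dense.next_state(b), sparse.next_state(b));
    }
    println!("dense and sparse lookups agree on all 256 bytes");
}
```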
# Why only a Thompson NFA?
Currently, the only kind of NFA we support in this crate is a [Thompson
NFA](https://en.wikipedia.org/wiki/Thompson%27s_construction). This refers
to a specific construction algorithm that takes the syntax of a regex
pattern and converts it to an NFA. Specifically, it makes gratuitous use of
epsilon transitions in order to keep its structure simple. In exchange, its
construction time is linear in the size of the regex. A Thompson NFA also makes
the guarantee that given any state and a character in a haystack, there is at
most one transition defined for it. (Although there may be many epsilon
transitions.)
It's possible that other types of NFAs will be added in the future, such as a
[Glushkov NFA](https://en.wikipedia.org/wiki/Glushkov%27s_construction_algorithm).
But currently, this crate only provides a Thompson NFA.
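To make the epsilon-transition cost concrete, here is a hand-built, toy Thompson-style NFA for the pattern `a|b` (it does not use this crate's types): a start state with an epsilon transition to each literal branch. Computing the epsilon closure of the start state is exactly the "visiting multiple NFA states per byte" overhead described above.

```rust
// A hand-built Thompson-style NFA for the pattern 'a|b'. This is a toy
// illustration of epsilon transitions; it does not use this crate's types.

#[derive(Debug)]
enum State {
    /// Epsilon transitions: move to each target without consuming input.
    Union(Vec<usize>),
    /// Consume one specific byte, then move to the target state.
    Byte(u8, usize),
    /// A final matching state.
    Match,
}

/// Collect every state reachable from `start` via epsilon transitions alone.
fn epsilon_closure(states: &[State], start: usize) -> Vec<usize> {
    let mut stack = vec![start];
    let mut seen = vec![false; states.len()];
    let mut closure = vec![];
    while let Some(sid) = stack.pop() {
        if seen[sid] {
            continue;
        }
        seen[sid] = true;
        closure.push(sid);
        if let State::Union(ref eps) = states[sid] {
            stack.extend(eps.iter().copied());
        }
    }
    closure.sort();
    closure
}

fn main() {
    // 0: union of the 'a' branch (1) and the 'b' branch (2); 3: match.
    let nfa = vec![
        State::Union(vec![1, 2]),
        State::Byte(b'a', 3),
        State::Byte(b'b', 3),
        State::Match,
    ];
    // Before reading any byte, an NFA simulation must consider all three
    // states in the start state's closure. A DFA would be in one state.
    let closure = epsilon_closure(&nfa, 0);
    assert_eq!(closure, vec![0, 1, 2]);
    println!("epsilon closure of start: {:?}", closure);
}
```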
*/
#[cfg(feature = "nfa-thompson")]
pub mod thompson;

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,185 @@
use crate::util::{
captures, look,
primitives::{PatternID, StateID},
};
/// An error that can occur during the construction of a Thompson NFA.
///
/// This error does not provide many introspection capabilities. There are
/// generally only two things you can do with it:
///
/// * Obtain a human readable message via its `std::fmt::Display` impl.
/// * Access an underlying [`regex_syntax::Error`] type from its `source`
/// method via the `std::error::Error` trait. This error only occurs when using
/// convenience routines for building an NFA directly from a pattern string.
///
/// Otherwise, errors typically occur when a limit has been breached. For
/// example, if the total heap usage of the compiled NFA exceeds the limit
/// set by [`Config::nfa_size_limit`](crate::nfa::thompson::Config), then
/// building the NFA will fail.
#[derive(Clone, Debug)]
pub struct BuildError {
kind: BuildErrorKind,
}
/// The kind of error that occurred during the construction of a Thompson NFA.
#[derive(Clone, Debug)]
enum BuildErrorKind {
/// An error that occurred while parsing a regular expression. Note that
/// this error may be printed over multiple lines, and is generally
/// intended to be end user readable on its own.
#[cfg(feature = "syntax")]
Syntax(regex_syntax::Error),
/// An error that occurs if the capturing groups provided to an NFA builder
/// do not satisfy the documented invariants. For example, things like
/// too many groups, missing groups, having the first (zeroth) group be
/// named or duplicate group names within the same pattern.
Captures(captures::GroupInfoError),
/// An error that occurs when an NFA contains a Unicode word boundary, but
/// where the crate was compiled without the necessary data for dealing
/// with Unicode word boundaries.
Word(look::UnicodeWordBoundaryError),
/// An error that occurs if too many patterns were given to the NFA
/// compiler.
TooManyPatterns {
/// The number of patterns given, which exceeds the limit.
given: usize,
/// The limit on the number of patterns.
limit: usize,
},
/// An error that occurs if too many states are produced while building an NFA.
TooManyStates {
/// The minimum number of states that are desired, which exceeds the
/// limit.
given: usize,
/// The limit on the number of states.
limit: usize,
},
/// An error that occurs when NFA compilation exceeds a configured heap
/// limit.
ExceededSizeLimit {
/// The configured limit, in bytes.
limit: usize,
},
/// An error that occurs when an invalid capture group index is added to
/// the NFA. An "invalid" index can be one that would otherwise overflow
/// a `usize` on the current target.
InvalidCaptureIndex {
/// The invalid index that was given.
index: u32,
},
/// An error that occurs when one tries to build a reverse NFA with
/// captures enabled. Currently, this isn't supported, but we probably
/// should support it at some point.
#[cfg(feature = "syntax")]
UnsupportedCaptures,
}
impl BuildError {
/// If this error occurred because the NFA exceeded the configured size
/// limit before being built, then this returns the configured size limit.
///
/// The limit returned is what was configured, and corresponds to the
/// maximum amount of heap usage in bytes.
pub fn size_limit(&self) -> Option<usize> {
match self.kind {
BuildErrorKind::ExceededSizeLimit { limit } => Some(limit),
_ => None,
}
}
fn kind(&self) -> &BuildErrorKind {
&self.kind
}
#[cfg(feature = "syntax")]
pub(crate) fn syntax(err: regex_syntax::Error) -> BuildError {
BuildError { kind: BuildErrorKind::Syntax(err) }
}
pub(crate) fn captures(err: captures::GroupInfoError) -> BuildError {
BuildError { kind: BuildErrorKind::Captures(err) }
}
pub(crate) fn word(err: look::UnicodeWordBoundaryError) -> BuildError {
BuildError { kind: BuildErrorKind::Word(err) }
}
pub(crate) fn too_many_patterns(given: usize) -> BuildError {
let limit = PatternID::LIMIT;
BuildError { kind: BuildErrorKind::TooManyPatterns { given, limit } }
}
pub(crate) fn too_many_states(given: usize) -> BuildError {
let limit = StateID::LIMIT;
BuildError { kind: BuildErrorKind::TooManyStates { given, limit } }
}
pub(crate) fn exceeded_size_limit(limit: usize) -> BuildError {
BuildError { kind: BuildErrorKind::ExceededSizeLimit { limit } }
}
pub(crate) fn invalid_capture_index(index: u32) -> BuildError {
BuildError { kind: BuildErrorKind::InvalidCaptureIndex { index } }
}
#[cfg(feature = "syntax")]
pub(crate) fn unsupported_captures() -> BuildError {
BuildError { kind: BuildErrorKind::UnsupportedCaptures }
}
}
#[cfg(feature = "std")]
impl std::error::Error for BuildError {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
match self.kind() {
#[cfg(feature = "syntax")]
BuildErrorKind::Syntax(ref err) => Some(err),
BuildErrorKind::Captures(ref err) => Some(err),
_ => None,
}
}
}
impl core::fmt::Display for BuildError {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
match self.kind() {
#[cfg(feature = "syntax")]
BuildErrorKind::Syntax(_) => write!(f, "error parsing regex"),
BuildErrorKind::Captures(_) => {
write!(f, "error with capture groups")
}
BuildErrorKind::Word(_) => {
write!(f, "NFA contains Unicode word boundary")
}
BuildErrorKind::TooManyPatterns { given, limit } => write!(
f,
"attempted to compile {} patterns, \
which exceeds the limit of {}",
given, limit,
),
BuildErrorKind::TooManyStates { given, limit } => write!(
f,
"attempted to compile {} NFA states, \
which exceeds the limit of {}",
given, limit,
),
BuildErrorKind::ExceededSizeLimit { limit } => write!(
f,
"heap usage during NFA compilation exceeded limit of {}",
limit,
),
BuildErrorKind::InvalidCaptureIndex { index } => write!(
f,
"capture group index {} is invalid (too big or discontinuous)",
index,
),
#[cfg(feature = "syntax")]
BuildErrorKind::UnsupportedCaptures => write!(
f,
"currently captures must be disabled when compiling \
a reverse NFA",
),
}
}
}
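`BuildError` follows a common Rust error-design pattern: a public, opaque struct wrapping a private kind enum, with `Display` giving a short top-level message and `source` exposing the underlying cause. A minimal standalone sketch of that pattern (the `ParseError` type and messages below are hypothetical, not this crate's):

```rust
// A minimal sketch of the opaque-struct-over-private-enum error pattern used
// by 'BuildError' above. 'ParseError' and all messages here are hypothetical.
use std::error::Error;
use std::fmt;

#[derive(Debug)]
struct ParseError(String);

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "parse error: {}", self.0)
    }
}

impl Error for ParseError {}

#[derive(Debug)]
pub struct BuildError {
    kind: BuildErrorKind,
}

#[derive(Debug)]
enum BuildErrorKind {
    Syntax(ParseError),
    ExceededSizeLimit { limit: usize },
}

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.kind {
            // Keep the top-level message short; details live in the source.
            BuildErrorKind::Syntax(_) => write!(f, "error parsing regex"),
            BuildErrorKind::ExceededSizeLimit { limit } => {
                write!(f, "heap usage exceeded limit of {}", limit)
            }
        }
    }
}

impl Error for BuildError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self.kind {
            BuildErrorKind::Syntax(ref err) => Some(err),
            _ => None,
        }
    }
}

fn main() {
    let err = BuildError {
        kind: BuildErrorKind::Syntax(ParseError("unclosed group".to_string())),
    };
    assert_eq!(err.to_string(), "error parsing regex");
    // The underlying cause is still reachable via the source chain.
    assert_eq!(err.source().unwrap().to_string(), "parse error: unclosed group");
    println!("{}: {}", err, err.source().unwrap());
}
```

Keeping the kind enum private means variants can be added or removed without breaking downstream callers, which is why the real `BuildError` exposes only targeted accessors like `size_limit`.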


@@ -0,0 +1,528 @@
use core::mem;
use alloc::{vec, vec::Vec};
use crate::{
nfa::thompson::{self, compiler::ThompsonRef, BuildError, Builder},
util::primitives::{IteratorIndexExt, StateID},
};
/// A trie that preserves leftmost-first match semantics.
///
/// This is a purpose-built data structure for optimizing 'lit1|lit2|..|litN'
/// patterns. It can *only* handle alternations of literals, which makes it
/// somewhat restricted in its scope, but literal alternations are fairly
/// common.
///
/// At a 5,000 foot level, the main idea of this trie is to make an alternation
/// of literals look more like a DFA than an NFA via epsilon removal.
///
/// More precisely, the main issue is in how alternations are compiled into
/// a Thompson NFA. Namely, each alternation gets a single NFA "union" state
/// with an epsilon transition for every branch of the alternation pointing to
/// an NFA state corresponding to the start of that branch. The main problem
/// with this representation is the cost of computing an epsilon closure. Once
/// you hit the alternation's start state, it acts as a sort of "clog" that
/// requires you to traverse all of the epsilon transitions to compute the full
/// closure.
///
/// Fixing such clogs in the general case is pretty tricky without going to
/// a DFA (or perhaps a Glushkov NFA, but that comes with other problems).
/// But at least in the case of an alternation of literals, we can convert
/// that to a prefix trie without too much cost. In theory, that's all you
/// really need to do: build the trie and then compile it to a Thompson NFA.
/// For example, if you have the pattern 'bar|baz|foo', then using a trie, it
/// is transformed to something like 'b(a(r|z))|f'. This reduces the clog by
/// reducing the number of epsilon transitions out of the alternation's start
/// state from 3 to 2 (it actually gets down to 1 when you use a sparse state,
/// which we do below). It's a small effect here, but when your alternation is
/// huge, the savings is also huge.
///
/// And that is... essentially what a LiteralTrie does. But there is one
/// hiccup. Consider a regex like 'sam|samwise'. How does a prefix trie compile
/// that when leftmost-first semantics are used? If 'sam|samwise' was the
/// entire regex, then you could just drop the 'samwise' branch entirely since
/// it is impossible to match ('sam' will always take priority, and since it
/// is a prefix of 'samwise', 'samwise' will never match). But what about the
/// regex '\b(sam|samwise)\b'? In that case, you can't remove 'samwise' because
/// it might match when 'sam' doesn't fall on a word boundary.
///
/// The main idea is that 'sam|samwise' can be translated to 'sam(?:|wise)',
/// which is a precisely equivalent regex that also gets rid of the clog.
///
/// Another example is 'zapper|z|zap'. That gets translated to
/// 'z(?:apper||ap)'.
///
/// We accomplish this by giving each state in the trie multiple "chunks" of
/// transitions. Each chunk barrier represents a match. The idea is that once
/// you know a match occurs, none of the transitions after the match can be
/// re-ordered and mixed in with the transitions before the match. Otherwise,
/// the match semantics could be changed.
///
/// See the 'State' data type for a bit more detail.
///
/// Future work:
///
/// * In theory, it would be nice to generalize the idea of removing clogs and
/// apply it to the NFA graph itself. Then this could in theory work for
/// case insensitive alternations of literals, or even just alternations where
/// each branch starts with a non-epsilon transition.
/// * Could we instead use the Aho-Corasick algorithm here? The aho-corasick
/// crate deals with leftmost-first matches correctly, but I think this implies
/// encoding failure transitions into a Thompson NFA somehow. Which seems fine,
/// because failure transitions are just unconditional epsilon transitions?
/// * Or perhaps even better, could we use an aho_corasick::AhoCorasick
/// directly? At time of writing, 0.7 is the current version of the
/// aho-corasick crate, and that definitely cannot be used as-is. But if we
/// expose the underlying finite state machine API, then could we use it? That
/// would be super. If we could figure that out, it might also lend itself to
/// more general composition of finite state machines.
#[derive(Clone)]
pub(crate) struct LiteralTrie {
/// The set of trie states. Each state contains one or more chunks, where
/// each chunk is a sparse set of transitions to other states. A leaf state
/// is always a match state that contains only empty chunks (i.e., no
/// transitions).
states: Vec<State>,
/// Whether to add literals in reverse to the trie. Useful when building
/// a reverse NFA automaton.
rev: bool,
}
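The leftmost-first constraint described above can be demonstrated with a few lines of standalone Rust (a sketch, not this crate's matching code): given branches in priority order, the first branch that matches at a position wins, even when a later branch would match more text. This is why 'sam|samwise' cannot simply drop 'samwise' when building the trie; the translation must preserve priority, as in 'sam(?:|wise)'.

```rust
// A small sketch of leftmost-first alternation semantics for literals.
// Branch order encodes priority: earlier branches are preferred, even if a
// later branch would match more text at the same position.

fn leftmost_first_at<'a>(
    branches: &[&'a str],
    haystack: &str,
) -> Option<&'a str> {
    branches.iter().copied().find(|lit| haystack.starts_with(*lit))
}

fn main() {
    // 'sam' comes first, so it wins even though 'samwise' also matches here.
    assert_eq!(leftmost_first_at(&["sam", "samwise"], "samwise!"), Some("sam"));
    // Reversing priority changes the answer: now the longer branch wins.
    assert_eq!(
        leftmost_first_at(&["samwise", "sam"], "samwise!"),
        Some("samwise")
    );
    // 'zapper|z|zap' prefers 'zapper' on input that supports it...
    assert_eq!(
        leftmost_first_at(&["zapper", "z", "zap"], "zapper"),
        Some("zapper")
    );
    // ...but falls back to 'z' (not 'zap') on 'zap?', since 'z' has priority.
    // This matches the trie translation 'z(?:apper||ap)', where the empty
    // (match) branch sits before 'ap'.
    assert_eq!(leftmost_first_at(&["zapper", "z", "zap"], "zap?"), Some("z"));
    println!("leftmost-first semantics preserved");
}
```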
impl LiteralTrie {
/// Create a new literal trie that adds literals in the forward direction.
pub(crate) fn forward() -> LiteralTrie {
let root = State::default();
LiteralTrie { states: vec![root], rev: false }
}
/// Create a new literal trie that adds literals in reverse.
pub(crate) fn reverse() -> LiteralTrie {
let root = State::default();
LiteralTrie { states: vec![root], rev: true }
}
/// Add the given literal to this trie.
///
/// If the literal could not be added because the `StateID` space was
/// exhausted, then an error is returned. If an error returns, the trie
/// is in an unspecified state.
pub(crate) fn add(&mut self, bytes: &[u8]) -> Result<(), BuildError> {
let mut prev = StateID::ZERO;
let mut it = bytes.iter().copied();
while let Some(b) = if self.rev { it.next_back() } else { it.next() } {
prev = self.get_or_add_state(prev, b)?;
}
self.states[prev].add_match();
Ok(())
}
/// If the given transition is defined, then return the next state ID.
/// Otherwise, add the transition to `from` and point it to a new state.
///
/// If a new state ID could not be allocated, then an error is returned.
fn get_or_add_state(
&mut self,
from: StateID,
byte: u8,
) -> Result<StateID, BuildError> {
let active = self.states[from].active_chunk();
match active.binary_search_by_key(&byte, |t| t.byte) {
Ok(i) => Ok(active[i].next),
Err(i) => {
// Add a new state and get its ID.
let next = StateID::new(self.states.len()).map_err(|_| {
BuildError::too_many_states(self.states.len())
})?;
self.states.push(State::default());
// Offset our position to account for all transitions and not
// just the ones in the active chunk.
let i = self.states[from].active_chunk_start() + i;
let t = Transition { byte, next };
self.states[from].transitions.insert(i, t);
Ok(next)
}
}
}
/// Compile this literal trie to the NFA builder given.
///
/// This forwards any errors that may occur while using the given builder.
pub(crate) fn compile(
&self,
builder: &mut Builder,
) -> Result<ThompsonRef, BuildError> {
// Compilation proceeds via depth-first traversal of the trie.
//
// This is overall pretty brutal. The recursive version of this is
// deliciously simple. (See 'compile_to_hir' below for what it might
// look like.) But recursion on a trie means your call stack grows
// in accordance with the longest literal, which just does not seem
// appropriate. So we push the call stack to the heap. But as a result,
// the trie traversal becomes pretty brutal because we essentially
// have to encode the state of a double for-loop into an explicit call
// frame. If someone can simplify this without using recursion, that'd
// be great.
// 'end' is our match state for this trie, but represented in the
// NFA. Any time we see a match in the trie, we insert a transition
// from the current state we're in to 'end'.
let end = builder.add_empty()?;
let mut stack = vec![];
let mut f = Frame::new(&self.states[StateID::ZERO]);
loop {
if let Some(t) = f.transitions.next() {
if self.states[t.next].is_leaf() {
f.sparse.push(thompson::Transition {
start: t.byte,
end: t.byte,
next: end,
});
} else {
f.sparse.push(thompson::Transition {
start: t.byte,
end: t.byte,
// This is a little funny, but when the frame we create
// below completes, it will pop this parent frame off
// and modify this transition to point to the correct
// state.
next: StateID::ZERO,
});
stack.push(f);
f = Frame::new(&self.states[t.next]);
}
continue;
}
// At this point, we have visited all transitions in f.chunk, so
// add it as a sparse NFA state. Unless the chunk was empty, in
// which case, we don't do anything.
if !f.sparse.is_empty() {
let chunk_id = if f.sparse.len() == 1 {
builder.add_range(f.sparse.pop().unwrap())?
} else {
let sparse = mem::replace(&mut f.sparse, vec![]);
builder.add_sparse(sparse)?
};
f.union.push(chunk_id);
}
// Now we need to look to see if there are other chunks to visit.
if let Some(chunk) = f.chunks.next() {
// If we're here, it means we're on the second (or greater)
// chunk, which implies there is a match at this point. So
// connect this state to the final end state.
f.union.push(end);
// Advance to the next chunk.
f.transitions = chunk.iter();
continue;
}
// Now that we are out of chunks, we have completely visited
// this state. So turn our union of chunks into an NFA union
// state, and add that union state to the parent state's current
// sparse state. (If there is no parent, we're done.)
let start = builder.add_union(f.union)?;
match stack.pop() {
None => {
return Ok(ThompsonRef { start, end });
}
Some(mut parent) => {
// OK because the only way a frame gets pushed on to the
// stack (aside from the root) is when a transition has
// been added to 'sparse'.
parent.sparse.last_mut().unwrap().next = start;
f = parent;
}
}
}
}
/// Converts this trie to an equivalent HIR expression.
///
/// We don't actually use this, but it's useful for tests. In particular,
/// it provides a (somewhat) human readable representation of the trie
/// itself.
#[cfg(test)]
fn compile_to_hir(&self) -> regex_syntax::hir::Hir {
self.compile_state_to_hir(StateID::ZERO)
}
/// The recursive implementation of 'to_hir'.
///
/// Notice how simple this is compared to 'compile' above. 'compile' could
/// be similarly simple, but we opt to not use recursion in order to avoid
/// overflowing the stack in the case of a longer literal.
#[cfg(test)]
fn compile_state_to_hir(&self, sid: StateID) -> regex_syntax::hir::Hir {
use regex_syntax::hir::Hir;
let mut alt = vec![];
for (i, chunk) in self.states[sid].chunks().enumerate() {
if i > 0 {
alt.push(Hir::empty());
}
if chunk.is_empty() {
continue;
}
let mut chunk_alt = vec![];
for t in chunk.iter() {
chunk_alt.push(Hir::concat(vec![
Hir::literal(vec![t.byte]),
self.compile_state_to_hir(t.next),
]));
}
alt.push(Hir::alternation(chunk_alt));
}
Hir::alternation(alt)
}
}
impl core::fmt::Debug for LiteralTrie {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
writeln!(f, "LiteralTrie(")?;
for (sid, state) in self.states.iter().with_state_ids() {
writeln!(f, "{:06?}: {:?}", sid.as_usize(), state)?;
}
writeln!(f, ")")?;
Ok(())
}
}
/// An explicit stack frame used for traversing the trie without using
/// recursion.
///
/// Each frame is tied to the traversal of a single trie state. The frame is
/// dropped once the entire state (and all of its children) have been visited.
/// The "output" of compiling a state is the 'union' vector, which is turn
/// converted to a NFA union state. Each branch of the union corresponds to a
/// chunk in the trie state.
///
/// 'sparse' corresponds to the set of transitions for a particular chunk in a
/// trie state. It is ultimately converted to an NFA sparse state. The 'sparse'
/// field, after being converted to a sparse NFA state, is reused for any
/// subsequent chunks in the trie state, if any exist.
#[derive(Debug)]
struct Frame<'a> {
/// The remaining chunks to visit for a trie state.
chunks: StateChunksIter<'a>,
/// The transitions of the current chunk that we're iterating over. Since
/// every trie state has at least one chunk, every frame is initialized
/// with the first chunk's transitions ready to be consumed.
transitions: core::slice::Iter<'a, Transition>,
/// The NFA state IDs pointing to the start of each chunk compiled by
/// this trie state. This ultimately gets converted to an NFA union once
/// the entire trie state (and all of its children) have been compiled.
/// The order of these matters for leftmost-first match semantics, since
/// earlier matches in the union are preferred over later ones.
union: Vec<StateID>,
/// The actual NFA transitions for a single chunk in a trie state. This
/// gets converted to an NFA sparse state, and its corresponding NFA state
/// ID should get added to 'union'.
sparse: Vec<thompson::Transition>,
}
impl<'a> Frame<'a> {
/// Create a new stack frame for trie traversal. This initializes the
/// 'transitions' iterator to the transitions for the first chunk, with the
/// 'chunks' iterator being every chunk after the first one.
fn new(state: &'a State) -> Frame<'a> {
let mut chunks = state.chunks();
// every state has at least 1 chunk
let chunk = chunks.next().unwrap();
let transitions = chunk.iter();
Frame { chunks, transitions, union: vec![], sparse: vec![] }
}
}
/// A state in a trie.
///
/// This uses a sparse representation. Since we don't use literal tries for
/// searching, and since compilation requires visiting every transition anyway,
/// we use a sparse representation for transitions. This
/// means we save on memory, at the expense of 'LiteralTrie::add' being perhaps
/// a bit slower.
///
/// While 'transitions' is pretty standard as far as tries goes, the 'chunks'
/// piece here is more unusual. In effect, 'chunks' defines a partitioning
/// of 'transitions', where each chunk corresponds to a distinct set of
/// transitions. The key invariant is that a transition in one chunk cannot
/// be moved to another chunk. This is the secret sauce that preserves
/// leftmost-first match semantics.
///
/// A new chunk is added whenever we mark a state as a match state. Once a
/// new chunk is added, the old active chunk is frozen and is never mutated
/// again. The new chunk becomes the active chunk, which is defined as
/// '&transitions[chunks.last().map_or(0, |c| c.1)..]'. Thus, a state where
/// 'chunks' is empty actually contains one chunk, and every state contains
/// at least one (possibly empty) chunk.
///
/// A "leaf" state is a state that has no outgoing transitions (so
/// 'transitions' is empty). Note that there is no way for a leaf state to be a
/// non-matching state. (Although while building the trie, within 'add', a leaf
/// state may exist while not containing any matches. But this invariant is
/// only broken within 'add'. Once 'add' returns, the invariant is upheld.)
#[derive(Clone, Default)]
struct State {
transitions: Vec<Transition>,
chunks: Vec<(usize, usize)>,
}
impl State {
/// Mark this state as a match state and freeze the active chunk such that
/// it cannot be further mutated.
fn add_match(&mut self) {
// This is not strictly necessary, but there's no point in recording
// another match by adding another chunk if the state has no
// transitions. Note though that we only skip this if we already know
// this is a match state, which is only true if 'chunks' is not empty.
// Basically, if we didn't do this, nothing semantically would change,
// but we'd end up pushing another chunk and potentially triggering an
// alloc.
if self.transitions.is_empty() && !self.chunks.is_empty() {
return;
}
let chunk_start = self.active_chunk_start();
let chunk_end = self.transitions.len();
self.chunks.push((chunk_start, chunk_end));
}
/// Returns true if and only if this state is a leaf state. That is, a
/// state that has no outgoing transitions.
fn is_leaf(&self) -> bool {
self.transitions.is_empty()
}
/// Returns an iterator over all of the chunks (including the currently
/// active chunk) in this state. Since the active chunk is included, the
/// iterator is guaranteed to always yield at least one chunk (although the
/// chunk may be empty).
fn chunks(&self) -> StateChunksIter<'_> {
StateChunksIter {
transitions: &*self.transitions,
chunks: self.chunks.iter(),
active: Some(self.active_chunk()),
}
}
/// Returns the active chunk as a slice of transitions.
fn active_chunk(&self) -> &[Transition] {
let start = self.active_chunk_start();
&self.transitions[start..]
}
/// Returns the index into 'transitions' where the active chunk starts.
fn active_chunk_start(&self) -> usize {
self.chunks.last().map_or(0, |&(_, end)| end)
}
}
impl core::fmt::Debug for State {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
let mut spacing = " ";
for (i, chunk) in self.chunks().enumerate() {
if i > 0 {
write!(f, "{}MATCH", spacing)?;
}
spacing = "";
for (j, t) in chunk.iter().enumerate() {
spacing = " ";
if j == 0 && i > 0 {
write!(f, " ")?;
} else if j > 0 {
write!(f, ", ")?;
}
write!(f, "{:?}", t)?;
}
}
Ok(())
}
}
/// An iterator over all of the chunks in a state, including the active chunk.
///
/// This iterator is created by `State::chunks`. We name this iterator so that
/// we can include it in the `Frame` type for non-recursive trie traversal.
#[derive(Debug)]
struct StateChunksIter<'a> {
transitions: &'a [Transition],
chunks: core::slice::Iter<'a, (usize, usize)>,
active: Option<&'a [Transition]>,
}
impl<'a> Iterator for StateChunksIter<'a> {
type Item = &'a [Transition];
fn next(&mut self) -> Option<&'a [Transition]> {
if let Some(&(start, end)) = self.chunks.next() {
return Some(&self.transitions[start..end]);
}
if let Some(chunk) = self.active.take() {
return Some(chunk);
}
None
}
}
/// A single transition in a trie to another state.
#[derive(Clone, Copy)]
struct Transition {
byte: u8,
next: StateID,
}
impl core::fmt::Debug for Transition {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
write!(
f,
"{:?} => {}",
crate::util::escape::DebugByte(self.byte),
self.next.as_usize()
)
}
}
#[cfg(test)]
mod tests {
use bstr::B;
use regex_syntax::hir::Hir;
use super::*;
#[test]
fn zap() {
let mut trie = LiteralTrie::forward();
trie.add(b"zapper").unwrap();
trie.add(b"z").unwrap();
trie.add(b"zap").unwrap();
let got = trie.compile_to_hir();
let expected = Hir::concat(vec![
Hir::literal(B("z")),
Hir::alternation(vec![
Hir::literal(B("apper")),
Hir::empty(),
Hir::literal(B("ap")),
]),
]);
assert_eq!(expected, got);
}
#[test]
fn maker() {
let mut trie = LiteralTrie::forward();
trie.add(b"make").unwrap();
trie.add(b"maple").unwrap();
trie.add(b"maker").unwrap();
let got = trie.compile_to_hir();
let expected = Hir::concat(vec![
Hir::literal(B("ma")),
Hir::alternation(vec![
Hir::concat(vec![
Hir::literal(B("ke")),
Hir::alternation(vec![Hir::empty(), Hir::literal(B("r"))]),
]),
Hir::literal(B("ple")),
]),
]);
assert_eq!(expected, got);
}
}


@@ -0,0 +1,296 @@
// This module contains a couple of simple, purpose-built hash maps. The key
// trade-off they make is that they serve as caches rather than true maps. That
// is, inserting a new entry may cause eviction of another entry. This gives
// us two things. First, there's less overhead associated with inserts and
// lookups. Second, it lets us control our memory usage.
//
// These maps are used in some fairly hot code when generating NFA states for
// large Unicode character classes.
//
// Instead of exposing a rich hashmap entry API, we just permit the caller to
// produce a hash of the key directly. The hash can then be reused for both
// lookups and insertions at the cost of leaking abstraction a bit. But these
// are for internal use only, so it's fine.
//
// The Utf8BoundedMap is used for Daciuk's algorithm for constructing a
// (almost) minimal DFA for large Unicode character classes in linear time.
// (Daciuk's algorithm is always used when compiling forward NFAs. For reverse
// NFAs, it's only used when the compiler is configured to 'shrink' the NFA,
// since there's a bit more expense in the reverse direction.)
//
// The Utf8SuffixMap is used when compiling large Unicode character classes for
// reverse NFAs when 'shrink' is disabled. Specifically, it augments the naive
// construction of UTF-8 automata by caching common suffixes. This doesn't
// get the same space savings as Daciuk's algorithm, but it's basically as
// fast as the naive approach and typically winds up using less memory (since
// it generates smaller NFAs) despite the presence of the cache.
//
// These maps effectively represent caching mechanisms for sparse and
// byte-range NFA states, respectively. The former represents a single NFA
// state with many transitions of equivalent priority while the latter
// represents a single NFA state with a single transition. (Neither state ever
// has or is an epsilon transition.) Thus, they have different key types. It's
// likely we could make one generic map, but the machinery didn't seem worth
// it. They are simple enough.
use alloc::{vec, vec::Vec};
use crate::{
nfa::thompson::Transition,
util::{
int::{Usize, U64},
primitives::StateID,
},
};
// Basic FNV-1a hash constants as described in:
// https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
const PRIME: u64 = 1099511628211;
const INIT: u64 = 14695981039346656037;
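The constants above are the 64-bit FNV-1a offset basis and prime. As a quick sketch of how they are used, the full hash just XORs each byte into the running state and then multiplies by the prime:

```rust
// A minimal 64-bit FNV-1a implementation using the same constants as above.
const PRIME: u64 = 1099511628211;
const INIT: u64 = 14695981039346656037;

fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash = INIT;
    for &b in bytes {
        // FNV-1a: XOR the byte in first, then multiply by the prime.
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

fn main() {
    // The empty input hashes to the offset basis itself.
    assert_eq!(fnv1a(b""), INIT);
    // Known 64-bit FNV-1a test vector: "a" => 0xaf63dc4c8601ec8c.
    assert_eq!(fnv1a(b"a"), 0xaf63dc4c8601ec8c);
    println!("{:#x}", fnv1a(b"a"));
}
```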
/// A bounded hash map where the key is a sequence of NFA transitions and the
/// value is a pre-existing NFA state ID.
///
/// std's hashmap can be used for this. However, this map has two important
/// advantages. Firstly, it has lower overhead. Secondly, it permits us to
/// control our memory usage by limiting the number of slots. In general, the
/// cost here is that this map acts as a cache. That is, inserting a new entry
/// may remove an old entry. We are okay with this, since it does not impact
/// correctness in the cases where it is used. The only effect that dropping
/// states from the cache has is that the resulting NFA generated may be bigger
/// than it otherwise would be.
///
/// This improves benchmarks that compile large Unicode character classes,
/// since it makes the generation of (almost) minimal UTF-8 automata faster.
/// Specifically, one could observe the difference with std's hashmap via
/// something like the following benchmark:
///
/// hyperfine "regex-cli debug thompson -qr --captures none '\w{90} ecurB'"
///
/// But to observe that difference, you'd have to modify the code to use
/// std's hashmap.
///
/// It is quite possible that there is a better way to approach this problem.
/// For example, if there happens to be a very common state that collides with
/// a lot of less frequent states, then we could wind up with very poor caching
/// behavior. Alas, the effectiveness of this cache has not been measured.
/// Instead, ad hoc experiments suggest that it is "good enough." Additional
/// smarts (such as an LRU eviction policy) have to be weighed against the
/// amount of extra time they cost.
#[derive(Clone, Debug)]
pub struct Utf8BoundedMap {
/// The current version of this map. Only entries with matching versions
/// are considered during lookups. If an entry is found with a mismatched
/// version, then the map behaves as if the entry does not exist.
///
/// This makes it possible to clear the map by simply incrementing the
/// version number instead of actually deallocating any storage.
version: u16,
/// The total number of entries this map can store.
capacity: usize,
/// The actual entries, keyed by hash. Collisions between different states
/// result in the old state being dropped.
map: Vec<Utf8BoundedEntry>,
}
/// An entry in this map.
#[derive(Clone, Debug, Default)]
struct Utf8BoundedEntry {
/// The version of the map used to produce this entry. If this entry's
/// version does not match the current version of the map, then the map
/// should behave as if this entry does not exist.
version: u16,
/// The key, which is a sorted sequence of non-overlapping NFA transitions.
key: Vec<Transition>,
/// The state ID corresponding to the state containing the transitions in
/// this entry.
val: StateID,
}
impl Utf8BoundedMap {
/// Create a new bounded map with the given capacity. The map will never
/// grow beyond the given size.
///
/// Note that this does not allocate. Instead, callers must call `clear`
/// before using this map. `clear` will allocate space if necessary.
///
/// This avoids the need to pay for the allocation of this map when
/// compiling regexes that lack large Unicode character classes.
pub fn new(capacity: usize) -> Utf8BoundedMap {
assert!(capacity > 0);
Utf8BoundedMap { version: 0, capacity, map: vec![] }
}
/// Clear this map of all entries, but permit the reuse of allocation
/// if possible.
///
/// This must be called before the map can be used.
pub fn clear(&mut self) {
if self.map.is_empty() {
self.map = vec![Utf8BoundedEntry::default(); self.capacity];
} else {
self.version = self.version.wrapping_add(1);
// If we loop back to version 0, then we forcefully clear the
// entire map. Otherwise, it might be possible to incorrectly
// match entries used to generate other NFAs.
if self.version == 0 {
self.map = vec![Utf8BoundedEntry::default(); self.capacity];
}
}
}
/// Return a hash of the given transitions.
pub fn hash(&self, key: &[Transition]) -> usize {
let mut h = INIT;
for t in key {
h = (h ^ u64::from(t.start)).wrapping_mul(PRIME);
h = (h ^ u64::from(t.end)).wrapping_mul(PRIME);
h = (h ^ t.next.as_u64()).wrapping_mul(PRIME);
}
(h % self.map.len().as_u64()).as_usize()
}
/// Retrieve the cached state ID corresponding to the given key. The hash
/// given must have been computed with `hash` using the same key value.
///
/// If there is no cached state with the given transitions, then None is
/// returned.
pub fn get(&mut self, key: &[Transition], hash: usize) -> Option<StateID> {
let entry = &self.map[hash];
if entry.version != self.version {
return None;
}
// There may be a hash collision, so we need to confirm real equality.
if entry.key != key {
return None;
}
Some(entry.val)
}
/// Add a cached state to this map with the given key. Callers should
/// ensure that `state_id` points to a state that contains precisely the
/// NFA transitions given.
///
/// `hash` must have been computed using the `hash` method with the same
/// key.
pub fn set(
&mut self,
key: Vec<Transition>,
hash: usize,
state_id: StateID,
) {
self.map[hash] =
Utf8BoundedEntry { version: self.version, key, val: state_id };
}
}
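The version-based `clear` used above can be isolated into a self-contained sketch. The `String` keys and `u32` values here are hypothetical stand-ins for the transition sequences and state IDs in the real map; the point is that bumping the version invalidates every slot in O(1), and storage is only re-zeroed when the `u16` counter wraps:

```rust
/// A stripped-down bounded cache showing the version-based clear used
/// above. `String` keys and `u32` values are hypothetical stand-ins
/// for the transition sequences and state IDs in the real map.
#[derive(Clone, Default)]
struct Entry {
    version: u16,
    key: String,
    val: u32,
}

struct BoundedCache {
    version: u16,
    capacity: usize,
    slots: Vec<Entry>,
}

impl BoundedCache {
    fn new(capacity: usize) -> BoundedCache {
        assert!(capacity > 0);
        // Like the maps above, no allocation happens until `clear`.
        BoundedCache { version: 0, capacity, slots: vec![] }
    }

    fn clear(&mut self) {
        if self.slots.is_empty() {
            self.slots = vec![Entry::default(); self.capacity];
        } else {
            // O(1) clear: bumping the version invalidates every slot.
            self.version = self.version.wrapping_add(1);
            // Only on wrap-around must storage actually be reset.
            if self.version == 0 {
                self.slots = vec![Entry::default(); self.capacity];
            }
        }
    }

    fn slot(&self, key: &str) -> usize {
        // Any hash works for the sketch; a byte sum keeps it short.
        key.bytes().map(usize::from).sum::<usize>() % self.slots.len()
    }

    fn get(&self, key: &str) -> Option<u32> {
        let e = &self.slots[self.slot(key)];
        // A stale version or a hash collision both read as "absent".
        if e.version != self.version || e.key != key {
            return None;
        }
        Some(e.val)
    }

    fn set(&mut self, key: String, val: u32) {
        let i = self.slot(&key);
        self.slots[i] = Entry { version: self.version, key, val };
    }
}
```

After `clear` and `set("abc", 5)`, a lookup of `"abc"` returns `Some(5)`; a single further `clear` call makes the same lookup return `None` without touching any slot.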
/// A cache of suffixes used to modestly compress UTF-8 automata for large
/// Unicode character classes.
#[derive(Clone, Debug)]
pub struct Utf8SuffixMap {
/// The current version of this map. Only entries with matching versions
/// are considered during lookups. If an entry is found with a mismatched
/// version, then the map behaves as if the entry does not exist.
version: u16,
/// The total number of entries this map can store.
capacity: usize,
/// The actual entries, keyed by hash. Collisions between different states
/// result in the old state being dropped.
map: Vec<Utf8SuffixEntry>,
}
/// A key that uniquely identifies an NFA state. It is a triple that represents
/// a transition from one state for a particular byte range.
#[derive(Clone, Debug, Default, Eq, PartialEq)]
pub struct Utf8SuffixKey {
pub from: StateID,
pub start: u8,
pub end: u8,
}
/// An entry in this map.
#[derive(Clone, Debug, Default)]
struct Utf8SuffixEntry {
/// The version of the map used to produce this entry. If this entry's
/// version does not match the current version of the map, then the map
/// should behave as if this entry does not exist.
version: u16,
/// The key, which consists of a transition in a particular state.
key: Utf8SuffixKey,
/// The identifier that the transition in the key maps to.
val: StateID,
}
impl Utf8SuffixMap {
/// Create a new bounded map with the given capacity. The map will never
/// grow beyond the given size.
///
/// Note that this does not allocate. Instead, callers must call `clear`
/// before using this map. `clear` will allocate space if necessary.
///
/// This avoids the need to pay for the allocation of this map when
/// compiling regexes that lack large Unicode character classes.
pub fn new(capacity: usize) -> Utf8SuffixMap {
assert!(capacity > 0);
Utf8SuffixMap { version: 0, capacity, map: vec![] }
}
/// Clear this map of all entries, but permit the reuse of allocation
/// if possible.
///
/// This must be called before the map can be used.
pub fn clear(&mut self) {
if self.map.is_empty() {
self.map = vec![Utf8SuffixEntry::default(); self.capacity];
} else {
self.version = self.version.wrapping_add(1);
if self.version == 0 {
self.map = vec![Utf8SuffixEntry::default(); self.capacity];
}
}
}
/// Return a hash of the given transition.
pub fn hash(&self, key: &Utf8SuffixKey) -> usize {
// Basic FNV-1a hash as described:
// https://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
const PRIME: u64 = 1099511628211;
const INIT: u64 = 14695981039346656037;
let mut h = INIT;
h = (h ^ key.from.as_u64()).wrapping_mul(PRIME);
h = (h ^ u64::from(key.start)).wrapping_mul(PRIME);
h = (h ^ u64::from(key.end)).wrapping_mul(PRIME);
(h % self.map.len().as_u64()).as_usize()
}
/// Retrieve the cached state ID corresponding to the given key. The hash
/// given must have been computed with `hash` using the same key value.
///
/// If there is no cached state with the given key, then None is returned.
pub fn get(
&mut self,
key: &Utf8SuffixKey,
hash: usize,
) -> Option<StateID> {
let entry = &self.map[hash];
if entry.version != self.version {
return None;
}
if key != &entry.key {
return None;
}
Some(entry.val)
}
/// Add a cached state to this map with the given key. Callers should
/// ensure that `state_id` points to a state that contains precisely the
/// NFA transition given.
///
/// `hash` must have been computed using the `hash` method with the same
/// key.
pub fn set(&mut self, key: Utf8SuffixKey, hash: usize, state_id: StateID) {
self.map[hash] =
Utf8SuffixEntry { version: self.version, key, val: state_id };
}
}

/*!
Defines a Thompson NFA and provides the [`PikeVM`](pikevm::PikeVM) and
[`BoundedBacktracker`](backtrack::BoundedBacktracker) regex engines.
A Thompson NFA (non-deterministic finite automaton) is arguably _the_ central
data type in this library. It is the result of what is commonly referred to as
"regex compilation." That is, turning a regex pattern from its concrete syntax
string into something that can run a search looks roughly like this:
* A `&str` is parsed into a [`regex-syntax::ast::Ast`](regex_syntax::ast::Ast).
* An `Ast` is translated into a [`regex-syntax::hir::Hir`](regex_syntax::hir::Hir).
* An `Hir` is compiled into a [`NFA`].
* The `NFA` is then used to build one of a few different regex engines:
* An `NFA` is used directly in the `PikeVM` and `BoundedBacktracker` engines.
* An `NFA` is used by a [hybrid NFA/DFA](crate::hybrid) to build out a DFA's
transition table at search time.
* An `NFA`, assuming it is one-pass, is used to build a full
[one-pass DFA](crate::dfa::onepass) ahead of time.
* An `NFA` is used to build a [full DFA](crate::dfa) ahead of time.
The [`meta`](crate::meta) regex engine makes all of these choices for you based
on various criteria. However, if you have a lower level use case, _you_ can
build any of the above regex engines and use them directly. But you must start
here by building an `NFA`.
# Details
It is perhaps worth expanding a bit more on what it means to go through the
`&str`->`Ast`->`Hir`->`NFA` process.
* Parsing a string into an `Ast` gives it a structured representation.
Crucially, the size and amount of work done in this step is proportional to the
size of the original string. No optimization or Unicode handling is done at
this point. This means that parsing into an `Ast` has very predictable costs.
Moreover, an `Ast` can be roundtripped back to its original pattern string as
written.
* Translating an `Ast` into an `Hir` is a process by which the structured
representation is simplified down to its most fundamental components.
Translation deals with flags such as case insensitivity by converting things
like `(?i:a)` to `[Aa]`. Translation is also where Unicode tables are consulted
to resolve things like `\p{Emoji}` and `\p{Greek}`. It also flattens each
character class, regardless of how deeply nested it is, into a single sequence
of non-overlapping ranges. All the various literal forms are thrown out in
favor of one common representation. Overall, the `Hir` is small enough to fit
into your head and makes analysis and other tasks much simpler.
* Compiling an `Hir` into an `NFA` formulates the regex into a finite state
machine whose transitions are defined over bytes. For example, an `Hir` might
have a Unicode character class corresponding to a sequence of ranges defined
in terms of `char`. Compilation is then responsible for turning those ranges
into a UTF-8 automaton. That is, an automaton that matches the UTF-8 encoding
of just the codepoints specified by those ranges. Otherwise, the main job of
an `NFA` is to serve as a byte-code of sorts for a virtual machine. It can be
seen as a sequence of instructions for how to match a regex.
*/
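As a hint of what "turning ranges into a UTF-8 automaton" involves, the byte-level structure that compilation exploits can be seen with nothing but the standard library. This is an illustration of UTF-8 encoding, not of the compiler's actual algorithm:

```rust
// Each contiguous range of codepoints splits into sub-ranges whose
// UTF-8 encodings share a common length and byte structure. For
// example, every codepoint in U+0080..=U+07FF encodes to exactly two
// bytes: a leading byte in 0xC2..=0xDF followed by a continuation
// byte in 0x80..=0xBF. A UTF-8 automaton is built from byte-range
// transitions derived from facts like these.
fn utf8_bytes(ch: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    ch.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

For instance, the endpoints of that two-byte sub-range encode as `utf8_bytes('\u{80}') == [0xC2, 0x80]` and `utf8_bytes('\u{7FF}') == [0xDF, 0xBF]`, while ASCII stays a single byte.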
#[cfg(feature = "nfa-backtrack")]
pub mod backtrack;
mod builder;
#[cfg(feature = "syntax")]
mod compiler;
mod error;
#[cfg(feature = "syntax")]
mod literal_trie;
#[cfg(feature = "syntax")]
mod map;
mod nfa;
#[cfg(feature = "nfa-pikevm")]
pub mod pikevm;
#[cfg(feature = "syntax")]
mod range_trie;
pub use self::{
builder::Builder,
error::BuildError,
nfa::{
DenseTransitions, PatternIter, SparseTransitions, State, Transition,
NFA,
},
};
#[cfg(feature = "syntax")]
pub use compiler::{Compiler, Config, WhichCaptures};
