| 1 | ---
|
| 2 | default_highlighter: oils-sh
|
| 3 | ---
|
| 4 |
|
| 5 | Egg Expressions (YSH Regexes)
|
| 6 | =============================
|
| 7 |
|
| 8 | YSH has a new syntax for patterns, which appears between the `/ /` delimiters:
|
| 9 |
|
| 10 | if (mystr ~ /d+ '.' d+/) {
|
| 11 | echo 'mystr looks like a number N.M'
|
| 12 | }
|
| 13 |
|
| 14 | These patterns are intended to be familiar, but they differ from POSIX or Perl
|
| 15 | expressions in important ways. So we call them *eggexes* rather than
|
| 16 | *regexes*!
|
| 17 |
|
| 18 | <!-- cmark.py expands this -->
|
| 19 | <div id="toc">
|
| 20 | </div>
|
| 21 |
|
| 22 | ## Why Invent a New Language?
|
| 23 |
|
| 24 | - Eggexes let you name **subpatterns** and compose them, which makes them more
|
| 25 | readable and testable.
|
| 26 | - Their **syntax** is vastly simpler because literal characters are **quoted**,
|
| 27 | and operators are not. For example, `^` no longer means three totally
|
| 28 | different things. See the critique at the end of this doc.
|
| 29 | - bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
|
| 30 | more expressive and (in some cases) Perl-like.
|
| 31 | - They're designed to be **translated to any regex dialect**. Right now, the
|
| 32 | YSH shell translates them to ERE so you can use them with common Unix tools:
|
| 33 | - `egrep` (`grep -E`)
|
| 34 | - `awk`
|
| 35 | - GNU `sed --regexp-extended`
|
| 36 | - PCRE syntax is the second most important target.
|
| 37 | - They're **statically parsed** in YSH, so:
|
| 38 | - You can get **syntax errors** at parse time. In contrast, if you embed a
|
| 39 | regex in a string, you don't get syntax errors until runtime.
|
| 40 | - The eggex is part of the [lossless syntax tree][], which means you can do
|
| 41 | linting, formatting, and refactoring on eggexes, just like any other type
|
| 42 | of code.
|
| 43 | - Eggexes support **regular languages** in the mathematical sense, whereas
|
| 44 | regexes are **confused** about the issue. All nonregular eggex extensions
|
| 45 | are prefixed with `!!`, so you can visually audit them for [catastrophic
|
| 46 | backtracking][backtracking]. (Russ Cox, author of the RE2 engine, [has
|
| 47 | written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
|
| 48 | - Eggexes are more fun than regexes!
|
| 49 |
|
| 50 | [backtracking]: https://blog.codinghorror.com/regex-performance/
|
| 51 |
|
| 52 | [lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html
|
| 53 |
|
| 54 | ### Example of Pattern Reuse
|
| 55 |
|
| 56 | Here's a longer example:
|
| 57 |
|
| 58 | # Define a subpattern. 'digit' and 'd' are the same.
|
| 59 | $ var D = / digit{1,3} /
|
| 60 |
|
| 61 | # Use the subpattern
|
| 62 | $ var ip_pat = / D '.' D '.' D '.' D /
|
| 63 |
|
| 64 | # This eggex compiles to an ERE
|
| 65 | $ echo $ip_pat
|
| 66 | [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}
|
| 67 |
|
| 68 | This means you can use it in a very simple way:
|
| 69 |
|
| 70 | $ egrep $ip_pat foo.txt
|
| 71 |
|
| 72 | TODO: You should also be able to inline patterns like this:
|
| 73 |
|
| 74 | egrep $/d+/ foo.txt
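To make the composition idea concrete outside of YSH, here is a minimal Python sketch of the same pattern. It is an analogy, not YSH output: Python's `re` module doesn't support POSIX classes like `[[:digit:]]`, so the sketch assumes a `\d`-style translation, and the names `D`, `DOT`, and `ip_pat` simply mirror the variables above.

```python
import re

# Subpattern: 1 to 3 digits (Python-dialect stand-in for / digit{1,3} /)
D = r'\d{1,3}'

# When composing regex strings, literals must be escaped by hand;
# in an eggex, quoting '.' does this for you.
DOT = re.escape('.')

# Four digit groups separated by literal dots
ip_pat = DOT.join([D, D, D, D])
print(ip_pat)

print(bool(re.fullmatch(ip_pat, '192.168.0.1')))   # True
print(bool(re.fullmatch(ip_pat, '192x168x0x1')))   # False
```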
|
| 75 |
|
| 76 | ### Design Philosophy
|
| 77 |
|
| 78 | - Eggexes can express a **superset** of POSIX and Perl syntax.
|
| 79 | - The language is designed for "dumb", one-to-one, **syntactic** translations.
|
| 80 | That is, translation doesn't rely on understanding the **semantics** of
|
| 81 | regexes. This is because regex implementations have many corner cases and
|
| 82 | incompatibilities, with regard to Unicode, `NUL` bytes, etc.
|
| 83 |
|
| 84 | ### The Expression Language Is Consistent
|
| 85 |
|
| 86 | Eggexes have a consistent syntax:
|
| 87 |
|
| 88 | - Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
|
| 89 | - A sequence of multiple characters looks like `'lit'`, `$var`, etc.
|
| 90 | - Constructs that match **zero** characters look like `%start`, `%word_end`, etc.
|
| 91 | - Entire subpatterns (which may contain alternation, repetition, etc.) are in
|
| 92 | uppercase like `HexDigit`. Important: these are **spliced** as syntax trees,
|
| 93 | not strings, so you **don't** need to think about quoting.
|
| 94 |
|
| 95 | For example, it's easy to see that these patterns all match **three** characters:
|
| 96 |
|
| 97 | / d d d /
|
| 98 | / digit digit digit /
|
| 99 | / dot dot dot /
|
| 100 | / word space word /
|
| 101 | / 'ab' space /
|
| 102 | / 'abc' /
|
| 103 |
|
| 104 | And that these patterns match **two**:
|
| 105 |
|
| 106 | / %start w w /
|
| 107 | / %start 'if' /
|
| 108 | / d d %end /
|
| 109 |
|
| 110 | And that you have to look up the definition of `HexDigit` to know how many
|
| 111 | characters this matches:
|
| 112 |
|
| 113 | / %start HexDigit %end /
|
| 114 |
|
| 115 | Constructs like `. ^ $ \< \>` are deprecated because they break these rules.
|
| 116 |
|
| 117 | ## Expression Primitives
|
| 118 |
|
| 119 | ### `.` Is Now `dot`
|
| 120 |
|
| 121 | But `.` is still accepted. It usually matches any character except a newline,
|
| 122 | although this changes based on flags (e.g. `dotall`, `unicode`).
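As an illustration of how a flag changes what `.` means, here is a small sketch using Python's `re` module. Python is just one possible Perl-like target, picked here as an assumption; other engines spell the flag differently.

```python
import re

# By default, Python's '.' does not match a newline
plain = re.fullmatch(r'.', '\n')
print(plain)                     # None

# With DOTALL, '.' matches any character, including a newline
dotall = re.fullmatch(r'.', '\n', re.DOTALL)
print(dotall is not None)        # True
```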
|
| 123 |
|
| 124 | ### Classes Are Unadorned: `word`, `w`, `alnum`
|
| 125 |
|
| 126 | We accept both Perl and POSIX classes.
|
| 127 |
|
| 128 | - Perl:
|
| 129 | - `d` or `digit`
|
| 130 | - `s` or `space`
|
| 131 | - `w` or `word`
|
| 132 | - POSIX
|
| 133 | - `alpha`, `alnum`, ...
|
| 134 |
|
| 135 | ### Zero-width Assertions Look Like `%this`
|
| 136 |
|
| 137 | - POSIX
|
| 138 | - `%start` is `^`
|
| 139 | - `%end` is `$`
|
| 140 | - PCRE:
|
| 141 | - `%input_start` is `\A`
|
| 142 | - `%input_end` is `\z`
|
| 143 | - `%last_line_end` is `\Z`
|
| 144 | - GNU ERE extensions:
|
| 145 | - `%word_start` is `\<`
|
| 146 | - `%word_end` is `\>`
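The difference between line anchors and input anchors only shows up once a multiline flag is on. Here is a small sketch with Python's `re` module, used only as a convenient Perl-like engine. The sample `text` is made up for illustration.

```python
import re

text = 'first\nsecond'

# ^ normally anchors at the start of the input only
no_flag = re.search(r'^second', text)
print(no_flag)                         # None

# With MULTILINE, ^ also matches right after each newline
multi = re.search(r'^second', text, re.MULTILINE)
print(multi is not None)               # True

# \A always means start of input, regardless of flags
input_start = re.search(r'\Asecond', text, re.MULTILINE)
print(input_start)                     # None
```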
|
| 147 |
|
| 148 | ### Single-Quoted Strings
|
| 149 |
|
| 150 | - `'hello *world*'` becomes a regex-escaped string
|
| 151 |
|
| 152 | Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
|
| 153 | a strings into an eggex:
|
| 154 |
|
| 155 | / 'xyz ' @var /
|
| 156 |
|
| 157 | ## Compound Expressions
|
| 158 |
|
| 159 | ### Sequence and Alternation Are Unchanged
|
| 160 |
|
| 161 | - `x y` matches `x` and `y` in sequence
|
| 162 | - `x | y` matches `x` or `y`
|
| 163 |
|
| 164 | You can also write a more Pythonic alternative: `x or y`.
|
| 165 |
|
| 166 | ### Repetition Is Unchanged In Common Cases, and Better in Rare Cases
|
| 167 |
|
| 168 | Repetition is just like POSIX ERE or Perl:
|
| 169 |
|
| 170 | - `x?`, `x+`, `x*`
|
| 171 | - `x{3}`, `x{1,3}`
|
| 172 |
|
| 173 | We've reserved syntactic space for PCRE and Python variants:
|
| 174 |
|
| 175 | - lazy/non-greedy: `x{L +}`, `x{L 3,4}`
|
| 176 | - possessive: `x{P +}`, `x{P 3,4}`
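For reference, this is how greedy and lazy repetition differ in an engine that supports both. The sketch uses Python's `re` syntax, where lazy repetition is written `+?`. The mapping from `x{L +}` to `x+?` is an assumption about what a Perl-like translation would produce, since the eggex syntax is only reserved.

```python
import re

s = '<a><b>'

# Greedy: + consumes as much as possible, then backs off
greedy = re.match(r'<.+>', s).group(0)
print(greedy)    # <a><b>

# Lazy: +? consumes as little as possible
lazy = re.match(r'<.+?>', s).group(0)
print(lazy)      # <a>
```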
|
| 177 |
|
| 178 | ### Negation Consistently Uses !
|
| 179 |
|
| 180 | You can negate named char classes:
|
| 181 |
|
| 182 | / !digit /
|
| 183 |
|
| 184 | and char class literals:
|
| 185 |
|
| 186 | / ![ a-z A-Z ] /
|
| 187 |
|
| 188 | Sometimes you can do both:
|
| 189 |
|
| 190 | / ![ !digit ] / # translates to /[^\D]/ in PCRE
|
| 191 | # error in ERE because it can't be expressed
|
| 192 |
|
| 193 |
|
| 194 | You can also negate "regex modifiers" / compilation flags:
|
| 195 |
|
| 196 | / word ; ignorecase / # flag on
|
| 197 | / word ; !ignorecase / # flag off
|
| 198 | / word ; !i / # abbreviated
|
| 199 |
|
| 200 | In contrast, regexes have many confusing syntaxes for negation:
|
| 201 |
|
| 202 | [^abc] vs. [abc]
|
| 203 | [^[:digit:]] vs. [[:digit:]]
|
| 204 |
|
| 205 | \D vs. \d
|
| 206 |
|
| 207 | /\w/-i vs /\w/i
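The regex spellings above can be checked against each other in any Perl-like engine. This Python sketch shows that the "double negative" `[^\D]` is just another way to write `\d`; the sample string is made up.

```python
import re

s = 'a1 b2'

# [^\D] is "not a non-digit", i.e. a digit
double_neg = re.findall(r'[^\D]', s)
print(double_neg)                      # ['1', '2']
print(re.findall(r'\d', s))            # ['1', '2']

# \D and [^\d] are two spellings of the same negation
non_digits = re.findall(r'\D', s)
print(non_digits == re.findall(r'[^\d]', s))   # True
```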
|
| 208 |
|
| 209 | ### Splice Other Patterns `@var_name` or `UpperCaseVarName`
|
| 210 |
|
| 211 | This allows you to reuse patterns. Using uppercase variables:
|
| 212 |
|
| 213 | var D = / digit{3} /
|
| 214 |
|
| 215 | var ip_addr = / D '.' D '.' D '.' D /
|
| 216 |
|
| 217 | Using normal variables:
|
| 218 |
|
| 219 | var part = / digit{3} /
|
| 220 |
|
| 221 | var ip_addr = / @part '.' @part '.' @part '.' @part /
|
| 222 |
|
| 223 | This is similar to how `lex` and `re2c` work.
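Why it matters that patterns are spliced as syntax trees can be seen by contrast with string-based composition. In the Python sketch below (an analogy, not part of YSH), naively concatenating regex strings changes the meaning when the subpattern contains alternation. The names `sub`, `naive`, and `grouped` are illustrative.

```python
import re

sub = 'ab|cd'              # intended as one unit: 'ab' or 'cd'

naive = sub + 'e'          # 'ab|cde' parses as 'ab' OR 'cde'
grouped = f'(?:{sub})e'    # explicit grouping restores the intent

print(bool(re.fullmatch(naive, 'abe')))     # False
print(bool(re.fullmatch(grouped, 'abe')))   # True
```

Tree-based splicing is meant to avoid this class of bug, because the subpattern is inserted as a unit rather than as text.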
|
| 224 |
|
| 225 | ### Group With `()`
|
| 226 |
|
| 227 | Parentheses are used for precedence:
|
| 228 |
|
| 229 | ('foo' | 'bar')+
|
| 230 |
|
| 231 | See note below: When translating to POSIX ERE, grouping becomes a capturing
|
| 232 | group. POSIX ERE has no non-capturing groups.
|
| 233 |
|
| 234 |
|
| 235 | ### Capture with `<capture ...>`
|
| 236 |
|
| 237 | Here's a positional capture:
|
| 238 |
|
| 239 | <capture d+> # Becomes _group(1)
|
| 240 |
|
| 241 | Add a variable after `as` for named capture:
|
| 242 |
|
| 243 | <capture d+ as month> # Becomes _group('month')
|
| 244 |
|
| 245 | You can also add type conversion functions:
|
| 246 |
|
| 247 | <capture d+ : int> # _group(1) returns an Int, not Str
|
| 248 | <capture d+ as month: int> # _group('month') returns an Int, not Str
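As a point of comparison, a typical Perl-like API leaves the conversion to the caller. The Python sketch below shows the rough equivalent of a named capture followed by an `int` conversion. The `(?P<name>...)` syntax is Python's, and the sample string is made up for illustration.

```python
import re

m = re.search(r'(?P<month>\d+)-(?P<day>\d+)', 'due 04-01')

raw = m.group('month')     # captures are always strings here
month = int(raw)           # conversion is a separate, manual step

print(repr(raw), month)    # '04' 4
```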
|
| 249 |
|
| 250 | ### Character Class Literals Use `[]`
|
| 251 |
|
| 252 | Example:
|
| 253 |
|
| 254 | [ a-f 'A'-'F' \xFF \u{03bc} \n \\ \' \" \0 ]
|
| 255 |
|
| 256 | Terms:
|
| 257 |
|
| 258 | - Ranges: `a-f` or `'A' - 'F'`
|
| 259 | - Literals: `\n`, `\x01`, `\u{3bc}`, etc.
|
| 260 | - Sets specified as strings: `'abc'`
|
| 261 |
|
| 262 | Only letters, numbers, and the underscore may be unquoted:
|
| 263 |
|
| 264 | /['a'-'f' 'A'-'F' '0'-'9']/
|
| 265 | /[a-f A-F 0-9]/ # Equivalent to the above
|
| 266 |
|
| 267 | /['!' - ')']/ # Correct range
|
| 268 | /[!-)]/ # Syntax Error
|
| 269 |
|
| 270 | Ranges must be separated by spaces:
|
| 271 |
|
| 272 | No:
|
| 273 |
|
| 274 | /[a-fA-F0-9]/
|
| 275 |
|
| 276 | Yes:
|
| 277 |
|
| 278 | /[a-f A-F 0-9]/
|
| 279 |
|
| 280 | ### Backtracking Constructs Use `!!` (Discouraged)
|
| 281 |
|
| 282 | If you want to translate to PCRE, you can use these.
|
| 283 |
|
| 284 | !!REF 1
|
| 285 | !!REF name
|
| 286 |
|
| 287 | !!AHEAD( d+ )
|
| 288 | !!NOT_AHEAD( d+ )
|
| 289 | !!BEHIND( d+ )
|
| 290 | !!NOT_BEHIND( d+ )
|
| 291 |
|
| 292 | !!ATOMIC( d+ )
|
| 293 |
|
| 294 | Since they all begin with `!!`, you can visually audit your code for potential
|
| 295 | performance problems.
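For orientation, here are the Perl-style constructs these names presumably correspond to, written in Python's `re` syntax. The mapping in the comments is an assumption based on the names alone, so check the translator's actual output before relying on it.

```python
import re

# Backreference (compare !!REF 1): the same text must repeat
same = re.fullmatch(r'(\w+) \1', 'hey hey')
diff = re.fullmatch(r'(\w+) \1', 'hey you')
print(same is not None, diff is not None)   # True False

# Lookahead (compare !!AHEAD): assert digits follow, without consuming them
ahead = re.search(r'x(?=\d+)', 'x42')
print(ahead.group(0))                       # x

# Negative lookbehind (compare !!NOT_BEHIND): 'b' not preceded by 'a'
behind = re.search(r'(?<!a)b', 'ab cb')
print(behind.start())                       # 4
```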
|
| 296 |
|
| 297 | ## Outside the Expression Language
|
| 298 |
|
| 299 | ### Flags and Translation Preferences (`;`)
|
| 300 |
|
| 301 | Flags or "regex modifiers" appear after a semicolon:
|
| 302 |
|
| 303 | / digit+ ; i / # ignore case
|
| 304 |
|
| 305 | A translation preference is specified after a second semicolon:
|
| 306 |
|
| 307 | / digit+ ; ; ERE / # translates to [[:digit:]]+
|
| 308 | / digit+ ; ; python / # could translate to \d+
|
| 309 |
|
| 310 | Flags and translation preferences together:
|
| 311 |
|
| 312 | / digit+ ; ignorecase ; python / # could translate to (?i)\d+
|
| 313 |
|
| 314 | In Oils, the following flags are currently supported:
|
| 315 |
|
| 316 | #### `reg_icase` / `i` (Ignore Case)
|
| 317 |
|
| 318 | Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
|
| 319 | 'FOO', but `/'foo'/` doesn't.
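In a Perl-like target, the same flag can be expressed inline, in the style of `(?i)\d+`. Here is a sketch in Python, where the flag can be given either inside the pattern or through the API:

```python
import re

inline = re.fullmatch(r'(?i)foo', 'FOO')            # inline flag
api = re.fullmatch(r'foo', 'FOO', re.IGNORECASE)    # flag passed to the API
strict = re.fullmatch(r'foo', 'FOO')                # no flag

print(inline is not None, api is not None, strict is not None)
# True True False
```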
|
| 320 |
|
| 321 | #### `reg_newline` (Multiline)
|
| 322 |
|
| 323 | With this flag, `%end` will match before a newline and `%start` will match
|
| 324 | after a newline.
|
| 325 |
|
| 326 | = u'abc123\n' ~ / digit %end ; reg_newline / # true
|
| 327 | = u'abc\n123' ~ / %start digit ; reg_newline / # true
|
| 328 |
|
| 329 | Without the flag, `%start` and `%end` only match from the start or end of the
|
| 330 | string, respectively.
|
| 331 |
|
| 332 | = u'abc123\n' ~ / digit %end / # false
|
| 333 | = u'abc\n123' ~ / %start digit / # false
|
| 334 |
|
| 335 | With the `reg_newline` flag, `dot` and negated classes like `![abc]` don't match a newline.
|
| 336 |
|
| 337 | = u'\n' ~ / . ; reg_newline / # false
|
| 338 | = u'\n' ~ / !digit ; reg_newline / # false
|
| 339 |
|
| 340 | Without the flag, the newline `\n` is treated as an ordinary character.
|
| 341 |
|
| 342 | = u'\n' ~ / . / # true
|
| 343 | = u'\n' ~ / !digit / # true
|
| 344 |
|
| 345 | ### Multiline Syntax
|
| 346 |
|
| 347 | You can spread regexes over multiple lines and add comments:
|
| 348 |
|
| 349 | var x = ///
|
| 350 | digit{4} # year e.g. 2001
|
| 351 | '-'
|
| 352 | digit{2} # month e.g. 06
|
| 353 | '-'
|
| 354 | digit{2} # day e.g. 31
|
| 355 | ///
|
| 356 |
|
| 357 | (Not yet implemented in YSH.)
|
| 358 |
|
| 359 | ### The YSH API
|
| 360 |
|
| 361 | See the [YSH regex API](ysh-regex-api.html) for details.
|
| 362 |
|
| 363 | In summary, YSH has Perl-like conveniences with a `~` operator:
|
| 364 |
|
| 365 | var s = 'on 04-01, 10-31'
|
| 366 | var pat = /<capture d+ as month> '-' <capture d+ as day>/
|
| 367 |
|
| 368 | if (s ~ pat) { # search for the pattern
|
| 369 | echo $[_group('month')] # => 04
|
| 370 | }
|
| 371 |
|
| 372 | It also has an explicit and powerful Python-like API with the `search()` and
|
| 373 | `leftMatch()` methods on strings:
|
| 374 |
|
| 375 | var m = s => search(pat, pos=8) # start searching at a position
|
| 376 | if (m) {
|
| 377 | echo $[m => group('month')] # => 10
|
| 378 | }
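For readers who know Python, the corresponding calls in its `re` module look like this. It's only an analogy to show where the design comes from. The YSH methods themselves are documented on the API page linked above.

```python
import re

s = 'on 04-01, 10-31'
pat = re.compile(r'(?P<month>\d+)-(?P<day>\d+)')

first = pat.search(s)          # first match anywhere in the string
print(first.group('month'))    # 04

later = pat.search(s, 8)       # start searching at position 8
print(later.group('month'))    # 10
```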
|
| 379 |
|
| 380 | ### Language Reference
|
| 381 |
|
| 382 | - See the bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
|
| 383 | the concrete syntax.
|
| 384 | - See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
|
| 385 | the abstract syntax.
|
| 386 |
|
| 387 | ## Usage Notes
|
| 388 |
|
| 389 | ### Use character literals rather than C-Escaped strings
|
| 390 |
|
| 391 | No:
|
| 392 |
|
| 393 | / $'foo\tbar' / # Match 7 characters including a tab, but it's hard to read
|
| 394 | / r'foo\tbar' / # The string must contain 8 chars including '\' and 't'
|
| 395 |
|
| 396 | Yes:
|
| 397 |
|
| 398 | # Instead, take advantage of char literals and implicit regex concatenation
|
| 399 | / 'foo' \t 'bar' /
|
| 400 | / 'foo' \\ 'tbar' /
|
| 401 |
|
| 402 |
|
| 403 | ## POSIX ERE Limitations
|
| 404 |
|
| 405 | ### Repetition of Strings Requires Grouping
|
| 406 |
|
| 407 | Repetitions like `* + ?` apply only to the last character, so literal strings
|
| 408 | need extra grouping:
|
| 409 |
|
| 410 |
|
| 411 | No:
|
| 412 |
|
| 413 | 'foo'+
|
| 414 |
|
| 415 | Yes:
|
| 416 |
|
| 417 | <capture 'foo'>+
|
| 418 |
|
| 419 | Also OK:
|
| 420 |
|
| 421 | ('foo')+ # this is a CAPTURING group in ERE
|
| 422 |
|
| 423 | This is necessary because ERE doesn't have non-capturing groups like Perl's
|
| 424 | `(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
|
| 425 | constructs that change the meaning of the pattern.
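The precedence rule behind this is shared by ERE and Perl-like engines, so it can be checked quickly in Python. The sketch below shows what a quantifier placed directly after a multi-character literal does, with and without a group:

```python
import re

# Without a group, + applies only to the final 'o'
last_char = re.fullmatch(r'foo+', 'fooo')
not_whole = re.fullmatch(r'foo+', 'foofoo')
print(last_char is not None, not_whole is not None)   # True False

# With a group, + repeats the whole string
whole = re.fullmatch(r'(foo)+', 'foofoo')
print(whole is not None)                              # True
```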
|
| 426 |
|
| 427 | ### Unicode char literals are limited in range
|
| 428 |
|
| 429 | ERE can't represent this set of 1 character reliably:
|
| 430 |
|
| 431 | / [ \u{0100} ] / # This char is 2 bytes encoded in UTF-8
|
| 432 |
|
| 433 | These sets are accepted:
|
| 434 |
|
| 435 | / [ \u{1} \u{2} ] / # set of 2 chars
|
| 436 | / [ \x01 \x02 ] / # set of 2 bytes
|
| 437 |
|
| 438 | They happen to be identical when translated to ERE, but may not be when
|
| 439 | translated to PCRE.
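The byte counts involved are easy to verify. This Python sketch only inspects UTF-8 encodings and doesn't call any regex engine. The remark about bracket expressions in the comment is an interpretation of the limitation described above, not a statement about a particular `libc`.

```python
# U+0100 needs two bytes in UTF-8, so a byte-oriented bracket
# expression would presumably see two separate members
wide = '\u0100'.encode('utf-8')
print(wide, len(wide))        # b'\xc4\x80' 2

# U+0001 and U+0002 are single bytes, identical to \x01 and \x02
narrow = '\u0001\u0002'.encode('utf-8')
print(narrow == b'\x01\x02')  # True
```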
|
| 440 |
|
| 441 | ### Don't put non-ASCII bytes in string sets in char classes
|
| 442 |
|
| 443 | This is a sequence of characters:
|
| 444 |
|
| 445 | / $'\xfe\xff' /
|
| 446 |
|
| 447 | This is a **set** of characters that is illegal:
|
| 448 |
|
| 449 | / [ $'\xfe\xff' ] / # set or sequence? It's confusing
|
| 450 |
|
| 451 | This is a better way to write it:
|
| 452 |
|
| 453 | / [ \xfe \xff ] / # set of 2 chars
|
| 454 |
|
| 455 | ### Char class literals: `^ - ] \`
|
| 456 |
|
| 457 | The literal characters `^ - ] \` are problematic because they can be confused
|
| 458 | with operators.
|
| 459 |
|
| 460 | - `^` means negation
|
| 461 | - `-` means range
|
| 462 | - `]` closes the character class
|
| 463 | - `\` is usually literal, but GNU gawk has an extension to make it an escaping
|
| 464 | operator
|
| 465 |
|
| 466 | The Eggex-to-ERE translator is smart enough to handle cases like this:
|
| 467 |
|
| 468 | var pat = / ['^' 'x'] /
|
| 469 | # translated to [x^], not [^x] for correctness
|
| 470 |
|
| 471 | However, cases like this are a fatal runtime error:
|
| 472 |
|
| 473 | var pat1 = / ['a'-'^'] /
|
| 474 | var pat2 = / ['a'-'-'] /
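Why the position of `^` matters can be seen in any engine with bracket expressions. Here is a Python sketch, with a made-up sample string:

```python
import re

s = 'a^x'

# ^ in any position but the first is a literal member of the set
literal = re.findall(r'[x^]', s)
print(literal)    # ['^', 'x']

# ^ in first position negates the set
negated = re.findall(r'[^x]', s)
print(negated)    # ['a', '^']
```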
|
| 475 |
|
| 476 | ## Critiques
|
| 477 |
|
| 478 | ### Regexes Are Hard To Read
|
| 479 |
|
| 480 | ... because the **same symbol can mean many things**.
|
| 481 |
|
| 482 | `^` could mean:
|
| 483 |
|
| 484 | - Start of the string/line
|
| 485 | - Negated character class like `[^abc]`
|
| 486 | - Literal character `^` like `[abc^]`
|
| 487 |
|
| 488 | `\` is used in:
|
| 489 |
|
| 490 | - Character classes like `\w` or `\d`
|
| 491 | - Zero-width assertions like `\b`
|
| 492 | - Escaped characters like `\n`
|
| 493 | - Quoted characters like `\+`
|
| 494 |
|
| 495 | `?` could mean:
|
| 496 |
|
| 497 | - optional: `a?`
|
| 498 | - lazy match: `a+?`
|
| 499 | - some other kind of grouping:
|
| 500 | - `(?P<named>\d+)`
|
| 501 | - `(?:noncapturing)`
|
| 502 |
|
| 503 | With egg expressions, each construct has a **distinct syntax**.
|
| 504 |
|
| 505 | ### YSH Is Shorter Than Bash
|
| 506 |
|
| 507 | Bash:
|
| 508 |
|
| 509 | if [[ $x =~ [[:digit:]]+ ]]; then
|
| 510 | echo 'x looks like a number'
|
| 511 | fi
|
| 512 |
|
| 513 | Compare with YSH:
|
| 514 |
|
| 515 | if (x ~ /digit+/) {
|
| 516 | echo 'x looks like a number'
|
| 517 | }
|
| 518 |
|
| 519 | ### ... and Perl
|
| 520 |
|
| 521 | Perl:
|
| 522 |
|
| 523 | $x =~ /\d+/
|
| 524 |
|
| 525 | YSH:
|
| 526 |
|
| 527 | x ~ /d+/
|
| 528 |
|
| 529 |
|
| 530 | The Perl expression has three more punctuation characters:
|
| 531 |
|
| 532 | - YSH doesn't require sigils in expression mode
|
| 533 | - The match operator is `~`, not `=~`
|
| 534 | - Named character classes are unadorned like `d`. If that's too short, you can
|
| 535 | also write `digit`.
|
| 536 |
|
| 537 | ## Design Notes
|
| 538 |
|
| 539 | ### Eggexes In Other Languages
|
| 540 |
|
| 541 | The eggex syntax can be incorporated into other tools and shells. It's
|
| 542 | designed to be separate from YSH -- hence the separate name.
|
| 543 |
|
| 544 | Notes:
|
| 545 |
|
| 546 | - Single quoted string literals should **disallow** internal backslashes, and
|
| 547 | treat all other characters literally. Instead, users can write `/ 'foo' \t
|
| 548 | 'sq' \' bar \n /` — i.e. implicit concatenation of strings and
|
| 549 | characters, described above.
|
| 550 | - To make eggexes portable between languages, don't use the host language's
|
| 551 | syntax for string literals (at least for single-quoted strings).
|
| 552 |
|
| 553 | ### Backward Compatibility
|
| 554 |
|
| 555 | Eggexes aren't backward compatible in general, but they retain some legacy
|
| 556 | operators like `^ . $` to ease the transition. These expressions are valid
|
| 557 | eggexes **and** valid POSIX EREs:
|
| 558 |
|
| 559 | .*
|
| 560 | ^[0-9]+$
|
| 561 | ^.{1,3}|[0-9][0-9]?$
|
| 562 |
|
| 563 | ## FAQ
|
| 564 |
|
| 565 | ### The Name Sounds Funny.
|
| 566 |
|
| 567 | If "eggex" sounds too much like "regex" to you, simply say "egg expression".
|
| 568 | It won't be confused with "regular expression" or "regex".
|
| 569 |
|
| 570 | ### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?
|
| 571 |
|
| 572 | All three languages support pattern composition and have quoted literals. And
|
| 573 | they have the goal of improving upon Perl 5 regex syntax, which has made its
|
| 574 | way into every major programming language (Python, Java, C++, etc.).
|
| 575 |
|
| 576 | The main difference is that Eggexes are meant to be used with **existing**
|
| 577 | regex engines. For example, you translate them to a POSIX ERE, which is
|
| 578 | executed by `egrep` or `awk`. Or you translate them to a Perl-like syntax and
|
| 579 | use them in Python, JavaScript, Java, or C++ programs.
|
| 580 |
|
| 581 | Raku (formerly Perl 6) and Rosie have their **own engines** that are more powerful than PCRE,
|
| 582 | Python, etc. That means they **cannot** be used this way.
|
| 583 |
|
| 584 | [rosie]: https://rosie-lang.org/
|
| 585 |
|
| 586 | [raku-regex]: https://docs.raku.org/language/regexes
|
| 587 |
|
| 588 | ### What About Eggex versus Parsing Expression Grammars? (PEGs)
|
| 589 |
|
| 590 | The short answer is that they can be complementary: PEGs are closer to
|
| 591 | **parsing**, while eggex and [regular languages]($xref:regular-language) are
|
| 592 | closer to **lexing**. Related:
|
| 593 |
|
| 594 | - [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
|
| 595 | - [Why Lexing and Parsing Should Be
|
| 596 | Separate](https://github.com/oilshell/oil/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)
|
| 597 |
|
| 598 | The PEG model is more resource intensive, but it can recognize more languages,
|
| 599 | and it can recognize recursive structure (trees).
|
| 600 |
|
| 601 | ### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?
|
| 602 |
|
| 603 | Because the meanings of `.` `^` and `$` are usually affected by regex engine
|
| 604 | flags, like `dotall`, `multiline`, and `unicode`.
|
| 605 |
|
| 606 | As a result, the names mean nothing more than "however your regex engine
|
| 607 | interprets `.` `^` and `$`".
|
| 608 |
|
| 609 | As mentioned in the "Design Philosophy" section above, eggex only does a superficial,
|
| 610 | one-to-one translation. It doesn't understand the details of which characters
|
| 611 | will be matched under which engine.
|
| 612 |
|
| 613 | ### Where Do I Send Feedback?
|
| 614 |
|
| 615 | Eggexes are implemented in YSH, but not yet set in stone.
|
| 616 |
|
| 617 | Please try them, as described in [this
|
| 618 | post](http://www.oilshell.org/blog/2019/08/22.html) and the
|
| 619 | [README]($oils-src:README.md), and send us feedback!
|
| 620 |
|
| 621 | You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
|
| 622 | or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
|
| 623 | with GitHub, etc.)
|