| 1 | ---
|
| 2 | in_progress: yes
|
| 3 | default_highlighter: oils-sh
|
| 4 | ---
|
| 5 |
|
| 6 | HTM8 - An Easy Subset of HTML5, With Some Errors
|
| 7 | =================================
|
| 8 |
|
| 9 | HTM8 is a data language, which is part of J8 Notation:
|
| 10 |
|
| 11 | - It's a subset of HTML5, so there are **Syntax Errors**
|
| 12 | - It's "for humans"
|
| 13 | - `<li><li>` example
|
| 14 | - It's Easy
|
| 15 | - Easy to Implement - ~700 lines of regular languages and Python
|
| 16 | - And thus Easy to Remember, for users
|
| 17 | - Runs Efficiently - you don't have to materialize a big DOM tree, which
|
| 18 | causes many allocations
|
| 19 | - Convertible to XML?
|
| 20 | - without allocations, with a `sed`-like transformation!
|
| 21 | - low level lexing and matching
|
| 22 | - Ambitious
|
| 23 | - zero-alloc whitelist-based HTML filter for user content
|
| 24 | - zero-alloc browser and CSS-style content queries
|
| 25 |
|
| 26 | Currently, all of the Oils docs are parsed and processed with it.
|
| 27 |
|
| 28 | We would like to "lift it up" into an API for YSH users.
|
| 29 |
|
| 30 | <!--
|
| 31 |
|
| 32 | TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
|
| 33 | and then validated by an XML parser
|
| 34 |
|
| 35 | - lxml - this is supposed to be high quality
|
| 36 |
|
| 37 | - Python stdlib uses expat - https://libexpat.github.io/
|
| 38 |
|
| 39 | - Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
|
| 40 | - do they have the billion laughs bug?
|
| 41 |
|
| 42 | -->
|
| 43 |
|
| 44 | <div id="toc">
|
| 45 | </div>
|
| 46 |
|
| 47 | ## Structure of an HTM8 Doc
|
| 48 |
|
| 49 | ### Tags - Open, Close, Self-Closing
|
| 50 |
|
| 51 | 1. Open `<a>`
|
| 52 | 1. Close `</a>`
|
| 53 | 1. StartEnd (self-closing) `<img/>`
|
| 54 |
|
| 55 | HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores
|
| 56 | the trailing `/`.
|
| 57 |
|
| 58 | We are bringing it back for humans, because we think it's too hard for people to
|
| 59 | remember the 16 void elements.
|
| 60 |
|
| 61 | And unbalanced tags cause visual bugs that are hard to debug. It would
|
| 62 | be better to get an error **earlier**.
|
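To illustrate how explicit self-closing tags allow an early error, here is a minimal Python sketch of a balance check. The names `TAG_RE` and `check_balanced` are hypothetical and not part of the HTM8 implementation, and the simplified pattern ignores comments, CDATA, and quoted attribute values that contain `>`.

```python
import re

# Hypothetical, simplified tag pattern (an assumption for illustration).
# A real HTM8 lexer must also handle comments, CDATA, and quoted values.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*?(/?)>')


def check_balanced(doc):
    """Return None if tags balance, else a short error message."""
    stack = []
    for m in TAG_RE.finditer(doc):
        is_close, name, is_self_closing = m.group(1), m.group(2), m.group(3)
        if is_self_closing:
            continue  # StartEnd tag like <img/>: nothing to push
        if not is_close:
            stack.append(name)  # Open tag
        elif not stack or stack.pop() != name:
            # Close tag that does not match the innermost Open tag
            return 'unexpected </%s> at offset %d' % (name, m.start())
    if stack:
        return 'unclosed <%s>' % stack[-1]
    return None
```

Under these assumptions, `check_balanced('<p><img/></p>')` returns `None`, while `<p><img></p>` is reported as an error at the `</p>`, instead of silently producing a differently shaped tree.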
| 63 |
|
| 64 | ### Attributes - Quotes optional
|
| 65 |
|
| 66 | There are 5 closely related syntaxes:
|
| 67 |
|
| 68 | 1. Missing `<a missing>`
|
| 69 | 1. Empty `<a empty=>`
|
| 70 | 1. Unquoted `<a href=foo>`
|
| 71 | 1. Double Quoted `<a href="foo">`
|
| 72 | 1. Single Quoted `<a href='foo'>`
|
| 73 |
|
| 74 | Note: `<a href=/>` is disallowed because it's ambiguous. Use `<a href="/">` or
|
| 75 | `<a href=/ >` or `<a href= />`.
|
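The five attribute syntaxes above can be told apart with a single regular expression. The following Python sketch is an assumption for illustration, not the actual HTM8 lexer: `ATTR_RE` and `classify_attrs` are hypothetical names, the input is the text between the tag name and the closing `>`, and the `<a href=/>` ambiguity check is left out.

```python
import re

# Hypothetical pattern: an attribute name, optionally followed by '=' and
# then at most one of the three value syntaxes.
ATTR_RE = re.compile(r'''
    (?P<name> [a-zA-Z][a-zA-Z0-9_-]* )
    (?P<eq> =
        (?: " (?P<dq> [^"]* ) "
          | ' (?P<sq> [^']* ) '
          | (?P<unq> [^\s"'<>=`]+ )
        )?
    )?
''', re.VERBOSE)


def classify_attrs(attr_text):
    """Return a list of (name, kind, value) for a tag's attribute text."""
    result = []
    for m in ATTR_RE.finditer(attr_text):
        if m.group('eq') is None:
            kind, value = 'Missing', None      # <a missing>
        elif m.group('dq') is not None:
            kind, value = 'DoubleQuoted', m.group('dq')
        elif m.group('sq') is not None:
            kind, value = 'SingleQuoted', m.group('sq')
        elif m.group('unq') is not None:
            kind, value = 'Unquoted', m.group('unq')
        else:
            kind, value = 'Empty', ''          # <a empty=>
        result.append((m.group('name'), kind, value))
    return result
```

For example, `classify_attrs('missing empty= href=foo')` yields one `Missing`, one `Empty`, and one `Unquoted` attribute, in that order.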
| 76 |
|
| 77 | ### Text - Regular or CDATA
|
| 78 |
|
| 79 | #### Regular Text
|
| 80 |
|
| 81 | - Any UTF-8 text.
|
| 82 | - Generally, `& < > " '` should be escaped as `&amp; &lt; &gt; &quot; &apos;`.
|
| 83 |
|
| 84 | But we are lenient and allow raw `>` between tags:
|
| 85 |
|
| 86 | <p> foo > bar </p>
|
| 87 |
|
| 88 | and raw `<` inside tags:
|
| 89 |
|
| 90 | <span foo="<" > foo </span>
|
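For the strict case, where all five special characters are escaped, a minimal Python sketch follows. The function name `escape_text` is hypothetical. It is written by hand because the standard library's `html.escape` emits `&#x27;` rather than `&apos;` for the single quote.

```python
def escape_text(s):
    """Escape the five special characters with named entities."""
    # '&' must be replaced first, so the '&' introduced by the other
    # replacements is not escaped a second time.
    return (s.replace('&', '&amp;')
             .replace('<', '&lt;')
             .replace('>', '&gt;')
             .replace('"', '&quot;')
             .replace("'", '&apos;'))
```

Because raw `>` is tolerated between tags, a producer that skips the `&gt;` replacement would still emit text that the rules above accept.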
| 91 |
|
| 92 | #### CDATA
|
| 93 |
|
| 94 | Like HTML5, we support explicit `<
|
| 283 | - [table-object-doc.html](table-object-doc.html)
|
| 284 |
|
| 285 |
|
| 286 | ## Brainstorming / TODO
|
| 287 |
|
| 288 | ### Foreign XML with `<svg>` and `<math>` ?
|
| 289 |
|
| 290 | `<svg>` and `<math>` are foreign XML content.
|
| 291 |
|
| 292 | We might want to support this.
|
| 293 |
|
| 294 | - So I can just switch to XML mode in that case
|
| 295 | - TODO: we need a test corpus for this!
|
| 296 | - maybe look for wikipedia content
|
| 297 | - can we also just disallow these? Can you make these into external XML files?
|
| 298 |
|
| 299 | This is one way:
|
| 300 |
|
| 301 | <object data="math.xml" type="application/mathml+xml"></object>
|
| 302 | <object data="drawing.xml" type="image/svg+xml"></object>
|
| 303 |
|
| 304 | Then we don't need special parsing?
|