| 1 | ---
|
| 2 | in_progress: yes
|
| 3 | default_highlighter: oils-sh
|
| 4 | ---
|
| 5 |
|
| 6 | Doc Processing in YSH - Notation, Query, Templating
|
| 7 | ====================================================
|
| 8 |
|
| 9 | This is a slogan for "maximalist YSH" design:
|
| 10 |
|
| 11 | *Documents, Objects, and Tables - HTML, JSON, and CSV* †
|
| 12 |
|
| 13 | This design doc is about the first part - **documents** and document processing.
|
| 14 |
|
| 15 | † from a paper about the C# language
|
| 16 |
|
| 17 | <div id="toc">
|
| 18 | </div>
|
| 19 |
|
| 20 | ## Intro
|
| 21 |
|
| 22 | Let's sketch a design for 3 aspects of doc processing:
|
| 23 |
|
| 24 | 1. HTM8 Notation - A **subset** of HTML5 meant for easy implementation, with
|
| 25 | regular languages.
|
| 26 | - It's part of J8 Notation (although it does not use J8 strings, like JSON8
|
| 27 | and TSV8 do.)
|
| 28 | - It's very important to understand that this is HTM8, not HTML8!
|
| 29 | 1. A subset of CSS for querying
|
| 30 | 1. Templating in the Markaby style (a bit like Lisp, but unlike JSX templates)
|
| 31 |
|
| 32 | The basic goal is to write ad hoc HTML processors.
|
| 33 |
|
| 34 | YSH programs should loosely follow the style of the DOM API in web browsers,
|
| 35 | e.g. `document.querySelectorAll('table#mytable')` and the doc fragments it
|
| 36 | returns.
|
| 37 |
|
| 38 | Note that the DOM API is not available in node.js or Deno by default, much less
|
| 39 | any alternative lightweight JavaScript runtimes.
|
| 40 |
|
| 41 | I believe we can include something that's simpler, and just as powerful,
|
| 42 | in YSH.
|
| 43 |
|
| 44 | ## Use Cases for HTML Processing
|
| 45 |
|
| 46 | These use cases illustrate what the design is meant to support.
|
| 47 |
|
| 48 | 1. making Oils cross-ref.html
|
| 49 | - query and replacement
|
| 50 | 1. table language - md-ul-table
|
| 51 | - query and replacement
|
| 52 | - many tables to make here
|
| 53 | 1. safe HTML subset, e.g. for publishing user results on continuous build
|
| 54 | - well I think I want to encode the policy, like
|
| 55 | - query
|
| 56 |
|
| 57 | Design goals:
|
| 58 |
|
| 59 | - Simple format that can be re-implemented anywhere
|
| 60 | - a few re2c expressions
|
| 61 | - Fast
|
| 62 | - re2c generates C code
|
| 63 | - Few allocations
|
| 64 | - much simpler than an entire browser engine
|
| 65 |
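To make the "few re2c expressions" goal concrete, here is a rough sketch in Python rather than re2c or YSH. The names `TOKEN_RE`, `lex()`, and `find_by_id()` are hypothetical, and the grammar is much smaller than real HTM8 (no doctype, no CDATA, no `>` inside quoted attribute values). It only shows the shape: one regular expression, spans instead of copied strings, and a query that never builds a tree.

```python
import re

# Hypothetical token grammar: one regex with a named group per token kind.
TOKEN_RE = re.compile(r"""
    (?P<comment>   <!--.*?-->                      )
  | (?P<end_tag>   </[a-zA-Z][a-zA-Z0-9-]*\s*>     )
  | (?P<start_tag> <[a-zA-Z][a-zA-Z0-9-]*[^<>]*>   )
  | (?P<text>      [^<]+                           )
""", re.VERBOSE | re.DOTALL)

# Assumption: ids are always written as id="..." with double quotes.
ID_RE = re.compile(r'\sid="([^"]*)"')


def lex(markup):
    """Yield (kind, start, end) spans over markup, without copying text."""
    pos = 0
    while pos < len(markup):
        m = TOKEN_RE.match(markup, pos)
        if m is None:
            raise ValueError(f"invalid markup at offset {pos}")
        yield m.lastgroup, m.start(), m.end()
        pos = m.end()


def find_by_id(markup, wanted):
    """Return the (start, end) span of the first start tag with that id."""
    for kind, start, end in lex(markup):
        if kind == "start_tag":
            m = ID_RE.search(markup, start, end)
            if m and m.group(1) == wanted:
                return start, end
    return None
```

Because the lexer only yields offsets, a query allocates almost nothing beyond the match objects, which is the property the "few allocations" bullet is after.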
|
| 66 | ## Operations
|
| 67 |
|
| 68 | - `doc('<p>')` - validates it and creates a value.Obj
|
| 69 | - `docQuery(mydoc, '#element')` - does a simple search
|
| 70 |
|
| 71 | Constructors:
|
| 72 |
|
| 73 |     doc {  # prints valid HTM8
|
| 74 |       p {
|
| 75 |         echo 'hi'
|
| 76 |       }
|
| 77 |       p {
|
| 78 |         'hi'  # I think I want to turn on this auto-quote feature
|
| 79 |       }
|
| 80 |       raw '<b>bold</b>'
|
| 81 |     }
|
| 82 |
|
| 83 | And then, to capture the output instead of printing it:
|
| 84 |
|
| 85 |     doc (&mydoc) {  # captures the output, and creates a value.Obj
|
| 86 |       p {
|
| 87 |         'hi'  # I think I want to turn on this auto-quote feature
|
| 88 |         "hi $x"
|
| 89 |       }
|
| 90 |     }
|
| 91 |
|
| 92 | This is the same pattern as the table constructor.
|
| 93 |
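Here is a sketch of the Markaby-style constructor in Python, assuming `with` blocks can stand in for YSH blocks. `Doc`, `tag()`, `text()`, and `raw()` are made-up names for illustration, not the YSH API. The point is the split between escaped text (the auto-quoted `'hi'`) and `raw`, which trusts the caller.

```python
import html
from contextlib import contextmanager


class Doc:
    """Collects markup fragments; `with` blocks open and close tags."""

    def __init__(self):
        self.parts = []

    @contextmanager
    def tag(self, name, **attrs):
        attr_text = "".join(
            f' {k}="{html.escape(str(v))}"' for k, v in attrs.items())
        self.parts.append(f"<{name}{attr_text}>")
        try:
            yield self
        finally:
            # Close the tag even if the block body raises.
            self.parts.append(f"</{name}>")

    def text(self, s):
        # Escaped by default, like the auto-quoted string in the sketch.
        self.parts.append(html.escape(s))

    def raw(self, s):
        # The caller asserts that s is already valid markup.
        self.parts.append(s)

    def __str__(self):
        return "".join(self.parts)


d = Doc()
with d.tag("p"):
    d.text("hi & bye")
d.raw("<b>bold</b>")
print(d)
```

Capturing into an object and printing it later corresponds to the `doc (&mydoc) { ... }` form. The `doc { ... }` form would write each fragment to stdout as it is produced.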
|
| 94 | Module:
|
| 95 |
|
| 96 |     source $LIB_YSH/doc.ysh
|
| 97 |
|
| 98 |     doc (&d) {
|
| 99 |     }
|
| 100 |     doc {
|
| 101 |     }
|
| 102 |     doc('<p>')
|
| 103 |
|
| 104 | This can have both `__invoke__` and `__call__`.
|
| 105 |
|
| 106 |     var results = d.query('#a')
|
| 107 |
|
| 108 |     # The doc could be __invoke__ ?
|
| 109 |     d query '#a' {
|
| 110 |     }
|
| 111 |
|
| 112 |     doc query (d, '#a') {
|
| 113 |       for result in (results) {
|
| 114 |         echo hi
|
| 115 |       }
|
| 116 |     }
|
| 117 |
|
| 118 |     # we create (old, new) pairs?
|
| 119 |     # this performs an operation like:
|
| 120 |     #   d.outerHTML = outerHTML
|
| 121 |     var d = d.replace(pairs)
|
| 122 |
|
| 123 |
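One way to read the (old, new) pairs idea, sketched in Python: each "old" is a span of offsets into the source text, and replacement builds a new string in a single pass. `replace_spans()` and the `((start, end), new)` shape are assumptions for illustration. The design above does not say what a pair actually contains.

```python
def replace_spans(source, pairs):
    """Return a new string with each (start, end) span replaced.

    pairs is a list of ((start, end), replacement); spans must not overlap.
    """
    out = []
    pos = 0
    for (start, end), new in sorted(pairs):
        if start < pos or end < start:
            raise ValueError(f"overlapping or invalid span ({start}, {end})")
        out.append(source[pos:start])  # untouched text before the span
        out.append(new)
        pos = end
    out.append(source[pos:])  # untouched tail
    return "".join(out)
```

This constructs text rather than mutating a tree, so the original document stays valid while the new one is being built.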
|
| 124 | Safe HTML subset:
|
| 125 |
|
| 126 |     d query (tags= :|a p div h1 h2 h3|) {
|
| 127 |       case (_frag.tag) {
|
| 128 |         a {
|
| 129 |           # get a list of all attributes
|
| 130 |           var attrs = _frag.getAttributes()
|
| 131 |         }
|
| 132 |       }
|
| 133 |     }
|
| 134 |
|
| 135 | If you want to accept user HTML, first run it through an HTML5 -> HTM8 converter.
|
| 136 |
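The tag part of the safe-subset policy could look roughly like this in Python. `TAG_RE`, `ALLOWED`, and `disallowed_tags()` are hypothetical names. The sketch assumes the input was already validated, so a `<` followed by a letter always begins a tag. Attribute policy (for example, which attributes `a` may carry) is left out.

```python
import re

# Matches the name of any start or end tag.  Tag-like text inside a
# comment also matches, so the check errs on the side of rejecting.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9-]*)")

# The same allowlist as in the YSH sketch above.
ALLOWED = frozenset(["a", "p", "div", "h1", "h2", "h3"])


def disallowed_tags(markup, allowed=ALLOWED):
    """Return the set of tag names in markup that are not in the allowlist."""
    found = {m.group(1).lower() for m in TAG_RE.finditer(markup)}
    return found - set(allowed)
```

A publisher would reject the document when the returned set is not empty, rather than trying to repair it.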
|
| 137 | ## More Notes
|
| 138 |
|
| 139 | YSH API
|
| 140 |
|
| 141 | - Generating HTML/HTM8 is much more common than parsing it
|
| 142 | - although maybe we can do RemoveComments as a demo?
|
| 143 | - that is the lowest level "sed" model
|
| 144 |
|
| 145 | - For parsing, a minimum idea is:
|
| 146 | - lexer-based algorithms for query by tag, class name, and id
|
| 147 | - and then toTree() - this is a DOM
|
| 148 | - .tag and .attrs?
|
| 149 | - .innerHTML() and .outerHTML() perhaps
|
| 150 | - rewrite ul-table in that?
|
| 151 | - does that mean you mutate it, or construct text?
|
| 152 | - I think you can set the innerHTML probably
|
| 153 |
|
| 154 | - Testing of html.ysh aka htm8.ysh in the stdlib
|
| 155 |
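The RemoveComments demo mentioned above is about as small as the "sed" model gets. This Python sketch assumes comments are the only construct that begins with `<!--`. Comment-like text inside script or CDATA sections would need the lexer to know about those sections. `COMMENT_RE` and `remove_comments()` are illustrative names only.

```python
import re

# Non-greedy, so each comment ends at its own first "-->".
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def remove_comments(markup):
    """Return markup with every <!-- ... --> comment deleted."""
    return COMMENT_RE.sub("", markup)
```

Nothing here needs a tree or even a full token stream, which is why it makes a good first test for html.ysh.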
|
| 156 | Cases:
|
| 157 |
|
| 158 |     html 'hello <b>world</b>'
|
| 159 |     html "hello <b>$name</b>"html
|
| 160 |     html ["hello <b>$name</b>"]  # hm this isn't bad, it's an unevaluated expression?
|
| 161 |     commonmark 'hello **world**'
|
| 162 |     md 'hello **world**'
|
| 163 |     md ['hello **$escape**']  # ? We don't have a good escaping algorithm
|
| 164 |
|
| 165 | ## Related
|
| 166 |
|
| 167 | - [table-object-doc.html](table-object-doc.html)
|
| 168 | - [htm8.html](htm8.html)
|