---
default_highlighter: oils-sh
in_progress: yes
---

Notes on Unicode in Shell
=========================

<div id="toc">
</div>

## Philosophy

Oils is UTF-8 centric, unlike `bash` and other shells.

That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
Python or JavaScript.  The former languages internally represent strings as
UTF-8, while the latter use arrays of code points or UTF-16 code units.

## A Mental Model

### Program Encoding

Shell **programs** should be encoded in UTF-8 (or its ASCII subset).  Unicode
characters can be encoded directly in the source:

<pre>
echo 'μ'
</pre>

or denoted in ASCII with C-escaped strings:

    echo $'\u03bc'  # bash style

    echo u'\u{3bc}'  # YSH style

(Such strings are preferred over `echo -e` because they're statically parsed.)
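To check which bytes a string literal really contains, you can hex-dump it.
The sketch below is plain POSIX shell rather than anything Oils-specific.
`hex_bytes` is a hypothetical helper name, and the sketch assumes `od` and `tr`
are available.  It shows that the literal `μ` (U+03BC) is the two UTF-8 bytes
`ce bc`, the same bytes that the octal escapes `\316\274` produce.

```shell
# hex_bytes is a hypothetical helper: print the bytes of $1 as lowercase hex.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

literal='μ'                     # UTF-8 written directly in the source
escaped=$(printf '\316\274')    # the same two bytes, spelled in ASCII octal

hex_bytes "$literal"; echo      # prints cebc
hex_bytes "$escaped"; echo      # prints cebc
```

Octal escapes are used here instead of `\u03bc` because they name bytes
directly, so the result doesn't depend on how a given shell expands `\u`.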

### Data Encoding

Strings in OSH are arbitrary sequences of **bytes**, which may or may not be
valid UTF-8.  Details:

- When passed to external programs, strings are truncated at the first `NUL`
  (`'\0'`) byte.  This is a consequence of how Unix and C work.
- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
  to be **valid UTF-8**.  Decoding errors are fatal if
  `shopt -s strict_word_eval` is on.
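The difference between bytes and code points can be made concrete without
relying on the locale.  This is a minimal sketch in POSIX shell, not how OSH
implements `${#s}`.  The helper names `byte_len` and `code_point_len` are made
up for illustration, and `code_point_len` assumes its input is already valid
UTF-8.  It counts code points by deleting continuation bytes (`10xxxxxx`, octal
`200` to `277`) and counting what is left.

```shell
# Number of bytes in $1.
byte_len() {
  echo $(( $(printf '%s' "$1" | wc -c) ))
}

# Number of code points in $1, assuming $1 is valid UTF-8.
# Every code point has exactly one byte that is NOT a continuation byte,
# so delete the continuation bytes (0x80-0xBF) and count the rest.
code_point_len() {
  echo $(( $(printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

s='aμb'
byte_len "$s"          # prints 4: a and b are 1 byte each, μ is 2 bytes
code_point_len "$s"    # prints 3
```

On invalid input the second function still prints a number, which is why a
real implementation has to decide how to report decoding errors.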

## List of Features That Respect Unicode

### OSH / bash

These operations are currently implemented in Python, in `osh/string_ops.py`:

- `${#s}` -- length in code points (buggy in bash)
  - Note: YSH `len(s)` returns a number of bytes, not code points.
- `${s:1:2}` -- offset and length are counted in code points
- `${x#glob?}` and `${x##glob?}` (see below)

More:

- `${foo,}` and `${foo^}` for lowercase / uppercase
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - these can use libc `strcoll()`?
- `printf '%d' \'c` where `c` is an arbitrary character.  This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.
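The `printf` form in the last item is easier to read when wrapped in a
function.  This sketch only uses ASCII input, where the result doesn't depend
on the locale, and `ord` is a hypothetical helper name.  For non-ASCII
characters, whether you get a code point or the value of the first byte
depends on the shell and the locale, so no output is claimed for that case.

```shell
# ord is a hypothetical helper: print the integer value of the single
# character $1, using the leading-quote argument syntax of printf.
ord() {
  printf '%d\n' "'$1"
}

ord A    # prints 65
ord a    # prints 97
```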

#### Globs

Globs have character classes like `[^a]`, and `?` matches a single character.

This pattern results in a `glob()` call:

    echo my?glob

These patterns result in `fnmatch()` calls:

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

    ${s#?}  # remove one character prefix, quadratic loop for globs

This uses our glob-to-ERE translator for *position* info:

    echo ${s/?/x}

#### Regexes (ERE)

Regexes have character classes like `[^a]`, and `.` matches a single
character:

    pat='.'  # single "character"
    [[ $x =~ $pat ]]

#### Locale-aware operations

- The prompt string can include the time, which is formatted according to the
  locale.
- In bash, `printf` can also format the time.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points.  It calculates the **display width** of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \xff ] /`
  - `case (x) { / dot / }`
- `for offset, rune in (runes(mystr))` decodes UTF-8, like Go
- `Str.{trim,trimLeft,trimRight}` respect Unicode space, like JavaScript does
- `Str.{upper,lower}` also need Unicode case folding
- `split()` respects Unicode space?

Not Unicode aware:

- `strcmp()` compares byte-wise.  For valid UTF-8, that gives the same order
  as comparing code points.

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## Implementation Notes

Unlike bash and CPython, Oils doesn't call `setlocale()`.  (Although GNU
readline may call it.)

Oils expects your locale to use UTF-8.  This is true on most distros.  If
not, then some string operations will support UTF-8 and some won't.

For example:

- String length like `${#s}` is implemented in Oils code, not libc, so it will
  always respect UTF-8.
- `[[ $s =~ $pat ]]` is implemented with libc, so it is affected by the locale
  settings.  The same goes for YSH `(x ~ pat)`.

TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
other `X`.
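To see why the locale matters for pattern matching, the sketch below forces
the C locale and asks how many `?` wildcards are needed to match a string.  It
is written for bash or a POSIX `sh`, and it illustrates byte-oriented matching
in general.  It makes no claim about what OSH does under `LANG=C`.  The
function name `glob_len_in_c_locale` is made up for this example.

```shell
# Hypothetical helper: report how many `?` wildcards match $1 when the
# locale is C, where every byte is treated as one character.
# The ( ) body runs in a subshell, so the caller's locale is not changed.
glob_len_in_c_locale() (
  LC_ALL=C
  case $1 in
    ?)  echo 1 ;;
    ??) echo 2 ;;
    *)  echo other ;;
  esac
)

glob_len_in_c_locale 'a'    # prints 1
glob_len_in_c_locale 'μ'    # prints 2: μ is two bytes in UTF-8
```

An operation implemented with its own UTF-8 decoder would report one character
for `μ` regardless of `LC_ALL`, which is the split described in the list above.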

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, which bash probably does too

Our own:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate halves
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)

Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase
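The "encode integer rune to UTF-8 sequence" operation from the list above can
be sketched with POSIX shell arithmetic.  This illustrates the standard UTF-8
bit layout and is not a port of `Utf8Encode()` in `data_lang/j8.py`.  The
function name `utf8_encode` and its error convention (exit status 1 for
surrogate halves and out-of-range values) are assumptions.

```shell
# utf8_encode is a hypothetical helper: write code point $1 (decimal or
# 0x-prefixed hex) to stdout as UTF-8 bytes.
# Returns 1 for surrogate halves (U+D800-U+DFFF) and values above U+10FFFF.
# Note: it sets the global variables cp and byte.
utf8_encode() {
  cp=$(( $1 ))
  if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ]; then
    return 1
  fi
  if [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then
    return 1
  fi

  if [ "$cp" -lt 128 ]; then          # 0xxxxxxx
    set -- "$cp"
  elif [ "$cp" -lt 2048 ]; then       # 110xxxxx 10xxxxxx
    set -- $(( 192 | (cp >> 6) )) $(( 128 | (cp & 63) ))
  elif [ "$cp" -lt 65536 ]; then      # 1110xxxx 10xxxxxx 10xxxxxx
    set -- $(( 224 | (cp >> 12) )) $(( 128 | ((cp >> 6) & 63) )) \
           $(( 128 | (cp & 63) ))
  else                                # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    set -- $(( 240 | (cp >> 18) )) $(( 128 | ((cp >> 12) & 63) )) \
           $(( 128 | ((cp >> 6) & 63) )) $(( 128 | (cp & 63) ))
  fi

  # Emit each byte through an octal escape in the printf format string.
  for byte in "$@"; do
    printf "\\$(printf '%03o' "$byte")"
  done
}

utf8_encode 0x3bc; echo    # prints μ
```

Rejecting surrogate halves in the encoder keeps its output valid UTF-8, which
is the property the JSON item in the list is concerned with.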

## Tips

- The GNU `iconv` program converts text from one encoding to another.
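For example, `iconv` can re-encode Latin-1 text as UTF-8.  Converting from
UTF-8 to UTF-8 is also a common way to check validity, since `iconv` exits with
a non-zero status on malformed input.  The sketch assumes an `iconv` that
accepts the encoding names `ISO-8859-1` and `UTF-8`, as GNU libc's does.
`is_valid_utf8` and `latin1_to_utf8` are hypothetical helper names.

```shell
# Hypothetical helper: succeed if $1 is valid UTF-8, fail otherwise.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: re-encode stdin from Latin-1 (ISO-8859-1) to UTF-8.
latin1_to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8
}

is_valid_utf8 'μ' && echo 'valid'
is_valid_utf8 "$(printf '\377')" || echo 'invalid'   # 0xff never occurs in UTF-8

# In Latin-1 the single byte 0xb5 is the micro sign; in UTF-8 it is c2 b5.
printf '\265' | latin1_to_utf8 | od -An -tx1
```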

<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK?  As long as locale is UTF-8.

-->



<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode?  Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode?  For JSON?
- upper() and lower()?  isupper() is lower()
  - Need to sort these out

-->