1. Steve’s -to-MathML translator

1.1. Copyright notice

1.2. About the implementation

A novel feature of this program is that it uses its own mini-language to define the transformation from math to MathML. The skeptical reader might think that this is an attempt at over-engineering. Actually, the author would rather not employ a mini-language at all, but write the entire program in JavaScript --- but the use of the mini-language is essentially forced upon us, given our requirements. We need the program be usable on the client-side Web browser, which requires that the program be in JavaScript; on the other hand, we would also like the program to be available on the server side as well.

While JavaScript can run on the server-side, JavaScript implementations outside Web browsers are terrible — they are not widely available, and they do not hold a cradle to more standard scripting languages such as Perl or Python.

The other alternative is to write two versions of the transform rules, one for JavaScript, and one for Python, say. This is not pleasant, to say the least. And we should also consider that some people might want a PHP implementation, for example, if that is what they are using to generate Web pages on the server side.

Therefore, we should write a single version of the program in one language, then employ automatic translation of that program into the other computer languages.

A solution that avoids the mini-language is to translate JavaScript directly to the other languages. But this solution takes too much effort to implement.

It is better instead, to start with a minimalist mini-language, designed specifically to tackle the problem of to MathML translation, and translate that to the actual scripting languages. We require that the mini-language be easy to parse — and we achieve this goal by using the S-expression syntax from the Lisp languages (i.e. sub-expressions are wrapped in parentheses, and operations are written in prefix notation).

1.2.1. Why the mini-language is imperative

Although the use of S-expressions makes the mini-language superficially look like Scheme, it is fundamentally very different from Scheme. Our mini-language is a procedural or imperative language, rather than a functional language like Scheme. Its structure closely mirrors that of JavaScript (the first implementations of this program were in JavaScript). A procedural design is used, to make translation from the mini-language to JavaScript and Python simple.

On the other hand, if the mini-language is functional, then it would take a non-trivial amount of effort to translate it to imperative languages (one of the obstacles being to optimize tail recursion). Of course, making the mini-language imperative disfavors the functional languages. But this is less of a problem than disfavoring the imperative languages. Firstly, (pure) functional languages, such as Scheme and Haskell, are not as widely used as scripting languages, compared to the procedural languages such as Perl and Python. (This is no accident, since scripting languages are often used for one-time, small-scale hacks, where the function formalism becomes more of a hindrance.)

And secondly, it is actually slightly more complicated to write the -to-MathML transformation rules in a functional style than in an imperative style (yes, the author has tried it). This is because syntax is somewhat irregular, being intended to be read and written by humans, and most efficiencies of functional programming over imperative programming comes from exploiting the structure of data. Also, it should be noted that a certain naïve way of writing the transformation rules functionally will lead to an inefficient program with an execution time that is quadratic in the number of input tokens. (The author unwittingly fell into this trap on his original implementation.)

The mini-language in our program can be regarded as analogous to the specification formats that drive other data converters, which are usually declarative. However, a declarative specification format falls down in the present application, for the same reasons that functional programming does not well either — syntax is irregular enough that designing a declarative format that captures all its subtleties is hard.

1.2.2. How minimal should the mini-language be

The mini-language has to be minimal, for the obvious reason that small languages take less effort to translate. The less effort required, the better.

Although minimal often implies elegance in design, we are not writing this program in the mini-language for the sake of elegance. We are interested in:

  1. a way to describe algorithms for translating math to MathML; and
  2. working, practical implementations of those algorithms, on several platforms.

For us, the mini-language itself is mostly a distraction that has to be tolerated. (However, we cannot deny that it is a nice bonus that the transformation rules can be specified more concisely in the mini-language than in direct JavaScript.)

Thus, we will not “design” the mini-language as such, but simply tack on features only as they are needed to describe the algorithms. For instance, the mini-language currently has no support for arrays; if we want to write a transformation rule that uses arrays, then we simply add that to the language. And the array operations would be implemented by translating them to array operations in the underlying language (JavaScript, Python, etc.).

This strategy of ad-hoc, unprincipled extension can sometimes be disasterous, but there is little cause for concern in our case. Since the mini-language syntax is based on S-expressions, it can be easily extended without worrying about the parsing details. Also, the problem of -to-MathML conversion is small enough, and fairly restricted, so that we should not need to extend it very much: the input is math markup; the output is MathML markup. Or, as the saying goes: we do not need to expand our program to the point that it can read e-mail.

This strategy saves the author from wasting effort to design yet another extension language. We do need some implementation experience to know what features to include, and what features can be safely omitted — this was provided to the author by his first JavaScript-only implementation of this program.

It hardly needs to be said, but any glue code or interface code (e.g. interacting with the user, or the Web browser) need not, should not, and will not be written in the mini-language.

We will also point out that, because the mini-language is so minimal, sometimes the procedures written in it are forced to be written in a low-level way — looping over conditions, reading and advancing token pointers, etc. — somewhat like programming in Forth or assembly language. This mostly cannot be helped, because syntax is so varied that it is difficult to define higher-level abstractions to parse it.

1.2.3. Summary of the mini-language

As our mini-language is all about “anti-design” — the author refuses to even name the mini-language — naturally it is defined only by its implementations — that is, its translations to other computer languages. But to help the reader understand the use of the mini-language in this document, we briefly describe its syntax here:

Expressions are either variable (or procedure/function) references, or one of the following forms enclosed in parentheses. Expressions never have side effects except for the (call …) form (for the procedure being called may generate side effects).

(call f x1 x2…) call a procedure or function f with arguments x1, x2…, and evaluates to the return or result from f
(not x) logical negation of expression
(or x1 x2…) logical or (disjunction) of two or more expressions
(and x1 x2…) logical and (conjunction) of two or more expressions
(null? x) test for x being null
(not-null? x) test for x not being null
(result-element n x1 x2…) create MathML element (sub-tree) whose name is n, and whose children are x1, x2… in order.
(if t x y) test if t is true; if it is, evaluate expression x, otherwise evaluate expression y
(str-eq? x y) test if two strings are equal
(in? x A) test if x is a key in the associative array A
(get x A) return value which is paired with key x in the associative array A
(lambda x1 x2xn e) define an inline, anonymous function that returns the expression e, with the given arguments x1, x2
(read-token) get token at current pointer
(nil) the null object

The mini-language does not have lists or arrays. For the most common usage of lists in this application, namely, to build up content for MathML elements, direct MathML tree manipulation may be used instead. This turns out to be more efficient, and not unlike another popular transformation language, XSLT.

1.3. character definitions

This section gives the definitions for the “simple” and commands that produce a single mathematical symbol. To translate to MathML, each symbol must be mapped to its Unicode representation.

Punctuation and space
Delimiters
Operator symbols
Relation symbols
Named identifiers
Word operators
Greek letters

1.3.1. How the tables were derived

The majority of the following definitions were derived by taking the lists of characters, given as escapes, in the Short Math Guide for by the American Mathematical Society, and cross-mapping them with the STIX character table, which contains escapes for various Unicode characters used in mathematical expressions.

The lists of characters from the Short Guide to were entered by hand, as the author was not able to find a more official machine-readable list. Each list was entered as simple text files in the tables/ directory, with one escape on each line. For convenience, there is also a copy of the STIX character table in the tables/ directory.

A Python script and Unix shell script (using standard Unix commands like sort, cut, join and awk) together take the AMS- lists of characters and the STIX table as input, and output the mappings of the escapes to their Unicode characters. This output, in the format of HTML tables, can then be pasted in the XHTML source for this document.

The STIX character table was not used alone as the source, because it does not contain all the escapes, and it is not clear whether it categorizes the characters in the way that we want them to be used in the user script. It is better to double-check the character mappings, and add any missing characters, manually.

Any entries added directly to the JavaScript source will be separated out in their own sections, so that when we need to automatically generate the majority of the entries again, we will not clobber over the manual changes.

1.3.2. Ellipsis (dots) characters

\vdots 22ee ⋮
\hdots 2026 …
\ldots 2026 …
\dots 2026 …
\cdots 00B7 ·00B7 ·00B7 ·
\dotsb 00B7 ·00B7 ·00B7 ·
\dotsc 2026 …
\dotsi 22C5 ⋅22C5 ⋅22C5 ⋅
\dotsm 22C5 ⋅22C5 ⋅22C5 ⋅
\dotso 2026 …
\ddots 22F1 ⋱

1.3.3. Punctuation and space

Punctuation and space characters are to be displayed with the mtext element.

These were all added manually.

(var $punct-and-space (table See table below ))
\quad 2003  1 em space
\qquad 2003  2003  2 em space
\thickspace 2002  1⁄2 em space
\; 2002  1⁄2 em space
\medspace 2005  1⁄3 em space
\: 2005  1⁄3 em space
\thinspace 2004  1⁄4 em space
\, 2004  1⁄4 em space
\! 200B ​zero-width space
. .
; ;
? ?
\qedsymbol 25A0 ■

1.3.4. Delimiters

(var $left-delimiters (table Left delimiters))
(var $right-delimiters (table Right delimiters ))
( (
[ [
\{ {
\lgroup (
\lbrace {
\lvert 007c |
\lVert 2016 ‖
\lceil 2308 ⌈
\lfloor 230a ⌊
\lmoustache 23b0 ⎰
\langle 2329 〈
) )
] ]
\} }
\rbrace }
\rgroup )
\rvert 007c |
\rVert 2016 ‖
\rceil 2309 ⌉
\rfloor 230b ⌋
\rmoustache 23b1 ⎱
\rangle 232a 〉

1.3.5. Operator symbols

Operator symbols are to be displayed with the mo element.

(var $operator-symbols (table
  Operator symbols
  More operator symbols
  Extensible vertical arrows
  Big operators
  Big operators without limit style
  Miscellaneous simple symbols
  Other alphabetic symbols
))

1.3.5.1. operator symbols

This listing comes from tables/tex-ops.txt, which in turn comes from the list under “3.7 Binary Operator Symbols”, p. 6, in the AMS Guide.

\amalg2A3F ⨿
\ast002A *
\ast2217 ∗
\barwedge22BC ⊼
\barwedge2305 ⌅
\bigcirc25CB ○
\bigtriangledown25BD ▽
\bigtriangleup25B3 △
\boxdot22A1 ⊡
\boxminus229F ⊟
\boxplus229E ⊞
\boxtimes22A0 ⊠
\bullet2022 •
\bullet2219 ∙
\cap2229 ∩
\Cap22D2 ⋒
\cdot22C5 ⋅
\centerdot00B7 ·
\circ2218 ∘
\circledast229B ⊛
\circledcirc229A ⊚
\circleddash229D ⊝
\cup222A ∪
\Cup22D3 ⋓
\curlyvee22CE ⋎
\curlywedge22CF ⋏
\dagger2020 †
\ddagger2021 ‡
\diamond22C4 ⋄
\div00F7 ÷
\divideontimes22C7 ⋇
\dotplus2214 ∔
\doublebarwedge2306 ⌆
\doublecap22D2 ⋒
\doublecup22D3 ⋓
\gtrdot22D7 ⋗
\intercal22BA ⊺
\land2227 ∧
\leftthreetimes22CB ⋋
\lessdot22D6 ⋖
\lor2228 ∨
\ltimes22C9 ⋉
\mp2213 ∓
\odot2299 ⊙
\ominus2296 ⊖
\oplus2295 ⊕
\oslash2298 ⊘
\otimes2297 ⊗
\pm00B1 ±
\rightthreetimes22CC ⋌
\rtimes22CA ⋊
\setminus2216 ∖
\smallsetminus2216 ∖
\sqcap2293 ⊓
\sqcup2294 ⊔
\star22C6 ⋆
\times00D7 ×
\triangleleft25C1 ◁
\triangleright25B7 ▷
\uplus228E ⊎
\vee2228 ∨
\veebar22BB ⊻
\veebar2A61 ⩡
\wedge2227 ∧
\wr2240 ≀

1.3.5.2. Operator characters

++
-2212 −
**
, ,
/ 2215 ∕
: :
\colon :
| |
\vert 007c |
\Vert 2016 ‖
\| 2016 ‖
\backslash \
' 2032 ′
\# #
\bmod mod
\mod mod

1.3.5.3. Big operators

This listing comes from tables/tex-bigops.txt, which in turn comes from the list under “3.11 Cumulative (variable-size) operators”, p. 7, in the AMS Guide.

\bigcap22C2 ⋂
\bigcup22C3 ⋃
\bigodot2A00 ⨀
\bigoplus2A01 ⨁
\bigotimes2A02 ⨂
\bigsqcup2A06 ⨆
\biguplus2A04 ⨄
\bigvee22C1 ⋁
\bigwedge22C0 ⋀
\coprod2210 ∐
\prod220F ∏
\sum2211 ∑

The integral symbol is special from the others, in that even in display style, subscripts and superscripts are not in the limit style (for the obvious reason that it would not look good).

\int222B ∫
\smallint222B ∫
\oint222E ∮

This listing comes from tables/tex-ext-arrows.txt, which in turn comes from the list under “3.14 Extensible vertical arrows”, p. 8, in the AMS Guide.

\downarrow2193 ↓
\Downarrow21D3 ⇓
\uparrow2191 ↑
\Uparrow21D1 ⇑
\updownarrow2195 ↕
\Updownarrow21D5 ⇕

1.3.5.5. Miscellaneous simple symbols

This listing comes from tables/tex-miscsym.txt, which in turn comes from the list under “3.6 Miscellaneous simple symbols”, p. 5, in the AMS Guide.

\angle2220 ∠
\backprime2035 ‵
\bigstar2605 ★
\blacklozenge29EB ⧫
\blacksquare25A0 ■
\blacksquare25AA ▪
\blacktriangle25B4 ▴
\blacktriangledown25BE ▾
\bot22A5 ⊥
\clubsuit2663 ♣
\diagdown2572 ╲
\diagup2571 ╱
\diamondsuit2662 ♢
\emptyset2205 ∅
\exists2203 ∃
\flat266D ♭
\forall2200 ∀
\heartsuit2661 ♡
\infty221E ∞
\lnot00AC ¬
\lozenge25CA ◊
\measuredangle2221 ∡
\nabla2207 ∇
\natural266E ♮
\neg00AC ¬
\nexists2204 ∄
\prime2032 ′
\sharp266F ♯
\spadesuit2660 ♠
\sphericalangle2222 ∢
\square25A1 □
\surd221A √
\top22A4 ⊤
\triangle25B5 ▵
\triangledown25BF ▿
\varnothing2205 ∅

1.3.5.6. Other alphabetic symbols

\aleph2135 ℵ
\Bbbk1D55C 𝕜
\beth2136 ℶ
\circledS24C8 Ⓢ
\complement2201 ∁
\daleth2138 ℸ
\ell2113 ℓ
\eth00F0 ð
\Finv2132 Ⅎ
\Game2141 ⅁
\gimel2137 ℷ
\hbar210F ℏ
\hslash210F ℏ
\Im2111 ℑ
\mho2127 ℧
\partial2202 ∂
\Re211C ℜ
\wp2118 ℘

1.3.6. Relation symbols

(var $relation-symbols (table
  ("=" "=")
  ("<" "<")
  (">" ">")

  Comparison symbols
  Arrows
  Miscellaneous
))

1.3.6.1. Comparison symbols

This listing comes from tables/tex-relsym.txt, which in turn comes from the list under “3.8 Relation Symbols”, p. 6, in the AMS Guide.

Equality, inequality, etc.

\approx2248 ≈
\approxeq224A ≊
\asymp224D ≍
\backsim223D ∽
\backsimeq22CD ⋍
\bumpeq224F ≏
\Bumpeq224E ≎
\circeq2257 ≗
\cong2245 ≅
\curlyeqprec22DE ⋞
\curlyeqsucc22DF ⋟
\doteq2250 ≐
\doteqdot2251 ≑
\eqcirc2256 ≖
\eqsim2242 ≂
\eqslantgtr2A96 ⪖
\eqslantless2A95 ⪕
\equiv2261 ≡
\fallingdotseq2252 ≒
\ge2265 ≥
\geq2265 ≥
\geqq2267 ≧
\geqslant2A7E ⩾
\gg226B ≫
\gg2AA2 ⪢
\ggg22D9 ⋙
\gggtr22D9 ⋙
\gnapprox2A8A ⪊
\gneq2A88 ⪈
\gneqq2269 ≩
\gnsim22E7 ⋧
\gtrapprox2A86 ⪆
\gtreqless22DB ⋛
\gtreqqless2A8C ⪌
\gtrless2277 ≷
\gtrsim2273 ≳
\gvertneqq2269 ≩
\le2264 ≤
\leq2264 ≤
\leqq2266 ≦
\leqslant2A7D ⩽
\lessapprox2A85 ⪅
\lesseqgtr22DA ⋚
\lesseqqgtr2A8B ⪋
\lessgtr2276 ≶
\lesssim2272 ≲
\ll226A ≪
\llless22D8 ⋘
\lnapprox2A89 ⪉
\lneq2A87 ⪇
\lneqq2268 ≨
\lnsim22E6 ⋦
\lvertneqq2268 ≨
\ncong2247 ≇
\ne2260 ≠
\neq2260 ≠
\ngeq2271 ≱
\ngeqq2267 ≧
\ngeqslant2A7E ⩾
\ngtr226F ≯
\nleq2270 ≰
\nleqq2266 ≦
\nleqslant2A7D ⩽
\nless226E ≮
\nprec2280 ⊀
\npreceq2AAF ⪯
\nsim2241 ≁
\nsucc2281 ⊁
\nsucceq2AB0 ⪰
\prec227A ≺
\precapprox2AB7 ⪷
\preccurlyeq227C ≼
\preceq2AAF ⪯
\precnapprox2AB9 ⪹
\precneqq2AB5 ⪵
\precnsim22E8 ⋨
\precsim227E ≾
\risingdotseq2253 ≓
\sim223C ∼
\simeq2243 ≃
\succ227B ≻
\succapprox2AB8 ⪸
\succcurlyeq227D ≽
\succeq2AB0 ⪰
\succnapprox2ABA ⪺
\succneqq2AB6 ⪶
\succnsim22E9 ⋩
\succsim227F ≿
\thickapprox2248 ≈
\thicksim223C ∼
\triangleq225C ≜

1.3.6.2. Miscellaneous

This listing comes from tables/tex-relmisc.txt, which in turn comes from the list under “3.10 Relation Symbols: Miscellaneous”, p. 7, in the AMS Guide.

\backepsilon03F6 ϶
\because2235 ∵
\between226C ≬
\blacktriangleleft25C0 ◀
\blacktriangleright25B6 ▶
\bowtie22C8 ⋈
\dashv22A3 ⊣
\frown2323 ⌣
\in220A ∊
\mid2223 ∣
\models22A7 ⊧
\ni220B ∋
\ni220D ∍
\nmid2224 ∤
\notin2209 ∉
\nparallel2226 ∦
\nshortmid2224 ∤
\nshortparallel2226 ∦
\nsubseteq2286 ⊆
\nsubseteq2288 ⊈
\nsubseteqq2AC5 ⫅
\nsupseteq2287 ⊇
\nsupseteq2289 ⊉
\nsupseteqq2AC6 ⫆
\ntriangleleft22EA ⋪
\ntrianglelefteq22EC ⋬
\ntriangleright22EB ⋫
\ntrianglerighteq22ED ⋭
\nvdash22AC ⊬
\nvDash22AD ⊭
\nVdash22AE ⊮
\nVDash22AF ⊯
\owns220D ∍
\parallel2225 ∥
\perp22A5 ⊥
\pitchfork22D4 ⋔
\propto221D ∝
\shortmid2223 ∣
\shortparallel2225 ∥
\smallfrown2322 ⌢
\smallsmile2323 ⌣
\smile2323 ⌣
\sqsubset228F ⊏
\sqsubseteq2291 ⊑
\sqsupset2290 ⊐
\sqsupseteq2292 ⊒
\subset2282 ⊂
\Subset22D0 ⋐
\subseteq2286 ⊆
\subseteqq2AC5 ⫅
\subsetneq228A ⊊
\subsetneqq2ACB ⫋
\supset2283 ⊃
\Supset22D1 ⋑
\supseteq2287 ⊇
\supseteqq2AC6 ⫆
\supsetneq228B ⊋
\supsetneqq2ACC ⫌
\therefore2234 ∴
\trianglelefteq22B4 ⊴
\trianglerighteq22B5 ⊵
\varpropto221D ∝
\varsubsetneq228A ⊊
\varsubsetneqq2ACB ⫋
\varsupsetneq228B ⊋
\varsupsetneqq2ACC ⫌
\vartriangle25B5 ▵
\vartriangleleft22B2 ⊲
\vartriangleright22B3 ⊳
\vdash22A2 ⊢
\vDash22A8 ⊨
\Vdash22A9 ⊩
\Vvdash22AA ⊪

1.3.6.3. Arrows

This listing comes from tables/tex-arrows.txt, which in turn comes from the list under “3.9 Relation Symbols: Arrows”, p. 7, in the AMS Guide.

\curvearrowleft21B6 ↶
\curvearrowright21B7 ↷
\downdownarrows21CA ⇊
\downharpoonleft21C3 ⇃
\downharpoonright21C2 ⇂
\gets2190 ←
\hookleftarrow21A9 ↩
\hookrightarrow21AA ↪
\leftarrow2190 ←
\Leftarrow21D0 ⇐
\leftarrowtail21A2 ↢
\leftharpoondown21BD ↽
\leftharpoonup21BC ↼
\leftleftarrows21C7 ⇇
\leftrightarrow2194 ↔
\leftrightarrows21C6 ⇆
\leftrightharpoons21CB ⇋
\leftrightsquigarrow21AD ↭
\Lleftarrow21DA ⇚
\longleftarrow27F5 ⟵
\Longleftarrow27F8 ⟸
\longleftrightarrow27F7 ⟷
\Longleftrightarrow27FA ⟺
\looparrowleft21AB ↫
\looparrowright21AC ↬
\Lsh21B0 ↰
\mapsto21A6 ↦
\multimap22B8 ⊸
\nearrow2197 ↗
\nleftarrow219A ↚
\nLeftarrow21CD ⇍
\nleftrightarrow21AE ↮
\nLeftrightarrow21CE ⇎
\nrightarrow219B ↛
\nRightarrow21CF ⇏
\nwarrow2196 ↖
\restriction21BE ↾
\rightarrow2192 →
\Rightarrow21D2 ⇒
\rightarrowtail21A3 ↣
\rightharpoondown21C1 ⇁
\rightharpoonup21C0 ⇀
\rightleftarrows21C4 ⇄
\rightleftharpoons21CC ⇌
\rightrightarrows21C9 ⇉
\rightsquigarrow219D ↝
\Rrightarrow21DB ⇛
\Rsh21B1 ↱
\searrow2198 ↘
\swarrow2199 ↙
\to2192 →
\twoheadleftarrow219E ↞
\twoheadrightarrow21A0 ↠
\upharpoonleft21BF ↿
\upharpoonright21BE ↾
\upuparrows21C8 ⇈

1.3.7. Named identifiers

Named identifiers include all Greek letters, named functions and operators, and other alphabetic identifiers that are not single Roman letters.

These are to be displayed using the mi element.

(var $named-identifiers (table
  Word operators
  Big word operators
  Greek letters
  Roman letters
  Ellipsis characters
))

1.3.7.1. Word operators

(var $word-operators (table Word operators ))
(var $big-word-operators (table Big word operators))

This listing comes from “3.17 Named operators”, p. 8, in the AMS Guide.

These commands represent common operators and functions written with several letters (e.g. the sine function). Obviously, the STIX table generally contains only single-character mappings and not these commands, so we must type enter all of them in manually. The mapping is trivial.

\arccos arccos
\arcsin arcsin
\arctan arctan
\arg arg
\cos cos
\cosh cosh
\cot cot
\coth coth
\csc csc
\deg deg
\det det
\dim dim
\exp exp
\gcd gcd
\hom hom
\ker ker
\lg lg
\ln ln
\log log
\Pr Pr
\sec sec
\sin sin
\sinh sinh
\tan tan
\tanh tanh
\inf inf
\injlim inj lim
\lim lim
\liminf lim inf
\limsup lum sup
\max max
\min min
\projlim proj lim
\sup sup

1.3.7.2. Greek letters

(var $greek-letters (table See table below))
\alpha03B1 α
\beta03B2 β
\chi03C7 χ
\delta03B4 δ
\Delta0394 Δ
\digamma03DD ϝ
\epsilon03F5 ϵ
\eta03B7 η
\gamma03B3 γ
\Gamma0393 Γ
\iota03B9 ι
\kappa03BA κ
\lambda03BB λ
\Lambda039B Λ
\mu03BC μ
\nu03BD ν
\omega03C9 ω
\Omega03A9 Ω
\phi03C6 φ
\Phi03A6 Φ
\pi03C0 π
\Pi03A0 Π
\psi03C8 ψ
\Psi03A8 Ψ
\rho03C1 ρ
\sigma03C3 σ
\Sigma03A3 Σ
\tau03C4 τ
\theta03B8 θ
\Theta0398 Θ
\upsilon03C5 υ
\Upsilon03D2 ϒ
\varepsilon03B5 ε
\varkappa03F0 ϰ
\varphi03D5 ϕ
\varpi03D6 ϖ
\varrho03F1 ϱ
\varsigma03C2 ς
\vartheta03D1 ϑ
\xi03BE ξ
\Xi039E Ξ
\zeta03B6 ζ

1.3.7.3. Roman letters

Although the Roman letters obviously map to themselves, we still list them explicitly, so that we do not have to add a special case (check for Roman letters) to the transformation rules.

aa
bb
cc
dd
ee
ff
gg
hh
ii
jj
kk
ll
mm
nn
oo
pp
qq
rr
ss
tt
uu
vv
ww
xx
yy
zz
AA
BB
CC
DD
EE
FF
GG
HH
II
JJ
KK
LL
MM
NN
OO
PP
QQ
RR
SS
TT
UU
VV
WW
XX
YY
ZZ

1.4. Complex commands

By a complex command, we mean a TeX command that is not simple; that is, the TeX command does more than just output a character or string of characters.

Complex commands include any sort of TeX command that takes arguments.

Fractions
Binomial coefficients
Square roots and radicals
Parenthesized mod
User-named functions and operators
Setting display style
Setting display style: displaymath
Changing math font styles
Changing math fonts (old-style commands)
Changing math font sizes
Accents on characters
Matrices
Array environment
Matrices: mtable subroutine
Under- and over-decorations
Matching sizes of delimiters
Matching sizes of delimiters: get delimiter subroutine

 blocks
 blocks: end

Combining operators
Character escapes

Embedded text

List of all commands
List of all  blocks

1.4.1. List of all commands

(var $tex-commands (table
  ("\\frac"         fraction-to-mathml )
  ("\\dfrac"        fraction-to-mathml )
  ("\\tfrac"        fraction-to-mathml )
  ("\\binom"        binom-to-mathml )
  ("\\sqrt"         sqrt-to-mathml )
  ("\\operatorname" operatorname-to-mathml )
  ("\\displaystyle" displaystyle-to-mathml )

  Parenthesized mod
  Changing math font styles
  Changing math fonts (old-style commands)
  Changing font sizes

  Accents on characters
  Under- and over- decorations
  Combining operators
  Matching sizes of delimiters

  ("\\char"     char-escape-to-mathml)
  ("\\!"        (lambda () (nil)))

  Embedded text commands

  ("\\begin"    latex-block-to-mathml)
))

1.4.2. LaTeX environments

A LaTeX environment is a special form of commands, where the logical argument is enclosed in between two tags \begin{name} and \end{name}, instead of the braces { and }.

A LaTeX environment is also called a LaTeX block (a more intuitive term).

The following is a list of all the LaTeX blocks we can parse.

(var $tex-environments (table
  ("smallmatrix"  (lambda () (call matrix-to-mathml "("      ")"     ) ))
  ("pmatrix"  (lambda () (call matrix-to-mathml "("      ")"     ) ))
  ("bmatrix"  (lambda () (call matrix-to-mathml "["      "]"     ) ))
  ("Bmatrix"  (lambda () (call matrix-to-mathml "{"      "}"     ) ))
  ("vmatrix"  (lambda () (call matrix-to-mathml "\u007c" "\u007c") )) 
  ("Vmatrix"  (lambda () (call matrix-to-mathml "\u2016" "\u2016") ))
  ("cases"    (lambda () (call matrix-to-mathml "{"      (nil)   ) ))

  ("array"    array-to-mathml  )

  ("displaymath" displaymath-to-mathml)
))

1.4.2.1. \begin dispatch

The real command name in \begin{name} is name, so we need the following function to extract this name, and dispatch to the procedure that handles that command specifically.

(procedure (latex-block-to-mathml)
  (set cmd (read-token))
  
  (cond ((in? cmd $tex-environments)
         (advance-token)
         (return (call (get cmd $tex-environments))))
        (else 
         (throw "unknown command"))))

1.4.2.2. \end dispatch

This procedure should be called when the \end{name} tag of a LaTeX block is encountered. It performs error-checking, and prepares for the parse after the \end{name} tag.

(procedure (finish-latex-block)
  (cond ((null? (read-token))
         (throw "unexpected eof")))
  (advance-token)
  (advance-token))

1.4.3. Fractions and binomial coefficients

(procedure (fraction-to-mathml)
  (var numerator   (call piece-to-mathml))
  (var denominator (call piece-to-mathml))
  (return (result-element "mfrac" (numerator denominator))))

The common “choose” notation for the binomial coefficients are essentially fractions with added parentheses but without the fraction bar, and is encoded that way in presentational MathML.

(procedure (binom-to-mathml)
  (var top    (call piece-to-mathml))
  (var bottom (call piece-to-mathml))
  (return (result-element "mrow"
    ((result-element "mo"    ( "("        ))
     (result-element "mfrac" (("linethickness" "0")) 
                             ( top bottom ))
     (result-element "mo"    ( ")"        ))))))

1.4.4. Square roots and radicals

A radical in TeX is written \sqrt[i]{x}, where i is the index of the radical, and x is the expression that appears under the radical.

The index part is optional; if it is omitted, then the radical is a square root and is translated to the MathML sqrt element. Otherwise, the MathML mroot element is used for the radical with index.

(procedure (sqrt-to-mathml)
  (var index  (call optional-arg-to-mathml))
  (var object (call piece-to-mathml))
  (cond ((not-null? index)
         (return (result-element "mroot" (object index))))
        (else
         (return (result-element "msqrt" (object))))))

1.4.5. Parenthesized mod

("\\pod"  (lambda () (call parenthesized-operator (nil))))
("\\pmod" (lambda () (call parenthesized-operator "mod")))
(procedure (parenthesized-operator word)
  (var object (call piece-to-mathml))
  (cond ((not-null? word)
         (return (result-element "mrow" 
                    ((result-element "mo" ("("))
                     (result-element "mo" (word))
                     object
                     (result-element "mo" (")"))))))
        (else
          (return (result-element "mrow" 
                    ((result-element "mo" ("("))
                     object
                     (result-element "mo" (")"))))))))

1.4.6. User-named functions and operators

Note that, by a special case in the tokenization, \operatorname always receives its argument as one token.

(procedure (operatorname-to-mathml)
  (var result (result-element "mo" ((read-token))))
  (advance-token)
  (return result))

1.4.7. Setting display style

When the \displaystyle command is encountered, everything following it should be typeset as a math “display”.

In MathML, presentational aspects such as the “display mode” and script size are controlled by attributes on the container element. However, in Mozilla, some of these attributes do not take effect unless they are applied to the MathML mstyle. This is a bug, but the work-around is simple: wrap the content around a mstyle element. The only disadvantage is that the output MathML becomes slightly more bloated.

(procedure (displaystyle-to-mathml)
  (var result (call subexpr-chain-to-mathml $hard-stop-tokens))
  (return (result-element "mstyle" (("displaystyle" "true")
                                    ("scriptlevel" "0")) (result))))

For reference, here is the procedure to generate a display without the mstyle work-around:

(procedure (displaystyle-to-mathml)
  (var result (call subexpr-chain-to-mathml $hard-stop-tokens))
  (set-attr result "displaystyle" "true")
  (set-attr result "scriptlevel"  "0")
  (return result)) 

Strictly speaking, the displaymath environment is not an environment to be used inside math mode. Rather, it activates math mode when used in text mode, and typesets the formula inside it as a “display”. However, this environment appears often in the markup generated by the LaTeX2HTML translator, and it is trivially supported.

(procedure (displaymath-to-mathml)
  (var result (call subexpr-chain-to-mathml $hard-stop-tokens))
  (call finish-latex-block)  
  (return (result-element "mstyle" (("displaystyle" "true")
                                    ("scriptlevel" "0")) (result))))

As with the displaystyle-to-mathml procedure, the version that outputs standard MathML without the mstyle work-around is as follows:

(procedure (displaymath-to-mathml)
  (var result (call subexpr-chain-to-mathml $hard-stop-tokens))
  (set-attr result "displaystyle" "true")
  (set-attr result "scriptlevel"  "0")
  (call finish-latex-block)  
  (return result))

1.4.8. Changing math font styles

  ("\\boldsymbol" (lambda () (call font-to-mathml "bold")) )
  ("\\bold"      (lambda () (call font-to-mathml "bold")) )
  ("\\Bbb"       (lambda () (call font-to-mathml "double-struck")) )
  ("\\mathbb"    (lambda () (call font-to-mathml "double-struck")) )
  ("\\mathbbmss" (lambda () (call font-to-mathml "double-struck")) )
  ("\\mathbf"    (lambda () (call font-to-mathml "bold")) )
  ("\\mathop"    (lambda () (call font-to-mathml "normal")) )
  ("\\mathrm"    (lambda () (call font-to-mathml "normal")) )
  ("\\mathfrak"  (lambda () (call font-to-mathml "fraktur")) )
  ("\\mathit"    (lambda () (call font-to-mathml "italic")) )
  ("\\mathscr"   (lambda () (call font-to-mathml "script")) )
  ("\\mathcal"   (lambda () (call font-to-mathml "script")) )
  ("\\mathsf"    (lambda () (call font-to-mathml "sans-serif")) )
  ("\\mathtt"    (lambda () (call font-to-mathml "monospace")) )
  ("\\EuScript"  (lambda () (call font-to-mathml "script")) )
(procedure (font-to-mathml font-name)
  (cond ((str-neq? (read-token) "{")
         (var result (result-element "mi" (("mathvariant" font-name))
                                          ((read-token))))
         (cond ((str-eq? font-name "normal")
                (set-attr result "fontstyle" "normal")))
         (advance-token)
         (return result))

        (else
         (var result (call piece-to-mathml))
         (set-attr result "mathvariant" font-name)
         (cond ((str-eq? font-name "normal")
                (set-attr result "fontstyle" "normal")))
         (return result))))
        

1.4.9. Deprecated font commands

  ("\\bf"        (lambda () (call old-font-to-mathml "bold")) )
  ("\\rm"        (lambda () (call old-font-to-mathml "normal")) )
(procedure (old-font-to-mathml font-name)
  (return 
    (result-element "mstyle" 
        (("mathvariant" font-name)
         ("fontstyle" (if (str-eq? font-name "normal") "normal" (nil))))
        ((call subexpr-chain-to-mathml $hard-stop-tokens)))))

1.4.10. Changing math font sizes

  ("\\big"       (lambda () (call size-to-mathml "2" "2")))
  ("\\Big"       (lambda () (call size-to-mathml "3" "3")))
  ("\\bigg"      (lambda () (call size-to-mathml "4" "4")))
  ("\\Bigg"      (lambda () (call size-to-mathml "5" "5")))
(procedure (size-to-mathml min-size max-size)
  (var result (call piece-to-mathml))
  (set-attr result "minsize" min-size)
  (set-attr result "maxsize" max-size)
  (return result))

1.4.11. Matrices

In MathML, the inside of the matrix or array is encoded as an mtable element. The borders, parentheses, or brackets are encoded separately.

The procedure matrix-to-mtable converts the inside of a matrix into a mtable. It works in this straightforward manner:

The procedure starts the mtable with one row (mtr) containing one cell (mtd), which become the current row and current cell. As we parse TeX content, we add that content (translated to MathML) to the current cell. If & is encountered, indicating that we should break the cell, then we create a new cell for the current row, and the new cell becomes the current cell. If \\ is encountered, indicating a line break, then we insert a new row in the table, with a new cell, and the new row and the new cell become the current ones.

The procedure should be passed an empty mtable result element. It does not create one itself, because the caller may want to add some attributes to the mtable before processing the table contents.

(procedure (matrix-to-mtable mtable)
  (var mtr    (result-element "mtr"))
  (var mtd    (result-element "mtd"))
  (var token  (read-token))

  (append mtable mtr)
  (append mtr    mtd)

  (while (and (not-null? token) (str-neq? token "\\end"))
    (cond ((str-eq? token "\\\\")
           (set mtr (result-element "mtr"))
           (set mtd (result-element "mtd"))
           (append mtable mtr)
           (append mtr mtd)
           (advance-token))

          ((str-eq? token "&")
           (set mtd (result-element "mtd"))
           (append mtr mtd)
           (advance-token))

          (else
           (append mtd (call subexpr-chain-to-mathml $hard-stop-tokens))))

    (set token (read-token)))

  (call finish-latex-block)
  (return mtable))

The following procedure, matrix-to-mathml, handles parsing of matrix blocks. It adds the borders to the mtable obtained from the procedure matrix-to-mtable, with a mrow wrapper around the whole thing.

The border delimiters for matrix-to-mathml are specified by parameters. See block commands list for the actual parameters for each matrix or array command.

(procedure (matrix-to-mathml open-delim close-delim)
  (var mtable (call matrix-to-mtable (result-element "mtable")))

  (cond ((or (not-null? open-delim) (not-null? close-delim))
         (var mrow (result-element "mrow"))

         (cond ((not-null? open-delim)
                (append mrow (result-element "mo" (open-delim)))))
         (append mrow mtable)
         (cond ((not-null? close-delim)
                (append mrow (result-element "mo" (close-delim)))))

         (return mrow))

        (else
         (return mtable))))

The following procedure, array-to-mathml, converts array environments. As in , no delimiters are put around the array.

(procedure (array-to-mathml)

  (var mtable (result-element "mtable"))

  

Process the specifications for the column alignment in cells

(cond ((str-eq? (read-token) "{") (advance-token) (while (and (not-null? (read-token)) (str-neq? (read-token) "}")) (cond ((str-eq? (read-token) "c") (append-attr mtable "columnalign" "center ")) ((str-eq? (read-token) "l") (append-attr mtable "columnalign" "left ")) ((str-eq? (read-token) "r") (append-attr mtable "columnalign" "right "))) (advance-token)) (cond ((not-null? (read-token)) (advance-token)))))

Process the table itself

(return (call matrix-to-mtable mtable)))

1.4.12. Accents

  ("\\acute"     (lambda () (call accent-to-mathml "\u0301")) )
  ("\\grave"     (lambda () (call accent-to-mathml "\u0300")) )
  ("\\tilde"     (lambda () (call accent-to-mathml "\u0303")) )
  ("\\bar"       (lambda () (call accent-to-mathml "\u0304")) )
  ("\\breve"     (lambda () (call accent-to-mathml "\u0306")) )
  ("\\check"     (lambda () (call accent-to-mathml "\u030c")) )
  ("\\hat"       (lambda () (call accent-to-mathml "\u0302")) )
  ("\\vec"       (lambda () (call accent-to-mathml "\u20d7")) )
  ("\\dot"       (lambda () (call accent-to-mathml "\u0307")) )
  ("\\ddot"      (lambda () (call accent-to-mathml "\u0308")) )
  ("\\dddot"     (lambda () (call accent-to-mathml "\u20db")) )

Accents are displayed in MathML by placing the corresponding Unicode character in an mover element, over the base letter, and with the accent attribute set to true.

(procedure (accent-to-mathml char)
  (return (result-element "mover" (("accent" "true"))
                                  ((call piece-to-mathml) 
                                   (result-element "mo" (char))))))

1.4.13. Under- and over- decorations

  ("\\underbrace" (lambda () (call under-to-mathml "\ufe38")) )
  ("\\overbrace"  (lambda () (call over-to-mathml  "\ufe37")) )
  ("\\underline"  (lambda () (call under-to-mathml "\u0332")) )
  ("\\overline"   (lambda () (call over-to-mathml  "\u00af")) )
  ("\\widetilde"  (lambda () (call over-to-mathml  "\u0303")) )
  ("\\widehat"    (lambda () (call over-to-mathml  "\u0302")) )
(procedure (over-to-mathml char)
  (return (result-element "mover" ((call piece-to-mathml) 
                                   (result-element "mo" (char))))))

(procedure (under-to-mathml char)
  (return (result-element "munder" ((call piece-to-mathml) 
                                    (result-element "mo" (char))))))

1.4.14. Combining operators

Combining operators refer to commands such as \not which draws a slash through the next operator. In general, it may refer to any operator that modifies the next one by superimposing some other mark over the glyph for the next operator.

The term “combining” is from the Unicode standards; a “combining character” (or “combining diacritic”) modifies the base character in some way. The only difference is that in Unicode, the combining character or diacritic comes after the base character rather than before it.

  ("\\not"      (lambda () (call combining-to-mathml 0338 ̸)))
(procedure (combining-to-mathml char)
  (var base (read-token))
  (advance-token)
  (return
    (result-element "mo" (base char))))

1.4.15. Matching sizes of delimiters

  ("\\left"     (lambda () (call delimiter-to-mathml "\\right" "1" (nil))) )
  ("\\bigl"     (lambda () (call delimiter-to-mathml "\\bigr"  "2" "2" )) )
  ("\\Bigl"     (lambda () (call delimiter-to-mathml "\\Bigr"  "3" "3" )) )
  ("\\biggl"    (lambda () (call delimiter-to-mathml "\\biggr" "4" "4" )) )
  ("\\Biggl"    (lambda () (call delimiter-to-mathml "\\Biggr" "5" "5" )) )

The commands for matching sizes of delimiters are unusual in that they do not take arguments enclosed between { and }, but the character that occurs right after the command. Also the command for the left delimiter must be matched by the command for the right delimiter. So we must know to stop parsing when we encounter the command for the right delimiter.

(procedure (delimiter-to-mathml end-command min-size max-size)
  (var mrow (result-element "mrow"))
  (append mrow (result-element "mo" (("minsize" min-size)
                                     ("maxsize" max-size))
                                    ((call read-delimiter))))

  (append mrow (call subexpr-chain-to-mathml $hard-stop-tokens))

  (cond ((str-neq? (read-token) end-command)
         (return mrow)))
  (advance-token)

  (append mrow (result-element "mo" (("minsize" min-size)
                                     ("maxsize" max-size))
                                    ((call read-delimiter))))

  (return mrow))

This subroutine reads the next token, checks that it is a valid delimiter, and handles the special cases:

\left.produces a blank instead of a dot
\right.
\left<produces the left-pointing angle bracket instead of a less-than sign
\right<
\right>produces the right-pointing bracket instead of a greater-than sign
\left>
(procedure (read-delimiter)
  (var token (read-token))
  (cond ((null? token)
         (throw "unexpected eof"))
        ((str-eq? token ".")
         (advance-token)
         (return ""))
        ((str-eq? token "<")
         (advance-token)
         (return "\u2329"))
        ((str-eq? token ">")
         (advance-token)
         (return "\u232a"))
        ((in? token $punct-and-space)
         (advance-token)
         (return (get token $punct-and-space)))
        ((in? token $left-delimiters)
         (advance-token)
         (return (get token $left-delimiters)))
        ((in? token $right-delimiters)
         (advance-token)
         (return (get token $right-delimiters)))
        ((in? token $operator-symbols)
         (advance-token)
         (return (get token $operator-symbols)))
        (else
         (throw "invalid delimiter"))))

1.4.16. Character escapes

This is weird stuff that comes out from LaTeX2HTML. Normally the TeX typist should not enter character codes for any character.

(var $char-escape-codes (table 
  ("93" "#")))

(procedure (char-escape-to-mathml)
  (var result (nil))
  (cond ((in? (read-token) $char-escape-codes)
         (set result (result-element "mtext" 
                        ((get (read-token) $char-escape-codes)))))
        (else
         (set result (result-element "merror" ("\\char" (read-token))))))

  (advance-token)
  (return result))

1.4.17. Embedded text

  ("\\text"       text-to-mathml )
  ("\\textnormal" text-to-mathml )
  ("\\textrm"     text-to-mathml )
  ("\\textsl"     text-to-mathml )
  ("\\textit"     text-to-mathml )
  ("\\texttt"     text-to-mathml )
  ("\\textbf"     text-to-mathml )
  ("\\hbox"       text-to-mathml )
  ("\\mbox"       text-to-mathml )

The embedded text is not split into words or tokens; it is split only when it is interrupted by an inline math formula. Each segment of text goes in a mtext element. Embedded inline math formulas in the embedded text are signaled, as in , by $ tokens.

When $ is encountered, normal math parsing begins again, until interrupted by another $ (at the same nesting level). The parsing knows when to stop because $ is considered a “hard stop token”.

(procedure (text-to-mathml)
  (cond ((str-neq? (read-token) "{")
         (var result (result-element "mtext" ((read-token))))
         (advance-token)
         (return result)))

  (advance-token)
    
  (var result (nil))
  (var mrow (nil))
  (var node (nil))

  (while (not-null? (read-token))
    (cond ((str-eq? (read-token) "}")
           (advance-token)
           (return result))

          ((str-eq? (read-token) "$")
           (advance-token)
           (set node (call subexpr-chain-to-mathml $hard-stop-tokens))
           (advance-token))

          (else 
           (set node (result-element "mtext" ((read-token))))
           (advance-token)))

    Collect results)

  (return result))

1.5. Splitting the input into tokens

At the most fundamental level, markup is parsed by reading the markup one character at a time, and taking actions based on what that character is. This is the approach taken by the typesetting system itself. So, for example, the string typeset means, in , to construct the glyphs for each character ( t y p e s e t ), and then paste them together to form (part of) a paragraph.

We do not take this approach in our parser, because, in most scripting languages, and in XML, processing text by looping through each individual character is inefficient. Also, some character sequences have to be processed together anyway — these are called “tokens”. Examples of tokens are: \catcode and 3.14 (as opposed to interpreting them as \ c a t c o d e or 3 . 1 4 ) .

A well-understood and widely implemented specification for matching character sequences is that of regular expressions. Our program considers the markup that is input as a stream of characters; the head of the character stream is matched against a regular expression. The matching characters are grouped into one token (or sometimes a few tokens), and then those characters are consumed (removed from the head of the input character stream). Much of the -to-MathML translator only deals with the tokens, not the character stream.

1.5.1. Regular expression for math mode

The following regular expression greedily matches sequences of characters to be considered as tokens in “math mode”. The matching sequences are grouped by parentheses.

(\\begin|\\operatorname|\\mathrm|\\mathop|\\end)\s*\{\s*([A-Z a-z]+)\s*\}match commands for environments
|(\\[a-zA-Z]+|\\[\\#\{\},:;!])match other commands
|(\s+)match blank token
|([0-9\.]+)match numbers
|([\$!"#%&'()*+,-.\/:;<=>?\[\]^_`\{\|\}~])match tokens for operators

For reasons that will be explained later, the regular expression just given omits the matching of identifiers (constants or variables denoted by Roman letters). Assuming, for the moment, that such identifiers are simply the letters of the Roman alphabet, we can give an example of applying this regular expression. Using it, the markup \sin(\frac 1 {xy^2}) (in math mode) will be split into these tokens:

\sin ( \frac 1 { x y ^ 2 } )

Here, for clarity, the blank tokens consisting of only spaces have been ignored, because blank tokens will eventually be displayed as nothing anyway.

1.5.2. Regular expression for text mode

The regular expression for math mode works well for mathematics written in that do not require switching out of “math mode”. To elaborate, some mathematical formulae make use of natural-language descriptions like this one:

B = \{ z \in A : \text{where $f(z)$ is purely imaginary} \}

In this case, the English text “where … is purely imaginary” must be parsed as English text, and not as the mathematical variables w, h, e, r, ….

has the notion of modes that distinguish between symbolic formulae and natural-language text. The former is handled by the “math mode”, and the latter by the “horizontal mode” or “paragraph mode”. Since always processes markup a character at a time, the modes do not affect the syntax of the markup, but rather they affect how the input characters are typeset into the resulting document.

However, our markup parser works differently; it parses a formulae written in into tokens using regular expressions, and the regular expressions are not valid for parsing natural-language text. So to parse natural-language text embedded into mathematical formulae, we need a separate set of regular expressions to parse natural-language text. When embedded text is encountered we must switch into the separate set of regular expressions for parsing it.

Here is the regular expression for parsing natural-language text (outside of “math mode”), possibly with embedded commands:

[\${}\\]
|\\[a-zA-Z]+match commands
|[^{}\$]+

1.5.3. Integrated parser

So far, we have presented two sets of regular expressions. The two sets cannot be combined into one, because selecting the correct parsing mode depends on counting nesting levels of matching braces ({ and }) or math-mode delimiters ($). And regular expressions provably cannot do this — they do not provide recursively-defined subexpressions.

Thus, full markup cannot be parsed using regular expressions alone. It could be parsed with a context-free grammar instead. However, parsing a general context-free grammar, using a parser generator, is overkill for our purposes — parser generators are not readily available for the JavaScript language, and generators from scripting languages are likely to be less efficient from a hand-rolled parser.

Actually, a hand-rolled parser is not difficult to implement.

The text, textrm, textsl, textit, texttt, and textbf commands switch out of math mode and into text mode (’s horizontal mode to be precise. This means that whatever that occurs inside these commands cannot be tokenized as usual math tokens.

(\\textrm|\\textsl|\\textit|\\texttt|\\textbf|\\text|\\hbox)

1.5.3.1. JavaScript

const tokenize_re = /Regular expression for tokenizing  input|([a-zA-Z])/g;

const tokenize_text_re = /[\${}\\]|\\[a-zA-Z]+|[^{}\$]+/g;

const tokenize_text_commands = {
  '\\textrm': 1,
  '\\textsl': 1,
  '\\textit': 1,
  '\\texttt': 1,
  '\\textbf': 1,
  '\\textnormal': 1,
  '\\text': 1,
  '\\hbox': 1,
  '\\mbox': 1
};

function tokenize_latex_math(input)
{
  var result = new Array();
  var in_text_mode = 0;
  var brace_level = [];
  var pos = 0;

  if(input.charAt(0) == '$' && 
     input.charAt(input.length-1) == '$')
    input = input.slice(1, input.length-1);

  while(1) {
    if(!in_text_mode) {
      tokenize_re.lastIndex = pos;
      var m = tokenize_re.exec(input);
      pos = tokenize_re.lastIndex;

      if(m == null) {
        return result;
      } else if(m[1] != null) {
        result.push(m[1], m[2]);
      } else if(m[3] == '\\sp') {
        result.push('^');
      } else if(m[3] == '\\sb') {
        result.push('_');
      } else {
        if(m[0] == '$') {
          in_text_mode = 1;
        } else if(m[4] != null) {
          continue;
        } else if(m[3] != null && m[3] in tokenize_text_commands) {
          in_text_mode = 2;
          brace_level.push(0);
        } 

        result.push(m[0]);
      }
    } else {
      tokenize_text_re.lastIndex = pos;
      var m = tokenize_text_re.exec(input);
      pos = tokenize_text_re.lastIndex;
      
      if(m == null) {
        return result;
      } else if(m[0] == '$') {
        in_text_mode = 0;
      } else if(m[0] == '{') {
        brace_level[brace_level.length-1]++;
      } else if(m[0] == '}') {
        if(--brace_level[brace_level.length-1] <= 0) {
          in_text_mode = 0;
          brace_level.pop();
        }
      }
      result.push(m[0]);
    }
  }
}

1.5.3.2. Python

  tokenize_re = re.compile(ur"""Regular expression for tokenizing  input|([a-zA-Z])""")

  tokenize_text_re = re.compile(ur"""[\${}\\]|\\[a-zA-Z]+|[^{}\$]+""")

  tokenize_text_commands = {
    u'\\textrm': 1,
    u'\\textsl': 1,
    u'\\textit': 1,
    u'\\texttt': 1,
    u'\\textbf': 1,
    u'\\text': 1,
    u'\\textnormal': 1,
    u'\\hbox': 1,
    u'\\mbox': 1,
  }

  def tokenize_latex_math(self, input):

    in_text_mode = 0;
    brace_level = [];
    pos = 0;

    input = unicode(input)
    if input[0] == u'$' and input[-1] == u'$':
      input = input[1:-1]

    while True:
      if not in_text_mode:
        m = self.tokenize_re.match(input, pos)
        if m is None:
          return

        pos = m.end()
        
        if m.group(1) is not None:
          self.tokens.extend(m.group((1,2)))
        elif m.group(3) == u"\\sp":
          self.tokens.append(u"^")
        elif m.group(3) == u"\\sb":
          self.tokens.append(u"_")
        else:
          if m.group(0) == u"$":
            in_text_mode = 1
          elif m.group(4) is not None:
            continue
          elif m.group(3) in self.tokenize_text_commands:
            in_text_mode = 2;
            brace_level.append(0)

          self.tokens.append(m.group(0))

      else:
        m = self.tokenize_text_re.match(input, pos)
        if m is None:
          return
        
        pos = m.end()
        if m.group(0) == u"$":
          in_text_mode = 0
        elif m.group(0) == u"{":
          brace_level[-1] += 1
        elif m.group(0) == u"}":
          brace_level[-1] -= 1
          if brace_level[-1] <= 0:
            in_text_mode = 0
            brace_level.pop()
            
        self.tokens.append(m.group(0))

1.6. Parsing expressions

We parse and in two stages: the first stage splits the input string into tokens, then the transformation rules march through these tokens (generally from left to right), translating them to the MathML.

The transformation rules are written as procedures that call each other recursively to build up MathML sub-trees.

List of limit commands
Parse a piece
Parse a subexpression
Parse a subexpression chain
Parse optional arguments

1.6.1. Parse a piece

In our parlance, a piece is a token or a group of tokens that takes as one piece. Examples:

xy consists of two pieces, x and y, each piece being also a token
{xy} has only one piece consisting of 4 tokens, {xy}
2^nm The piece occurring after ^ is the token n
2^{nm} The piece occurring after ^ is {nm}, consisting of 4 tokens
2^\mathrm{F}_2 The piece occurring after ^ is \mathrm{F}, consisting of 4 tokens

The necessity of distinguishing between a piece and a token should be clear from these examples. In 2^mn, the superscript, as parsed by , is just m. To get an mn superscript, the two tokens must be enclosed as one piece: {mn}.

More precisely, a piece is defined to be one of the following:

  1. a token that is not { or } or a command token
  2. a series of tokens beginning with { and ending with a matching }
  3. a command token, that does not begin a block, along with all of its arguments
  4. a block including its logical argument

The following procedure, piece-to-mathml, takes the next piece and translates it to MathML. To do this, it reads in the next token, and determines the appropriate action depending on the type of the token.

(procedure (piece-to-mathml)
  (var token (read-token))
  (var result (nil))

  (cond ((str-eq? token "{")
         (advance-token)
         (set result (call subexpr-chain-to-mathml $hard-stop-tokens))
         (cond ((str-eq? (read-token) "}")
                (advance-token))))

        ((in? token $relation-symbols)
         (set result (result-element "mo" 
                        ((get token $relation-symbols))))
         (advance-token))

        ((in? token $operator-symbols)
         (set result (result-element "mo" 
                        ((get token $operator-symbols))))
         (advance-token))
        
        ((in? token $left-delimiters)
         (set result (result-element "mo" 
                        ((get token $left-delimiters))))
         (advance-token))
        
        ((in? token $right-delimiters)
         (set result (result-element "mo" 
                        ((get token $right-delimiters))))
         (advance-token))

        ((in? token $word-operators)
         (set result (result-element "mi" (("mathvariant" "normal"))
                                          ((get token $word-operators))))
         (advance-token))

        ((in? token $greek-letters)
         (set result (result-element "mi" (("fontstyle" "normal"))
                                          ((get token $greek-letters))))
         (advance-token))

        ((in? token $named-identifiers)
         (set result (result-element "mi"
                        ((get token $named-identifiers))))
         (advance-token))

        ((in? token $punct-and-space)
         (set result (result-element "mtext"
                        ((get token $punct-and-space))))
         (advance-token))

        ((in? token $tex-commands)
         (advance-token)
         (set result (call (get token $tex-commands))))

        (else
         (set result (result-element "mn" (token)))
         (advance-token)))

  (return result))

1.6.2. Differences from

In the concept that corresponds roughly to a piece is instead called an atom. However, “atom” is the wrong word for the concept, since “atom” means indivisible, but pieces can be groups of atoms.

The two terms piece and atom technically do not refer to the same thing, because of a technical difference in which we do parsing. Recall that we parse a character sequence consisting of digits (and decimal points) as one token representing a number, rather than as separate tokens for each digit. So for example, 431 is one token — and hence one piece — in our sense, while in there are 3 atoms: 4 3 1.

This can lead to incompatibilities when parsing markup. For example, in , the markup \frac42 6 means the same as \frac{4}{2} 6, and not \frac{42}{6}.

Also, does not allow blocks to occur whenever the context demands an atom. So for example, this markup is illegal:

x^\begin{smallmatrix} 1 & 2 \\ 3 & 4\end{smallmatrix},

although this is legal:

x^{\begin{smallmatrix} 1 & 2 \\ 3 & 4\end{smallmatrix}}.

In our parser both are legal — that simplifies the implementation slightly. Technically it constitutes an incompatibility, but only invalid markup is effected, so there is little to worry about.

1.6.3. Parse a subexpression

A subexpression, in our case, means a piece along with its attached subscripts and superscripts.

FIX ME. This definition will have to be expanded eventually, to allow semantic parsing.

(procedure (subexpr-to-mathml)
  (var result (nil))

  Parse any prescripts for tensor indices

  (var limit-style (in? (read-token) $limit-commands))

  (cond ((null? (read-token))
         (cond ((not-null? mmultiscripts)
                (prepend mmultiscripts (result-element "mrow") mprescripts)
                (return mmultiscripts))
               (else
                (return (result-element "mrow")))))
        ((in? (read-token) $left-delimiters)
         (set result (call heuristic-subexpression)))
        (else 
         (set result (call piece-to-mathml))))

  (var base result)
  Incorporate the following  subscript and superscript, if present
  Parse any postscripts for tensor indices
  Place the sub- and superscripts

  (return result))

1.6.4. Subscripts and superscripts

Parsing subscripts and superscripts is not entirely straightforward. Firstly, signals subscripts and superscripts by placing ^ or _ tokens after the base object, whereas MathML signals subscripts and superscripts by an element that wraps the base object. In practice, this means we must check for the ^ or _ tokens separately after processing other kinds of objects, and go back and wrap the previous object if ^ or _ is found.

Secondly, subscripts and superscripts in actually stand for underscripts and overscripts, when they are used in big operators (in display style). This is called the “limit style” of subscripts and superscripts. MathML makes clear distinctions between subscripts/superscripts, and underscripts/overscripts, and we must disambiguate between the two styles by looking at the object that the ^ or _ token is being applied to.

Finally, parsing tensor indices (multiple indices for a single base expression) opens another can of worms.

The following subportion of the subexpr-to-mathml procedure determines if the token immediately following a piece is ^ or _, and if so, puts the previous result element into a MathML subscript (msub), superscript (msup), underscript (munder), or overscript (mover) wrapper. Also, a combination of a subscript and a superscript should result in a combined MathML msubsup or munderover element, to avoid the scripts being staggered.

If no subscript or superscript is found, the previous result element is left alone.

(var subscript (nil))
(var superscript (nil))

(cond
  ((str-eq? (read-token) "_")
   (advance-token)
   (set subscript (call piece-to-mathml)))
  ((str-eq? (read-token) "^")
   (advance-token)
   (set superscript (call piece-to-mathml))))

(cond
  ((str-eq? (read-token) "_")
   (advance-token)
   (set subscript (call piece-to-mathml)))
  ((str-eq? (read-token) "^")
   (advance-token)
   (set superscript (call piece-to-mathml))))


(cond 
  ((not-null? mmultiscripts)
   (prepend mmultiscripts
            base
            mprescripts)
   (prepend mmultiscripts 
            (if (not-null? subscript) 
                subscript  
                (result-element "none"))
            mprescripts)
   (prepend mmultiscripts 
            (if (not-null? superscript) 
                superscript 
                (result-element "none"))
            mprescripts)))
(cond 
  ((not-null? mmultiscripts)
   (set result mmultiscripts))

  ((and (not-null? subscript) (not-null? superscript))
   (set result (result-element 
                   (if limit-style "munderover" "msubsup")
                   (base subscript superscript))))

  ((not-null? subscript)
   (set result (result-element 
                   (if limit-style "munder" "msub")
                   (base subscript))))
         
  ((not-null? superscript)
   (set result (result-element 
                   (if limit-style "mover" "msup")
                   (base superscript)))))

1.6.4.1. Commands whose subscripts and superscripts are limits

subscripts and superscripts applied to the following commands should be understood as MathML underscripts and overscripts. For other commands, subscripts and superscripts continue to be mapped to subscripts and superscripts.

(var $limit-commands (table
  Big operators
  Big word operators
  ("\\underbrace" (nil))
  ("\\overbrace"  (nil))
  ("\\underline"  (nil))
  ("\\overline"   (nil))
))

When mathematical expressions are to be typeset inline, even the big operators drop the limit style. Nevertheless, the limits are still encoded as underscripts and overscripts in MathML; MathML has a separate attribute (movablelimits) to indicate that the limit style should be dropped.

1.6.4.2. Tensor indices

(var mmultiscripts (nil))
(var mprescripts (nil))

(cond ((and (str-eq? (read-token 0) "{")
            (str-eq? (read-token 1) "}")
            (or (str-eq? (read-token 2) "_")
                (str-eq? (read-token 2) "^")))
       (set mmultiscripts (result-element "mmultiscripts"))
       (set mprescripts (result-element "mprescripts"))
       (append mmultiscripts mprescripts)

       (while (and (str-eq? (read-token 0) "{")
                   (str-eq? (read-token 1) "}")
                   (or (str-eq? (read-token 2) "_")
                       (str-eq? (read-token 2) "^")))

              (var subscript (nil))
              (var superscript (nil))

              (advance-token)
              (advance-token)

              (cond ((str-eq? (read-token) "_")
                     (advance-token)
                     (set subscript (call piece-to-mathml)))
                    ((str-eq? (read-token) "^")
                     (advance-token)
                     (set superscript (call piece-to-mathml))))
              (cond ((str-eq? (read-token) "_")
                     (advance-token)
                     (set subscript (call piece-to-mathml)))
                    ((str-eq? (read-token) "^")
                     (advance-token)
                     (set superscript (call piece-to-mathml))))

              (append mmultiscripts 
                (if (not-null? subscript) 
                    subscript 
                    (result-element "none")))
              (append mmultiscripts 
                (if (not-null? superscript) 
                    superscript 
                    (result-element "none"))))))
(while (and (str-eq? (read-token 0) "{")
            (str-eq? (read-token 1) "}")
            (or (str-eq? (read-token 2) "_")
                (str-eq? (read-token 2) "^")))

       (cond ((null? mmultiscripts)
              (set mmultiscripts (result-element "mmultiscripts" (base)))
              (set mprescripts (nil))
              (cond ((or? (not-null? superscript)
                          (not-null? subscript))
                     (append mmultiscripts 
                             (if (not-null? subscript) 
                                 subscript 
                                 (result-element "none")))
                     (append mmultiscripts 
                             (if (not-null? superscript) 
                                 superscript 
                                 (result-element "none")))))))

       (var subscript (nil))
       (var superscript (nil))

       (advance-token)
       (advance-token)

       (cond ((str-eq? (read-token) "_")
              (advance-token)
              (set subscript (call piece-to-mathml)))
             ((str-eq? (read-token) "^")
              (advance-token)
              (set superscript (call piece-to-mathml))))
       (cond ((str-eq? (read-token) "_")
              (advance-token)
              (set subscript (call piece-to-mathml)))
             ((str-eq? (read-token) "^")
              (advance-token)
              (set superscript (call piece-to-mathml))))

       (prepend mmultiscripts 
         (if (not-null? subscript) 
             subscript 
             (result-element "none"))
         mprescripts)
       (prepend mmultiscripts 
         (if (not-null? superscript) 
             superscript 
             (result-element "none"))
         mprescripts))

  

1.6.4.3. Degree marks

(cond ((and (str-eq? token2 "^") (str-eq? (read-token) "\\circ"))
       (advance-token)
       (append result 00b0 °))
       (cond ((str-eq? (read-token) "_")
              (set token2 "_")
              (advance-token))
             (else
              (set token2 (nil)))))

1.6.5. Prime marks

In , the markup ' is equivalent to ^{\prime}. This sets the \prime symbol — which is a vertically-centered glyph — to the superscript position, so that it appears as the familiar mathematical prime mark.

There is some ambiguity in how a prime mark should be translated:

  • In synthetic geometry, variables using prime marks — for example, A′ and A″ — are usually intended to be related to the unadorned variable — A — but refer to distinct objects or quantities.
  • On the other hand, in differential calculus, a prime mark on a function — for example, f′ — means that the function f is to be differentiated. The mathematical operation of differentiation of a function is denoted by the prime mark .

In view of these two interpretations of the prime mark, we employ the following heuristic approach when converting formulae with prime marks into MathML:

  • In , the prime mark symbol is represented by the ASCII apostrophe character ('), but it should be translated to the proper Unicode prime mark character for MathML (2032 ′).
  • If one or two prime marks follow a letter symbol, then the prime marks are considered to be affixed to the symbol, and part of its identity. Thus the MathML result would be: <mi>a′</mi>, for example. In the case of two prime marks, the marks should be coalesced into one Unicode double-prime character (2033 ″).
  • If there are so many (three or more) prime marks following a letter symbol, then it is likely that the prime marks play the role of an operator. Thus in MathML: <mi>f</mi> <mo>′</mo> <mo>′</mo> <mo>′</mo> .
  • If the symbol before the prime marks is not a lone letter, then it is likely that the prime marks serve as operators. For example: (f + g)′ for the derivative of the sum of two functions.

The following procedure fragment implements this heuristic.

        ((in? token $named-identifiers)
         (set result (call piece-to-mathml))
         (cond ((str-eq? (read-token) "'")
                (advance-token)
                (cond ((str-eq? (read-token) "'")
                       (advance-token)
                       (cond ((str-eq? (read-token) "'")
                              (advance-token)
                              Add arbitrary number of prime marks as operators)
                             (else
                              (append result 2033 ″))))
                      (else
                       (append result 2032 ′))))))

For three or more prime marks, loop through them and enter them separately as operators.

(set result (result-element "mrow" 
              (result
               (result-element "mo" 2032 ′)
               (result-element "mo" 2032 ′)
               (result-element "mo" 2032 ′))))
(while (str-eq? (read-token) "'")
       (append result (result-element "mo" 2032 ′))
       (advance-token))

1.6.6. Parse a subexpression chain

(procedure (subexpr-chain-to-mathml stop-tokens)
  (var result (nil))
  (var mrow   (nil))
  (var mfrac  (nil))
  (var wrapped-result (nil))

  (while (and (not-null? (read-token)) 
              (not (in? (read-token) stop-tokens)))

    (cond 
      Parse \over
      Parse \choose

      (else
        (var node 
          Parse subexpressions, organizing them in a tree according to precedence
          )
        Collect results
        
    )))

  (cond ((not-null? mfrac)
         (append mfrac result)
         (return wrapped-result))
        (else
         (return result))))
  (var node (call subexpr-to-mathml))
  Collect results
    (cond ((not-null? mrow)
           (append mrow node))
          ((not-null? result)
           (set mrow (result-element "mrow" (result node)))
           (set result mrow))
          (else
           (set result node)))

1.6.6.1. Parsing old-style fractions and binomial coefficients

Fractions and binomial coefficients in the old style are written using the \over and \choose commands. The design of these commands, to put it mildly, is terrible. These commands take their arguments on the left- and right- sides, up to the edges of the current block (delimited by braces { and }). They are a pain to parse because they do not behave like other and commands and require special cases in code.

Nevertheless some authors still use them instead of the better designed — if a little more verbose — equivalents, so we must support them.

Whenever we encounter an \over or \choose, the current subexpression up to that point — i.e. the left argument to the command — is stowed away. Then an empty mrow is made, which will accumulate the rest of the arguments (the right argument to the command). At the end of the expression, the two pieces are put together.

Two separate variables, mfrac and frac are needed, to provide a location to store the arguments, and to signal at the end that a fraction (or binomial coefficient) is to be returned.

((str-eq? (read-token) "\\over")
 (advance-token)
 (set mfrac (result-element "mfrac" (result)))
 (set wrapped-result mfrac)
 (set mrow (nil))
 (set result (nil)))
((str-eq? (read-token) "\\choose")
 (advance-token)
 (set mfrac (result-element "mfrac" (("linethickness" "0")) (result)))
 (set wrapped-result (result-element "mrow" ((result-element "mo" "(")
                                     mfrac
                                     (result-element "mo" ")"))))
 (set mrow (nil))
 (set result (nil)))

1.6.7. Parsing arguments to commands

A required argument to a command (such as the argument {ab} in the markup \frac{ab}{2} is, by definition, a piece. An argument is almost always enclosed in between { and }, and books and articles on always talk about arguments as if this is always the case. But the book, and the implementation itself, says otherwise: arguments can be atoms without the braces. (In fact, itself (but not ) does not even have the concept of a “command with arguments”.)

Consequently, our parser does not need to specifically care about arguments either — we can just parse them as pieces.

1.6.7.1. Optional arguments

does have a concept of “optional argument” to commands. At a fundamental level optional arguments are different from required arguments. An optional argument is not a piece, but a series of pieces: the first piece being the opening bracket [, and the last piece being the closing bracket ]. Note that the brackets are not syntax characters of , and can be used literally in markup outside of optional arguments.

All this means we must optional argument processing must be handled by a special handler (below), and the handler must be explicitly called. Fortunately, optional arguments are very rarely used, so the added complexity is not too much to bear.

(var $optional-arg-stop-tokens (table
  Hard stop tokens
  ("]" (nil))
))

(procedure (optional-arg-to-mathml)
  (cond ((str-neq? (read-token) "[")
         (return (nil))))
  (advance-token)
  
  (var result (call subexpr-chain-to-mathml $optional-arg-stop-tokens))
  (cond ((str-eq? (read-token) "]")
         (advance-token)))
  (return result))

1.6.8. Hard stop tokens

("&" (nil))
("\\\\" (nil))
("}" (nil))
("$" (nil))
("\\end" (nil))
("\\right" (nil))
("\\bigr" (nil))
("\\Bigr" (nil))
("\\biggr" (nil))
("\\Biggr" (nil))
("\\choose" (nil))
("\\over" (nil))

1.7. Heuristic semantic parsing

Grouping
Parsing according to precedence levels

1.7.1. Grouping

The -to-MathML parser also does grouping of subexpressions. This means to convert a series of symbols that occur without nesting in , but really constitute a logical subexpression and should be grouped as such in the MathML output. For example, the markup (x+y+2z)^2 should be translated as:

<msup>
 <mrow>
  <mo>(</mo>
  <mrow>
   <mi>x</mi>
   <mo>+</mo>
   <mi>y</mi>
   <mrow>
    <mn>2</mn>
    <mi>z</mi>
   </mrow>
  </mo>)</mo>
 </mrow>
 <mn>2</mn>
</msup>

The straightforward translation misses the proper semantic encoding, and is not good MathML markup:

<mrow>
 <mo>(</mo>
 <mi>x</mi>
 <mo>+</mo>
 <mi>y</mi>
 <mo>+</mo>
 <mn>2</mn>
 <mi>z</mi>
 <msup>
  <mo>)</mo>
  <mn>2</mn>
 </msup>
</mrow>

(var $hard-stop-tokens (table Hard stop tokens ))

(var $right-delimiter-stop-tokens (table
  Hard stop tokens
  Right delimiters 
))

(procedure (heuristic-subexpression)
  (var result (result-element "mrow"))
  (append result (call piece-to-mathml))
  (append result (call subexpr-chain-to-mathml $right-delimiter-stop-tokens))
  (cond ((and (not-null? (read-token))
              (not (in? (read-token) $hard-stop-tokens)))
         (append result (call piece-to-mathml))))
  (return result))

1.7.2. Use of formal context-free grammars

In many situations, the first step for a computer to understand a language is being able to deduce a parse tree from the input (a list of tokens), in which the substructures in the input are clearly isolated.

Although basic parsing can be done without a parse tree — indeed, the system itself does not generate parse trees to digest its input — the vast majority of math formulae and expressions, as used by practitioners of mathematics, are based on nesting substructures. An example has already been shown at the beginning of this chapter.

The heuristic semantic parsing pursued by our parser, in essence, is the deduction of a parse tree for mathematical formulae without being given the exact formal grammar that the formulae are written in. The exact formal grammar, of course, does not exist.

Nevertheless, we can try to construct the required formal grammar. We will not actually be using a parser generator on the formal grammar to write the parser, because of problems with efficiency, robustness, and ambiguity. But it is still useful to write down pieces of the formal grammar, as a concise summary of what parse tree (i.e. MathML tree) we are to generate from the input.

RelationExpr | Expr RelationOp Relation
ExprTerm | Term AddOp Expr
TermSubexpr (MultiplyOp Term | Term)?
SubexprIdentifier | Numeral | ( Relation ) | BigOperation | FunctionApplication
FunctionApplicationIdentifier ( (Identifier | Numeral | Relation | BigOperation) )
BigOperationBigOp BigOperandExpr
BigOperandExprTerm | BigOperation
AddOp+ | − | ⊕ | ∪ | ∨ | …
MultiplyOp⋅ | ∕ | × | ⊗ | ∩ | ∧ | …
BigOp∑ | ∫ | ⋃ | ⋂ | …
RelationOp= | ∊ | ⊂ | ⊆ | …
Identifier[a-zA-Z] | α | … | Ω | sin | exp | …
Numeral−? [0-9]* (. [0-9]+)?

1.7.3. Precedence levels for operators

The goal is to expressions containing operators mixed together in a list, but turn the output into a tree, where each node is a subexpression where the operator leaf nodes all have the same precedence level. And of course, operators that bind more tightly are to occur nearer to the bottom of the tree (away from the root).

Definition of precedence levels for common operators
The general parsing procedure, for infix operators
Procedure to parse invisible groups

1.7.3.1. Parsing the general infix operator with precedence

The general procedure to parse infix operators with precedence is (surprisingly) easy and straightforward.

Consider first a simple case, an input like 3*4*5 + 6*7, where we are to separate the addition terms and the multiplication factors. Suppose that the current token is at 3. We read that in, and process it. Next, we look at the next token, which is *. This tells us that the next token (4) that follows is the other operand of the multiplication *. So we accumulate in our buffer: 3*4.

Then, we read the next token, the second * operator. This tells us again that whatever follows (5) is an operand. We then accumulate in the buffer: 3*4*5.

Finally, we encounter the + operator. Since this is not the same precedence level as the * operators we have been considering, we immediately flush the buffer, and start anew. Then the + is output, and processing at the token 6 progresses as at the first token 3.

Parsing, in general, of more than two precedence levels uses stacked functions. Each level of depth of the function calls handles one level of precedence. The operation of “flushing the buffer” then is just returning from a function call with the buffered result, passing control to the function on top — the caller — which will handle the next precedence level. See the next section for a concrete example.

Here is the actual implementation:

(procedure (collect-precedence-group operators stop-tokens reader)
  (var result (call reader stop-tokens))
  (var mrow (nil))

  (while (and (not-null? (read-token))
              (not (in? (read-token) stop-tokens))
              (in? (read-token) operators))
    (cond ((null? mrow)
           (set mrow (result-element "mrow" (result)))
           (set result mrow)))

    (append mrow (call piece-to-mathml))

    (cond ((and (not-null? (read-token))
                (in? (read-token) stop-tokens))
           (return result))
          (else
           (append mrow (call reader stop-tokens)))))

  (return result))

The parameter operators is a table of operators at the same given precedence level. The parameter reader is a function object that will be called to parse tokens at the next precedence level (deeper into the tree).

The mrow variable points to a mrow result element which serves as a buffer. But if the buffer is to contain one object (subtree) only, the mrow wrapper will be omitted.

Most of the apparent complexity in the collect-precedence-group procedure is actually checking when to stop processing. Clearly, the procedure must not loop past the end of input, but it also must stop processing at certain tokens (given by the parameter table stop-tokens). For example, the procedure typically stops processing at the } token marking the end of the current block. (Obviously, expression chains should not leak through a syntactic block.)

Compared to parsing precedence levels based on a context-free grammar, our implementation has a serious “win” in that no explicit back-tracking of tokens is required. In effect, the back-tracking happens by having the function call unwind. Thus the implementation is simplified with good performance characteristics: the time to process the input remains linear in the number of tokens; and the run-time memory required is linear in the nesting depth of the mathematical structures represented by the input markup. Yet, because each parsing level can be implemented by different functions, we still retain a lot of flexibility.

1.7.3.2. Invocation to parse a subexpression chain

This is the hook into subexpr-chain-to-mathml that parses subexpressions with precedence levels.

Level 1: relations

(call collect-precedence-group $relations-precedence-group stop-tokens

Level 2: addition-like operators

(lambda (stop-tokens) (call collect-precedence-group $addition-precedence-group stop-tokens

Level 3: multiplication-like operators

(lambda (stop-tokens) (call collect-precedence-group $multiplication-precedence-group stop-tokens

Level 4: invisible multiplication-like operators

collect-invisible-group)))))

1.7.3.3. Definition of precedence levels for common operators

(var $relations-precedence-group $relation-symbols)
(var $addition-precedence-group (table
  ("+" (nil))
  ("-" (nil))
  ("\\oplus" (nil))
))

(var $multiplication-precedence-group (table
  ("*" (nil))
  ("\\times" (nil))
  ("\\cdot" (nil))
  ("/" (nil))
))

1.7.3.4. Parsing groups with invisible multiplication

(procedure (collect-invisible-group stop-tokens)
  (var result (call subexpr-to-mathml))
  (var mrow (nil))

  (while (and (not-null? (read-token))
              (not (in? (read-token) stop-tokens))
              (or (in? (read-token) $named-identifiers)
                  (in? (read-token) $left-delimiters)))

    (cond ((null? mrow)
           (set mrow (result-element "mrow" (result)))
           (set result mrow)))

    (append mrow (result-element "mo" ("\u2062")))

    (cond ((and (not-null? (read-token))
                (in? (read-token) stop-tokens))
           (return result))
          (else
           (append mrow (call subexpr-to-mathml)))))

  (return result))

1.8. Greasemonkey implementation

1.8.1. Set-up for the user script

// This script was automatically generated from a literate source.  
// Do not edit this file; look at the literate source instead!
//
// Greasemonkey user script to 
// Display LaTeX in Web pages by transforming to MathML
//
// Home page: http://gold-saucer.afraid.org/mathml/greasemonkey/
//
// --------------------------------------------------------------------
Copyright notice
// --------------------------------------------------------------------

User script data
Subroutines for result trees
Workaround for Mozilla not supporting mathvariant

The following are basic metadata describing the user script, required by the Greasemonkey extension. Also sets the default Web pages where the user script is to activate.

// ==UserScript==
// @name          Display LaTeX
// @namespace     http://gold-saucer.afraid.org/mathml/greasemonkey/
// @description   Display LaTeX in Web pages by transforming into MathML
// @include       http://gold-saucer.afraid.org/mathml/greasemonkey/
// @include       http://gold-saucer.afraid.org/writings/Display_LaTeX_sandbox
// @include       http://planetmath.org/*
// ==/UserScript==

1.8.2. DOM trees

The MathML XML namespace, needed for creating MathML elements

const mmlns = 'http://www.w3.org/1998/Math/MathML'; Create MathML result element Append MathML result element to another element’s content Prepend MathML result element before another child element Change attribute of result element Change attribute of result element Compatibility layer for the Epiphany browser Greasemonkey configuration

1.8.2.1. Create MathML result element

function result_element(tag, num_attrs)
{
  var node = document.createElementNS(mmlns, tag);

  var k = 2;
  while(--num_attrs >= 0) {
    if(arguments[k+1] != null) {
      node.setAttribute(arguments[k], arguments[k+1]);
    }
    k += 2;
  }
    
  for(; k < arguments.length; k++) {
    if(arguments[k] != null) {
      if(typeof(arguments[k]) == 'string')
        node.appendChild(document.createTextNode(arguments[k]));
      else
        node.appendChild(arguments[k]);
    }
  }

  return node;
}

1.8.2.2. Append child to MathML result element

function result_element_append(parent, child)
{
  if(parent != null && child != null) {
    if(typeof(child) == 'string')
      parent.appendChild(document.createTextNode(child));
    else
      parent.appendChild(child);
  }
}
function result_element_prepend(parent, child, next)
{
  if(next == null)
    result_element_append(parent, child);
  else if (parent != null && child != null)
    parent.insertBefore(child, next);
}

1.8.2.3. Modify attribute of result element

function result_set_attr(elem, attr, value)
{
  if(elem != null && attr != null) {
    if(value != null)
      elem.setAttribute(attr, value);
    else
      elem.removeAttribute(attr);
  }
}
function result_append_attr(elem, attr, value)
{
  if(elem != null && attr != null) {
    var old_value = elem.getAttribute(elem, attr);
    if(old_value == null)
      elem.setAttribute(attr, value);
    else
      elem.setAttribute(attr, old_value + value);
  }
}

1.8.3. Configuration items for user script

if(!this.GM_getValue) {
  this.GM_getValue = function(key, value) { return value; }
  this.GM_log = function() {}
}
if(this.GM_registerMenuCommand) {
  GM_registerMenuCommand("Enable native display of math images", 
    function() {
      GM_setValue("patch-images", true);
      do_patch_images = true;
      patch_element(document.documentElement);
    });
  GM_registerMenuCommand("Disable native display of math images",
    function() {
      GM_setValue("patch-images", false);
    });
}

1.8.4. Workaround for Mozilla not supporting mathvariant

Unfortunately, Mozilla does not support the mathvariant attribute in MathML 2.0. We fix this by directly substituting any Roman letters by their variant characters.

This process is done in JavaScript and not in the S-Expressions mini-language, because the mini-language is not powerful enough for this, and the resultant MathML is not portable anyway. (The list of character codes seems to be specific to the Mozilla and not standard.)

const char_map = {
  'script': [
    '\uEF35', '\u212C', '\uEF36', '\uEF37', '\u2130', '\u2131', 
    '\uEF38', '\u210B', '\u2110', '\uEF39', '\uEF3A', '\u2112', 
    '\u2133', '\uEF3B', '\uEF3C', '\uEF3D', '\uEF3E', '\u211B', 
    '\uEF3F', '\uEF40', '\uEF41', '\uEF42', '\uEF43', '\uEF44', 
    '\uEF45', '\uEF46' ],

  'fraktur': [
    '\uEF5D', '\uEF5E', '\u212D', '\uEF5F', '\uEF60', '\uEF61', 
    '\uEF62', '\u210C', '\u2111', '\uEF63', '\uEF64', '\uEF65', 
    '\uEF66', '\uEF67', '\uEF68', '\uEF69', '\uEF6A', '\u211C', 
    '\uEF6B', '\uEF6C', '\uEF6D', '\uEF6E', '\uEF6F', '\uEF70', 
    '\uEF71', '\u2128' ],

  'double-struck': [
    '\uEF8C', '\uEF8D', '\u2102', '\uEF8E', '\uEF8F', '\uEF90', 
    '\uEF91', '\u210D', '\uEF92', '\uEF93', '\uEF94', '\uEF95', 
    '\uEF96', '\u2115', '\uEF97', '\u2119', '\u211A', '\u211D', 
    '\uEF98', '\uEF99', '\uEF9A', '\uEF9B', '\uEF9C', '\uEF9D', 
    '\uEF9E', '\u2124' ],
};

const uppercase_re = /[A-Z]/;

function fix_mathvariant(node, style)
{
  if(node.nodeType == node.TEXT_NODE) {
    if(style != null && style != '' && style in char_map) {
      node.data = node.data.replace(uppercase_re,
        function(s) { return char_map[style][s.charCodeAt(0)-65] });
    }
  } else if(node.nodeType == node.ELEMENT_NODE) {
    var new_style = node.getAttribute('mathvariant');
    if(new_style != null && new_style != '')
      style = new_style;

    for(var i=0; i < node.childNodes.length; i++)
      fix_mathvariant(node.childNodes.item(i), style);
  }
}

1.8.5. Patching the document

The user script calls patch_element on the root element to patch all markup within the document.

Tokenizing  input
Supplemental processing of MathML
Procedure to patch an img
Procedure to patch a text node
Procedure to patch a DOM subtree
var do_patch_images = GM_getValue("patch-images", false);
var delayed_patch = GM_getValue("delayed-patch", false);
patch_element(document.documentElement);

1.8.5.1. Patch a DOM subtree

The patch_element procedure takes the given node, and replaces all text in $ signs occurring within that node. It recursively calls itself to handle the children of the given node.

The procedure can be written more succinctly, without an explicit recursion, as:

var iter = document.createNodeIterator(
    document.documentElement, NodeFilter.SHOW_TEXT, null, true);

var n;
while(n = iter.nextNode()) {
  patch_text(n);
}

However, the DOM node iteration API is not supported by Mozilla, so we have to write the procedure the “long” way, as given below.

function patch_element(node)
{
  if(node.nodeType == node.TEXT_NODE)
    patch_text(node);
  else if(node.nodeType == node.ELEMENT_NODE) {

If the current node is a text box control of an HTML form, then do not replace markup there. Not only does it not display properly (the text box only displays only plain text), the user may well be editing some markup in the text box! Also, do not attempt to replace markup in (JavaScript) scripts either; the dollar sign used in JavaScript confuses the parser.

if(node.tagName == 'TEXTAREA' || node.tagName == 'textarea' || node.tagName == 'INPUT' || node.tagName == 'input' || node.tagName == 'SCRIPT' || node.tagName == 'script') return; if(do_patch_images && (node.tagName == 'IMG' || node.tagName == 'img')) { if(!delayed_patch) patch_img(node); else node.addEventListener("click", patch_img, false); return; } var child = node.firstChild; while(child) { var next = child.nextSibling; patch_element(child); child = next; } } }

1.8.5.2. Patch an image

It is easy to extend the math patching code so that it also patches the math images produced by LaTeX2HTML translator, and display them with the browser’s native MathML renderer instead.

Sometimes the browser’s MathML rendering turns out to look better than the raster images from LaTeX2HTML — especially if the user is reading Web pages at a font size smaller or larger than usual. Sometimes the MathML rendering is worse. So patching images is only done optionally.

But more importantly, by having the MathML translation code activate on LaTeX2HTML pages, we get more opportunities to test and debug our program.

function patch_img(node)
{

  

patch_img can also called from an event handler; in that case, the argument is an event object rather than a DOM node. We extract the DOM node stored in that event object.

if(node.currentTarget) node = node.currentTarget; var alt = node.getAttribute('alt'); if(alt == null || /^\\includegraphics|^\$\\displaystyle \\xymatrix/.test(alt)) return; var latex_string = null; Prefer use of the MATH comment, if present instead of ALT attribute if(!latex_string && /^\$.+\$$/.test(alt) && !(/\.{3} \.{3}/.test(alt))) { latex_string = alt; } if(latex_string == null) return; Do the -to-MathML translation if(mathml == null) return; node.parentNode.replaceChild(mathml, node); }

LaTeX2HTML has a stupid misfeature where it will truncate a long math formula in an ALT attribute. Thus we cannot always recover the original math formula from the ALT attribute alone. But at least, LaTeX2HTML does — most of the time — put the full formula in a <-- MATH --> HTML comment occurring before the image. Use the contents of that instead, if the comment is present.


  

Check for images of formulae produced by LaTeX2HTML

if((node.parentNode.tagName == 'DIV' && node.parentNode.getAttribute('CLASS') == 'mathdisplay') || (node.parentNode.tagName == 'SPAN' && node.parentNode.getAttribute('CLASS') == 'MATH')) { var parent = node.parentNode; var previous = parent.previousSibling; const non_whitespace = /[^\s]/;

Skip over whitespace nodes in the DOM

if(previous && previous.nodeType == node.TEXT_NODE && !non_whitespace.test(previous.data)) previous = previous.previousSibling;

Sometimes the comment appears in the previous paragraph

if(previous && previous.nodeType == node.ELEMENT_NODE && previous.tagName == 'P' && previous.lastChild) { previous = previous.lastChild; if(previous && previous.nodeType == node.TEXT_NODE && !non_whitespace.test(previous.data)) previous = previous.previousSibling; }

Extract the full formulae for the image from the comment, if found

if(previous && previous.nodeType == node.COMMENT_NODE) { latex_string = previous.data.replace(/^\s*MATH\s*/, '') .replace(/\s+$/, ''); } }

1.8.5.3. Patch a text node

  tokens = new Object();
  tokens.list = tokenize_latex_math(latex_string);
  tokens.list.push(null);
  tokens.index = 0;

  var mathml = null;
  try {
    var mrow = v_subexpr_chain_to_mathml(tokens, {});

Fix display of variant characters, as described in

fix_mathvariant(mrow, null); mathml = document.createElementNS(mmlns, 'math'); mathml.setAttribute("latex", latex_string); mathml.setAttribute("mathvariant", "normal"); mathml.appendChild(mrow); mathml.addEventListener("click", post_process_mathml, false); } catch(e) { GM_log("Display LaTeX failed with error " + e + " on " + latex_string); }
function patch_text(node0) 
{
  var text = node0.nodeValue;
  var results = /\$[^$]+\$|\[tex\](.+?)\[\/tex\]/.exec(text);

  if(results) {
    var latex_string = (results[1] == null ? results[0] : '$'+results[1]+'$');

    Do the -to-MathML translation

    if(mathml == null)
      return;

Split up node0 into two nodes, at the point where the first $ sign occurs

var node2 = node0.splitText(results.index);

Now delete the original markup

node2.deleteData(0, results[0].length);

Make and insert a math element in place of the deleted markup

node2.parentNode.insertBefore(mathml, node2);

There may be more than one $...$ block in a single text node, and we processed only one of them, so process the others by a recursive call.

patch_text(node2); } }

1.8.6. Supplemental processing of MathML

The -to-MathML translator can post the generated MathML content to a Web service. This feature is intended for advanced users and developers — for example, the author of this program uses it to activate a separate script on his computer that speaks the MathML to him (voice synthesis), and to collect the MathML into a database for the program test suite.

To use this feature, the user must initialize it manually by setting the Mozilla preference “greasemonkey.scriptvals.http://gold-saucer.afraid.org/mathml/greasemonkey//Display LaTeX.click-post-url” to the desired Web service URL. Then, to post any displayed MathML content in the Web browser, click on the MathML.

It would be more ideal if this functionality was implemented by calling an external program directly instead, but unfortunately, Greasemonkey user scripts do not have access to the XPCOM interfaces required for this task.

Warning: This feature, if used carelessly, may lead to security or privacy breaches. (Hence, it is not activated by default.)

function post_process_mathml(event)
{
  var url = GM_getValue('click-post-url', null);
  if(url == null)
    return;

  var ser = new XMLSerializer();
  var xhr = GM_xmlhttpRequest({ 
    method: 'POST',
    url: url,
    headers: { 'Content-Type': 'text/xml; charset=utf-8',
               'Content-Location': document.location },
    data: ser.serializeToString(event.currentTarget),
    onerror: function(details) {
      alert("There was an error processing the request. " +
            "HTTP status code " + details.status + ' ' + details.statusText);
    },
    onload: function(details) {
      window.status = "Successfully posted MathML.  Status: " 
            + details.status + ' ' + details.statusText;
    }});

  window.status = "Posting MathML to " + url + "...";
}

1.9. Standalone JavaScript implementation

Trees
Serializing XML
Tokenizing  input
Program
String.prototype.repeat = function(n) {
  return new Array(n+1).join(this);
}


function xml_escape(s)
{
  s = s.replace('&', '&amp;').
        replace('<', '&lt;').
        replace('>', '&gt;');

  return s.replace(/[\u0080-\uffff]/, function(x) { return '&#' + x.charCodeAt(0) + ';' });
}

function xml_attr_escape(s)
{
  s = s.replace('&', '&amp;').
        replace('"', '&quot;').
        replace('<', '&lt;').
        replace('>', '&gt;');

  return s.replace(/[\u0080-\uffff]/, function(x) { return '&#' + x.charCodeAt(0) + ';' });
}

function serialize_mathml(tree, indent_level)
{
  var indent_string = ' '.repeat(indent_level);

  if(tree instanceof PlainXMLNode) {
    var start_tag = '<' + tree.tag;
    if(tree.attrs != null) {
      for(var a in tree.attrs)
        start_tag += ' ' + a + '="' + xml_attr_escape(tree.attrs[a]) + '"';
    }
    var empty_tag = start_tag + ' />';
    start_tag += '>';

    var end_tag = '</' + tree.tag + '>';

    if(tree.content.length == 0) {
      print(indent_string + empty_tag);
    } else if(tree.content.length == 1 &&
              typeof(tree.content[0]) == 'string') {
      print(indent_string + start_tag + xml_escape(tree.content[0]) + end_tag);
    } else {
      print(indent_string + start_tag);
      for(var i=0; i < tree.content.length; ++i)
        serialize_mathml(tree.content[i], indent_level+1);
      print(indent_string + end_tag);
    }
  } else if(typeof(tree) == 'string') {
    print(indent_string + xml_escape(tree));
  }
}

for(var j=0; j < arguments.length; ++j) {
  var input = arguments[j];

  var tokens = new Object();
  tokens.list = tokenize_latex_math(input);
  tokens.list.push(null);
  tokens.index = 0;

  print('<!--');
  for(var i=0; i < tokens.list.length; ++i) 
    print('token ' + i + ': ' + tokens.list[i]);
  print('-->');

  var mathml = v_subexpr_chain_to_mathml(tokens, {});
  
  print('<math xmlns="http://www.w3.org/1998/Math/MathML">');
  serialize_mathml(mathml, 1);
  print('</math>');
}

1.9.3. MathML trees

PlainXMLNode
result_element
result_element_append
result_element_prepend
result_set_attr
result_append_attr
function PlainXMLNode(tag)
{
  this.tag = tag;
  this.content = [];
  this.attrs = {};
}
function result_element(tag, num_attrs)
{
  var node = new PlainXMLNode(tag);

  var k = 2;
  while(--num_attrs >= 0) {
    if(arguments[k+1] != null) {
      node.attrs[arguments[k]] = arguments[k+1];
    }
    k += 2;
  }

  for(; k < arguments.length; k++) {
    if(arguments[k] != null) {
      node.content.push(arguments[k]);
    }
  }

  return node;
}
function result_element_append(parent, child)
{
  if(parent != null && child != null) {
    parent.content.push(child);
  }
}
function result_element_prepend(parent, child, next)
{
  if(next == null)
    result_element_append(parent, child);
  else if(parent != null && child != null) {
    for(var i = 0; i < parent.content.length; i++) {
      if(parent.content[i] == next) {
        parent.content.splice(i, 0, child);
        return;
      }
    }
  }
}
function result_set_attr(elem, attr, value)
{
  if(elem != null && attr != null) {
    if(value != null)
      elem.attrs[attr] = value;
    else
      delete elem.attrs[attr];
  }
}
function result_append_attr(elem, attr, value)
{
  if(elem != null && attr != null) {
    if(elem.attrs[attr] == null)
      elem.attrs[attr] = value;
    else
      elem.attrs[attr] += value;
  }
}

1.10. Python implementation

1.10.1. Set-up

# This script was automatically generated from a literate source.  
# Do not edit this file; look at the literate source instead!
#

import LaTeX2MathMLModule
import sys
import re

class TokenInput:
  def __init__(self, input):
    self.tokens = []
    self.tokens_index = 0
    self.tokenize_latex_math(input)
    self.tokens.append(None)

Tokenize  input

for input in sys.argv[1:]:
  t = TokenInput(input)
  sys.stdout.write("<math xmlns='%s'>\n" % LaTeX2MathMLModule.mmlns)
  sys.stdout.write(LaTeX2MathMLModule.v_subexpr_chain_to_mathml(t, {}).toxml("utf-8"))
  sys.stdout.write("\n</math>\n")

1.10.2. Routines for creating MathML result elements

import xml.dom.minidom

The MathML XML namespace, needed for creating MathML elements

mmlns = 'http://www.w3.org/1998/Math/MathML' document = xml.dom.minidom.getDOMImplementation().createDocument(None,None,None) Create MathML result element Append MathML result element to another element’s content Prepend MathML result element before another child element Change attribute of result element Change attribute of result element

1.10.2.1. Create MathML result element

def result_element(tag, num_attrs, *args):
  node = document.createElementNS(mmlns, tag)

  for i in range(0, num_attrs):
    if args[2*i+1] is not None:
      node.setAttribute(args[2*i], args[2*i+1])

  for i in range(num_attrs*2, len(args)):
    if args[i] is not None:
      if isinstance(args[i], unicode):
        node.appendChild(document.createTextNode(args[i]))
      else:
        node.appendChild(args[i])

  return node

1.10.2.2. Append child to MathML result element

def result_element_append(parent, child):
  if (parent is not None) and (child is not None):
    if isinstance(child, unicode):
      parent.appendChild(document.createTextNode(child))
    else:
      parent.appendChild(child)
def result_element_prepend(parent, child, next):
  if next is None:
    result_element_append(parent, child)
  elif (parent is not None) and (child is not None):
    parent.insertBefore(child, next)

1.10.2.3. Modify attribute of result element

def result_set_attr(elem, attr, value):
  if (elem is not None) and (attr is not None):
    if value is not None:
      elem.setAttribute(attr, value)
    else:
      elem.removeAttribute(attr)
def result_append_attr(elem, attr, value):
  if (elem is not None) and (attr is not None):
    old_value = elem.getAttribute(elem, attr)
    if old_value is None:
      elem.setAttribute(attr, value)
    else:
      elem.setAttribute(attr, old_value + value)