KTex
Working Draft
Introduction
KTex is an extensible, modular markup language, designed to simplify the process of creating and publishing text across diverse domains and formats.
Copyright
Copyright 2023-present Dimitri Kurashvili
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
1 Overview
KTex is a markup language designed to streamline the preparation of publications. On its own, KTex’s functionality is minimal, but its modular design allows for extensive customization and a broad range of capabilities.
1.1 Design Principles
- Tiny Core. The KTex specification is small. As a markup language, KTex focuses on its core purpose and does not attempt to serve as a general-purpose programming language.
- Extensible (via modules). The compact core of KTex can be transformed into a powerful tool with the help of modules. These modules, written in a supported programming language, can be distributed and shared just like any other software.
- CI/CD friendly. KTex is designed for integration into CI/CD pipelines, thereby supporting a broad spectrum of use cases, from straightforward blog posts to intricate publication processes.
1.2 Use Cases
- Libraries
- Blogs
- Academic Publications
- Technical Documentation
- News
1.3 Problems
- Multiple formats
- Spell checking
- Writing style
- Math typesetting
2 Language
The KTex markup language is used to prepare documents. A document is built from a sequence of blocks, which are built using operators.
The source text of a KTex document must be a sequence of SourceCharacters. This character sequence must be described by a sequence of Token and Ignored lexical grammars. The lexical token sequence, omitting Ignored, must be described by a single Document syntactic grammar.
Lexical Analysis and Syntax Parse
The source text of a KTex document is first converted into a sequence of lexical tokens, Token, and ignored tokens, Ignored. This sequence of lexical tokens is then scanned from left to right to produce an abstract syntax tree (AST) according to the Document syntactic grammar.
In this document, lexical grammar productions employ lookahead restrictions to eliminate ambiguity and guarantee a singular, valid lexical analysis. Moreover, the production of lexical grammar is influenced by the value of these two flags:
- isLineStart: a Boolean flag set to true at the beginning of a new line. This property remains active until a non-whitespace character is encountered.
- isOperator: a Boolean flag set to true only within an “operator context” (explained later).
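The behavior of isLineStart can be illustrated with a minimal Python sketch. It is not part of the specification: the function name line_start_flags is hypothetical, and the sketch assumes that the flag is already true at the start of the source text and that both the newline and carriage return characters begin a new line.

```python
def line_start_flags(source: str) -> list[bool]:
    """Return the value of isLineStart seen by each character of `source`."""
    flags = []
    is_line_start = True  # assumption: the source text itself begins a new line
    for ch in source:
        flags.append(is_line_start)
        if ch in "\n\r":
            # assumption: both characters act as line terminators
            is_line_start = True
        elif not ch.isspace():
            # the first non-whitespace character ends the line-start state
            is_line_start = False
    return flags


# In "a b\n  c", the indentation and the "c" on the second line still see isLineStart.
print(line_start_flags("a b\n  c"))
```

Under these assumptions, the first non-whitespace character of a line is still read with the flag set, which is what lets a lexer recognize a marker that is only meaningful at the start of a line.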
2.1 Source Text
KTex documents are derived from a source text, composed of a sequence of SourceCharacters. Each SourceCharacter represents a UTF-8 character, which is informally referred to as a “character” throughout this specification.
2.1.1 Ignored Tokens
2.1.2 Comment
In KTex source documents, a line is turned into a comment by placing a % marker at the beginning of the line. Leading whitespace characters before the % marker do not affect the comment. However, if the % marker is preceded by any non-whitespace character, the % marker and the remainder of the line are treated as regular text.
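The comment rule above can be sketched as a one-line predicate in Python. This is an illustration only: the name is_comment_line is hypothetical, and Python's own notion of whitespace (str.lstrip) stands in for the WhiteSpace production, whose exact character set is not given here.

```python
def is_comment_line(line: str) -> bool:
    """True if the first non-whitespace character of `line` is the % marker."""
    # assumption: Python's whitespace set approximates KTex WhiteSpace
    return line.lstrip().startswith("%")


for sample in ["% a comment", "   % indented comment", "50% of readers", ""]:
    print(repr(sample), is_comment_line(sample))
```

The third sample is regular text because the % marker is preceded by non-whitespace characters.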
2.1.3 Lexical Tokens
Tokens are later used as terminal symbols in KTex syntactic grammars.
2.1.4 White Space
White space separates Words and Tokens and improves the readability of the source text. While white space characters generally do not contribute to the semantic meaning of a Document, they can appear within a String or Comment token, and are significant within “preserving” blocks.
2.1.5 Line Terminator
Line terminators divide text into lines. Two or more consecutive line terminators act as separators between blocks. A single line terminator is insignificant for “floating” blocks, but significant for “line-preserving” and “preserving” blocks.
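The block-separation rule can be sketched in Python as follows. The function name split_blocks is hypothetical, and the sketch makes two assumptions that this section does not state: a carriage return followed by a newline counts as a single line terminator, and a line containing only white space does not count as an empty line.

```python
import re


def split_blocks(source: str) -> list[str]:
    """Split source text into blocks at runs of two or more line terminators."""
    # assumption: "\r\n" and a lone "\r" each count as one line terminator
    normalized = source.replace("\r\n", "\n").replace("\r", "\n")
    parts = re.split(r"\n{2,}", normalized.strip("\n"))
    # single line terminators stay inside a block; their meaning depends on the block kind
    return [part for part in parts if part]


print(split_blocks("first line\nsecond line\n\n\nnext block\r\n\r\nlast"))
```

The single line terminator between "first line" and "second line" is kept inside the first block, so a later stage can ignore it for a “floating” block or honor it for a “line-preserving” or “preserving” block.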
2.1.6 Punctuator
When isOperator:
= | ] |
When !isOperator:
{ | } | [ | ] |
2.1.7 Operator Name
When isOperator:
2.1.8 Punctuation
When !isOperator:
? | ! | , | : | ; | ( | ) |
" | “ | ” | „ | « | » | |
' | ‘ | ’ | ‚ | ‹ | › | |
– | — | … |
[ | ] | { | } |
2.1.9 Words
When !isOperator:
The alphabet used to form words is implementation-defined. The default alphabet is the set of all Latin letters and Arabic numerals.
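A configurable alphabet might look like the following Python sketch. All names here (DEFAULT_ALPHABET, is_word) are hypothetical. The sketch assumes that “Latin letters” means the 52 ASCII letters, and it ignores any distinction between the first and later characters of a word, since the Word production is not shown in this section.

```python
import string

# assumption: the default alphabet is ASCII letters plus the digits 0-9
DEFAULT_ALPHABET = frozenset(string.ascii_letters + string.digits)


def is_word(text: str, alphabet: frozenset[str] = DEFAULT_ALPHABET) -> bool:
    """True if `text` is non-empty and uses only characters from `alphabet`."""
    return bool(text) and all(ch in alphabet for ch in text)


# An implementation could extend the alphabet, for example with accented letters.
extended = DEFAULT_ALPHABET | {"ï"}
print(is_word("KTex2023"), is_word("naïve"), is_word("naïve", extended))
```

Passing the alphabet as a parameter keeps the choice in the hands of the implementation, as the text above requires.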
2.1.10 Name
When isOperator:
a | b | c | d | e | f | g | h | i | j | k | l | m |
n | o | p | q | r | s | t | u | v | w | x | y | z |
A | B | C | D | E | F | G | H | I | J | K | L | M |
N | O | P | Q | R | S | T | U | V | W | X | Y | Z |
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
2.1.11 Int Value
When isOperator:
- | + |
2.1.12 Float Value
When isOperator:
2.1.13 String Value
When isOperator:
2.1.14 Boolean Value
When isOperator:
true | false |
2.1.15 Null Value
When isOperator:
2.2 Operators
2.3 Document
§Index
- Apostrophe
- BooleanValue
- BracketChar
- CarriageReturnChar
- Comment
- CommentCharacter
- Dash
- Digit
- Document
- Ellipsis
- EscapedBracket
- EscapedCharacter
- ExponentIndicator
- ExponentPart
- FloatValue
- FractionalPart
- Hyphen
- HyphenChar
- Ignored
- IntegerPart
- IntValue
- Letter
- LineTerminator
- Name
- NameContinue
- NameStart
- NewLineChar
- NonZeroDigit
- NullValue
- OperatorName
- Period
- PeriodChar
- Punctuation
- PunctuationChar
- Punctuator
- Sign
- SourceCharacter
- StringChar
- StringValue
- Token
- UnicodeBOM
- WhiteSpace
- WhiteSpaceChar
- Word
- WordBody
- WordCharacter
- WordStart
- XXX