KTex

Working Draft

Introduction

KTex is an extensible, modular markup language, designed to simplify the process of creating and publishing text across diverse domains and formats.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

1 Overview

KTex is a markup language designed to streamline the preparation of publications. On its own, KTex’s functionality is minimal, but its modular design allows for extensive customization and a broad range of capabilities.

1.1 Design Principles

Tiny Core. The KTex specification is small. As a markup language, KTex focuses on its core purpose and does not attempt to serve as a general-purpose programming language.
Extensible (via modules). The compact core of KTex can be transformed into a powerful tool with the help of modules. These modules, written in a supported programming language, can be distributed and shared just like any other software.
CI/CD friendly. KTex is designed for integration into CI/CD pipelines, thereby supporting a broad spectrum of use cases, from straightforward blog posts to intricate publication processes.

1.2 Use Cases

extend this section

Libraries
Blogs
Academic Publications
Technical Documentation
News

1.3 Problems

move this up and describe the problems that KTex solves compared to other software

Multiple formats
Spell checking
Writing style
Math typesetting

2 Language

The KTex markup language is used to prepare documents. A document is built from a sequence of blocks, which are in turn built using operators.

The source text of a KTex document must be a sequence of SourceCharacters. This character sequence must be described by a sequence of Token and Ignored lexical grammars. The lexical token sequence, omitting Ignored, must be described by a single Document syntactic grammar.

Lexical Analysis and Syntax Parse

The source text of a KTex document is first converted into a sequence of lexical tokens, Token, and ignored tokens, Ignored. This sequence of lexical tokens is then scanned from left to right to produce an abstract syntax tree (AST) according to the Document syntactic grammar.

In this document, lexical grammar productions employ lookahead restrictions to eliminate ambiguity and guarantee a single valid lexical analysis. In addition, lexical grammar productions depend on the values of two flags:

isLineStart: a Boolean flag set to true at the beginning of a new line. It remains true until a non-whitespace character is encountered.
isOperator: a Boolean flag set to true only within an “operator context” (explained later).
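
The behaviour of isLineStart can be illustrated with a small sketch. This is not part of the specification: the function name line_start_flags is hypothetical, and the sketch assumes that the start of the document counts as the start of a line, that white space means Horizontal Tab and Space, and that any New Line or Carriage Return character resets the flag.

```python
# Illustrative sketch only; names and reset rules are assumptions, not normative.
WHITESPACE = {"\t", " "}   # Horizontal Tab (0x09), Space (0x20)
LINE_END = {"\n", "\r"}    # New Line (0x0A), Carriage Return (0x0D)


def line_start_flags(source: str) -> list[bool]:
    """Return the value of isLineStart as seen just before each character."""
    flags = []
    is_line_start = True  # assumption: a document begins at the start of a line
    for ch in source:
        flags.append(is_line_start)
        if ch in LINE_END:
            # A line terminator begins a new line, so the flag is set again.
            is_line_start = True
        elif ch not in WHITESPACE:
            # The first non-whitespace character clears the flag.
            is_line_start = False
    return flags


if __name__ == "__main__":
    for ch, flag in zip(" a\n%", line_start_flags(" a\n%")):
        print(repr(ch), flag)
```

In this sketch the flag is still true when the % on the second line is reached, which is the condition under which a comment may begin.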

2.1 Source Text

SourceCharacter

Any UTF-8 character

KTex documents are derived from a source text, composed of a sequence of SourceCharacters. Each SourceCharacter is a Unicode character encoded as UTF-8, and is informally referred to as a “character” throughout this specification.

2.1.1 Ignored Tokens

Byte Order Mark (0xFEFF)

2.1.2 Comment

Comment

[if isLineStart] % CommentCharacter(list, opt) [lookahead ≠ CommentCharacter]

CommentCharacter

SourceCharacter but not LineTerminator

In KTex source documents, a line becomes a comment when a % marker is the first non-whitespace character on the line. Leading white space before the % marker has no effect on the comment. If any non-whitespace character precedes the % marker, the marker and the remainder of the line are treated as regular text.
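
The rule above can be sketched as a predicate over a single line. The helper name is_comment_line is hypothetical and not part of KTex; the sketch assumes the line has already been separated from its neighbours and that white space means Horizontal Tab and Space only.

```python
# Illustrative sketch only; is_comment_line is a hypothetical helper.
def is_comment_line(line: str) -> bool:
    """True if the first non-whitespace character on the line is '%'."""
    # Strip only Space and Horizontal Tab, the two white space characters
    # assumed here, then check for the comment marker.
    return line.lstrip(" \t").startswith("%")


if __name__ == "__main__":
    for sample in ["% a comment", "  \t% indented comment", "50% of readers"]:
        print(repr(sample), is_comment_line(sample))
```

The third sample is regular text because non-whitespace characters precede the % marker.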

2.1.3 Lexical Tokens

Tokens are later used as terminal symbols in KTex syntactic grammars.

2.1.4 White Space

WhiteSpace

WhiteSpaceChar(list) [lookahead ≠ WhiteSpaceChar]

WhiteSpaceChar

Horizontal Tab (0x09)

Space (0x20)

White space is utilized to separate Words and Tokens, enhancing the readability of the source text. While white space characters generally do not contribute to the semantic meaning of a Document, they can appear within a String or Comment token, and are significant within “preserving” blocks.

Note: A WhiteSpace token consists of one or more consecutive white space characters.
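
As a rough illustration, a WhiteSpace token can be matched as a maximal run of tabs and spaces. The regular expression and the helper whitespace_runs below are assumptions made for this sketch, not definitions from the specification.

```python
# Illustrative sketch only; whitespace_runs is a hypothetical helper.
import re

# One or more Horizontal Tab (0x09) or Space (0x20) characters. The greedy
# quantifier makes each match maximal, so no match is followed by more white space.
WHITESPACE = re.compile(r"[\t ]+")


def whitespace_runs(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every maximal white space run."""
    return [match.span() for match in WHITESPACE.finditer(text)]


if __name__ == "__main__":
    print(whitespace_runs("a \t b  c"))
```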

2.1.5 Line Terminator

LineTerminator

NewLineChar

CarriageReturnChar NewLineChar

NewLineChar

New Line (0x0A)

CarriageReturnChar

Carriage Return (0x0D)

Line terminators divide text into lines. Two or more consecutive line terminators act as separators between blocks. A single line terminator is insignificant for “floating” blocks, but significant for “line-preserving” and “preserving” blocks.
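
The block-separation rule can be sketched with a regular expression. The helper split_blocks is hypothetical, and the sketch assumes that a line containing only white space does not count as empty, a case the text above does not address.

```python
# Illustrative sketch only; split_blocks is a hypothetical helper.
import re

# A LineTerminator is either CR LF or a lone LF, per the grammar above.
LINE_TERMINATOR = r"(?:\r\n|\n)"
# Two or more consecutive line terminators separate blocks.
BLOCK_SEPARATOR = re.compile(LINE_TERMINATOR + r"{2,}")


def split_blocks(source: str) -> list[str]:
    """Split source text into blocks, keeping single line terminators intact."""
    return [block for block in BLOCK_SEPARATOR.split(source) if block]


if __name__ == "__main__":
    print(split_blocks("first line\nsecond line\n\nnext block\r\n\r\n\r\nlast block"))
```

Single line terminators inside a block are left untouched here, since whether they matter depends on the kind of block (“floating”, “line-preserving”, or “preserving”).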