KTex

Working Draft

Introduction

KTex is an extensible, modular markup language, designed to simplify the process of creating and publishing text across diverse domains and formats.

1 Overview

KTex is a markup language designed to streamline the preparation of publications. On its own, KTex’s functionality is minimal, but its modular design allows for extensive customization and a broad range of capabilities.

1.1 Design Principles

  1. Tiny Core. The KTex specification is small. As a markup language, KTex focuses on its core purpose and does not attempt to serve as a general-purpose programming language.
  2. Extensible (via modules). The compact core of KTex can be transformed into a powerful tool with the help of modules. These modules, written in a supported programming language, can be distributed and shared just like any other software.
  3. CI/CD friendly. KTex is designed for integration into CI/CD pipelines, thereby supporting a broad spectrum of use cases, from straightforward blog posts to intricate publication processes.

1.2 Use Cases

TODO: Extend this section.
  1. Libraries
  2. Blogs
  3. Academic Publications
  4. Technical Documentation
  5. News

1.3 Problems

TODO: Move this section up and describe the problems that KTex solves compared to other software.
  1. Multiple formats
  2. Spell checking
  3. Writing style
  4. Math typesetting

2 Language

The KTex markup language is used to prepare documents. A document is built from a sequence of blocks, which are in turn built using operators.

The source text of a KTex document must be a sequence of SourceCharacter. That character sequence must be described by a sequence of Token and Ignored lexical grammars. The lexical token sequence, omitting Ignored, must be described by a single Document syntactic grammar.

Lexical Analysis and Syntax Parse

The source text of a KTex document is first converted into a sequence of lexical tokens, Token, and ignored tokens, Ignored. This sequence of lexical tokens is then scanned from left to right to produce an abstract syntax tree (AST) according to the Document syntactic grammar.

In this document, lexical grammar productions use lookahead restrictions to remove ambiguity and guarantee a single valid lexical analysis. In addition, which lexical productions apply depends on the value of two flags:

  • isLineStart: a Boolean flag set to true at the beginning of a new line. It remains true until a non-whitespace character is encountered.
  • isOperator: a Boolean flag set to true only within an “operator context” (explained later).
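The behaviour of isLineStart can be sketched as follows. This is a minimal illustration, not part of the specification: the function name `line_start_flags` is hypothetical, and the sketch assumes that the flag is true at the very start of the source text and that each reported value is the one in effect just before the character is consumed. The isOperator flag is left out because the operator context is not defined at this point.

```python
def line_start_flags(source: str) -> list[bool]:
    """Return the value of isLineStart seen by each character of the source."""
    flags = []
    is_line_start = True  # assumption: the source text begins at a line start
    for ch in source:
        flags.append(is_line_start)
        if ch in "\n\r":
            # A line terminator begins a new line, so the flag is set again.
            is_line_start = True
        elif ch not in " \t":
            # The first non-whitespace character on the line clears the flag.
            is_line_start = False
    return flags


if __name__ == "__main__":
    for ch, flag in zip("  a b\n c", line_start_flags("  a b\n c")):
        print(repr(ch), flag)
```

Under these assumptions, leading spaces and tabs leave the flag untouched, which is what lets a line-initial marker be recognized after indentation.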

2.1 Source Text

SourceCharacter
Any Unicode character, encoded as UTF-8

KTex documents are derived from a source text, which is a sequence of SourceCharacter. Each SourceCharacter is a Unicode character encoded as UTF-8, informally referred to as a “character” throughout this specification.

2.1.1 Ignored Tokens

UnicodeBOM
Byte Order Mark (0xFEFF)
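As an illustration of how an implementation might obtain SourceCharacters and treat UnicodeBOM as ignored, the sketch below decodes raw bytes as UTF-8 and drops a leading Byte Order Mark. The helper name `read_source` is hypothetical, and handling the mark only at the start of the input is an assumption; this draft does not state where UnicodeBOM may appear.

```python
BOM = "\ufeff"  # the Byte Order Mark code point, 0xFEFF


def read_source(raw: bytes) -> str:
    """Decode UTF-8 source bytes and drop a leading Byte Order Mark."""
    # bytes.decode raises UnicodeDecodeError if the input is not valid UTF-8.
    text = raw.decode("utf-8")
    # Assumption: only a BOM at the very start of the source is ignored.
    if text.startswith(BOM):
        text = text[len(BOM):]
    return text


if __name__ == "__main__":
    print(read_source(b"\xef\xbb\xbfKTex"))
```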

2.1.2 Comment

In a KTex source document, a line is a comment when its first non-whitespace character is the % marker. Whitespace characters before the % marker have no effect on the comment. If any non-whitespace character precedes the % marker on the line, the marker and the remainder of the line are treated as regular text.
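The comment rule can be expressed as a short check on a single line. The function name `is_comment_line` is hypothetical, and the sketch treats only the two WhiteSpaceChar characters (horizontal tab and space) as leading whitespace; escaping of the % marker is not covered.

```python
def is_comment_line(line: str) -> bool:
    """Return True when the first non-whitespace character of the line is '%'."""
    # Strip leading spaces and tabs only, then test for the comment marker.
    return line.lstrip(" \t").startswith("%")


if __name__ == "__main__":
    samples = [
        "% a comment",
        "\t  % an indented comment",
        "50% of readers % regular text",
    ]
    for sample in samples:
        print(repr(sample), is_comment_line(sample))
```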

2.1.3 Lexical Tokens

Tokens are later used as terminal symbols in KTex syntactic grammars.

2.1.4 White Space

WhiteSpaceChar
Horizontal Tab (0x09)
Space (0x20)

White space separates Words and Tokens and improves the readability of the source text. White space characters generally do not contribute to the semantic meaning of a Document. They may, however, appear within a String or Comment token, and they are significant within “preserving” blocks.

Note WhiteSpace consists of multiple white space characters.

2.1.5 Line Terminator

NewLineChar
New Line (0x0A)
CarriageReturnChar
Carriage Return (0x0D)

Line terminators divide text into lines. Two or more consecutive line terminators act as separators between blocks. A single line terminator is insignificant for “floating” blocks, but significant for “line-preserving” and “preserving” blocks.
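The block separation rule can be sketched as a split on runs of two or more line terminators. The function name `split_blocks` is hypothetical. The sketch assumes that a Carriage Return followed by a New Line counts as a single line terminator, and that a lone Carriage Return is also a line terminator; neither point is settled by the text above.

```python
import re


def split_blocks(source: str) -> list[str]:
    """Split source text into blocks at two or more consecutive line terminators."""
    # Assumption: CR LF and a lone CR each count as one line terminator.
    normalized = source.replace("\r\n", "\n").replace("\r", "\n")
    parts = re.split(r"\n{2,}", normalized)
    # Drop pieces that contain nothing but a stray line terminator.
    return [part for part in parts if part.strip("\n") != ""]


if __name__ == "__main__":
    text = "first line\nsame block\r\n\r\nsecond block\n\n\nthird"
    for block in split_blocks(text):
        print(repr(block))
```

A single line terminator stays inside its block, so later stages can decide whether it is significant for the kind of block in question.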

2.1.6 Punctuator

When isOperator:

When !isOperator:

Punctuator
{ } [ ]

2.1.7 Operator Name

When isOperator:

XXX
do we need last char?

2.1.8 Punctuation

When !isOperator:

PunctuationChar
? ! , : ; ( )
" « »
'

2.1.9 Words

When !isOperator:

WordCharacter
Any character from the alphabet

The alphabet is defined by the implementation. The default alphabet is the set of all Latin letters and Arabic numerals.

Note An apostrophe should be escaped to avoid ambiguity with the single quote.
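A configurable alphabet can be modelled as a set of characters that the lexer consults when it classifies a WordCharacter. The names `DEFAULT_ALPHABET` and `is_word_character` are hypothetical, and reading “all Latin letters” as the ASCII letters a–z and A–Z is an assumption; an implementation may well include accented letters.

```python
import string

# Assumption: the default alphabet is the ASCII Latin letters plus the digits 0-9.
DEFAULT_ALPHABET = frozenset(string.ascii_letters + string.digits)


def is_word_character(ch: str, alphabet: frozenset = DEFAULT_ALPHABET) -> bool:
    """Return True when the character belongs to the configured alphabet."""
    return ch in alphabet


if __name__ == "__main__":
    # An implementation-defined alphabet extended with a few Cyrillic letters.
    extended = DEFAULT_ALPHABET | frozenset("абв")
    print(is_word_character("k"), is_word_character("б"), is_word_character("б", extended))
```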

2.1.10 Name

When isOperator:

Letter
abcdefghijklm
nopqrstuvwxyz
ABCDEFGHIJKLM
NOPQRSTUVWXYZ
Digit
0123456789

2.1.11 Int Value

When isOperator:

Sign
- +

2.1.12 Float Value

When isOperator:

2.1.13 String Value

When isOperator:

2.1.14 Boolean Value

When isOperator:

BooleanValue
true false

2.1.15 Null Value

When isOperator:

NullValue
null

2.2 Operators

2.3 Document

Document
Blocklist

§Index

  1. Apostrophe
  2. BooleanValue
  3. BracketChar
  4. CarriageReturnChar
  5. Comment
  6. CommentCharacter
  7. Dash
  8. Digit
  9. Document
  10. Ellipsis
  11. EscapedBracket
  12. EscapedCharacter
  13. ExponentIndicator
  14. ExponentPart
  15. FloatValue
  16. FractionalPart
  17. Hyphen
  18. HyphenChar
  19. Ignored
  20. IntegerPart
  21. IntValue
  22. Letter
  23. LineTerminator
  24. Name
  25. NameContinue
  26. NameStart
  27. NewLineChar
  28. NonZeroDigit
  29. NullValue
  30. OperatorName
  31. Period
  32. PeriodChar
  33. Punctuation
  34. PunctuationChar
  35. Punctuator
  36. Sign
  37. SourceCharacter
  38. StringChar
  39. StringValue
  40. Token
  41. UnicodeBOM
  42. WhiteSpace
  43. WhiteSpaceChar
  44. Word
  45. WordBody
  46. WordCharacter
  47. WordStart
  48. XXX