With the changes in the current `markup-redo` branch, I'd like to pause before integrating (mainly because, as I understand it, orgmode.nvim uses the main branch) to discuss the work here and compile a 1.0 checklist.
Changes to markup
To start, thoughts on changes in the `markup-redo` branch. The first big one is that all markup parsing is completely removed, and in place of processing text as hidden nodes, whitespace-delimited words are now parsed as `(expr)` nodes. An expression is parsed by first looking for ASCII symbols as hidden nodes, then letters as `"str"`, then numbers as `"num"`, and finally any remaining symbols as `"sym"`.
The reasoning for this setup is, first, that anything parsed in the parser is unchangeable, so markup was difficult to customize: if a user wanted to add another link style (such as markdown-style links), they were unable to do so. Now they just need to write queries, and no modification to the parser is required.
For example, `a (*b1 cdef*)` is parsed to `(expr) (expr) (expr)` and, when looking at hidden nodes, `(expr "str") (expr "(" "*" "str" "num") (expr "str" "*" ")")`, which we can query for in a variety of ways. Unfortunately, a major caveat is that tree-sitter queries capture exactly one node (see tree-sitter/tree-sitter#1508), so capturing alone (without writing a directive or callback of some sort) will not be able to highlight markup as is. Additionally, we need to account for nested expressions.
The other major caveat is that anonymous nodes cannot be anchored (tree-sitter/tree-sitter#1461), so we can't make the queries as precise as I would like right now, but single-node queries can just use the `#match?` predicate, so I'm not worried about that.
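To make the word splitting described above concrete, here is a rough Python sketch of how a whitespace-delimited word breaks into child nodes. It is not the grammar itself: `classify`, `tokenize_expr`, and `tokenize` are hypothetical names, and treating `string.punctuation` as the "ASCII symbols" and `str.isalpha`/`str.isdigit` as letters/numbers is an assumption about the character classes.

```python
import string


def classify(ch):
    """Return the node kind for one character (assumed character classes)."""
    if ch in string.punctuation:
        return ch  # each ASCII symbol becomes its own node, e.g. "*"
    if ch.isalpha():
        return "str"
    if ch.isdigit():
        return "num"
    return "sym"  # anything left over, e.g. non-ASCII symbols


def tokenize_expr(word):
    """Split one whitespace-delimited word into the children of an (expr)."""
    tokens = []
    for ch in word:
        kind = classify(ch)
        is_run = kind in ("str", "num", "sym")
        if is_run and tokens and tokens[-1] == kind:
            continue  # still inside the current run of letters/digits/symbols
        tokens.append(kind)
    return tokens


def tokenize(text):
    """One token list per (expr), i.e. per whitespace-delimited word."""
    return [tokenize_expr(word) for word in text.split()]


print(tokenize("a (*b1 cdef*)"))
```

Running this on `a (*b1 cdef*)` gives three token lists that line up with the three `(expr)` nodes in the example above.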
The positive side of this is that because the markup characters will be hidden nodes, they are easily queried for and whatever algorithm is applied will generally only need to look at very few nodes. Any language should be able to do that very quickly. Queries to find possible pairs are in the branch in the `markup.scm` file, and I'm working the kinks out of a Lua implementation of this pseudocode:
```python
def markup(nodes, is_start, validate):
    # Assumes a sorted list of nodes, each with .type, .start and .end.
    # All nodes should be from a single (paragraph), (itemtext), or (item).
    # is_start(node): True if the node was captured as an opening marker.
    # validate(node, stnode): check the pre/post markup or whatever else,
    # can be modified as desired.
    seeking = []  # openers still waiting for a closer, oldest first
    found = []
    for node in nodes:
        open_types = [n.type for n in seeking]
        if is_start(node):
            seeking.append(node)
        elif node.type in open_types:
            ix = open_types.index(node.type)
            stnode = seeking[ix]
            if validate(node, stnode):
                # Stop seeking the matched opener and everything after it:
                # in `*a /b*` the '/' is dropped once the '*' after b matches.
                seeking = seeking[:ix]
                # TODO: if we complete a verbatim-style markup, we need to
                # purge everything interior of it.
                found.append({'type': node.type, 'range': (stnode.start, node.end)})
    return found
```
The details of this algorithm will change depending on exactly which node is captured (the symbol vs. `(expr)`) and whether or not we're using the `match` predicate in advance. My example does not do that, but after writing my Lua example I'm thinking it makes sense to do so.
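The `validate` step in the pseudocode is left open. As one possible shape for it, here is a minimal sketch loosely modelled on Org's default emphasis rules (what may precede the opening marker and what may follow the closing one). `valid_pair`, `PRE`, and `POST` are hypothetical names, the function works on plain text offsets rather than tree-sitter nodes, and the exact character sets are an assumption meant to be swapped out.

```python
# Characters allowed right before an opening marker and right after a
# closing marker. Assumed sets, roughly following Org's defaults.
PRE = set(" \t\n-('\"{")
POST = set(" \t\n-.,;:!?')}[\"")


def valid_pair(text, start, end):
    """Check a candidate pair of markers at text[start] and text[end]."""
    if end - start < 2:
        return False  # nothing between the two markers
    pre_ok = start == 0 or text[start - 1] in PRE
    post_ok = end == len(text) - 1 or text[end + 1] in POST
    # The marked-up content must not begin or end with whitespace.
    border_ok = not text[start + 1].isspace() and not text[end - 1].isspace()
    return pre_ok and post_ok and border_ok


print(valid_pair("a *bold* b", 2, 7))  # markers around "bold"
print(valid_pair("a*b*", 1, 3))        # opener glued to the preceding letter
```

Keeping the check as a separate function means the pairing loop never has to change when someone wants stricter or looser rules.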
Note how easily the markup queries can be constructed by some generative code, which makes customization straightforward.
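As a hedged illustration of that kind of generative code, the sketch below builds one query line per marker from a small table. The query shape `(expr "<symbol>" @<name>)` is invented for illustration and is not the contents of `markup.scm`; `MARKERS`, `marker_query`, and `build_queries` are hypothetical names, and the marker table simply lists Org's usual emphasis characters.

```python
# Assumed marker table: capture name -> marker symbol.
MARKERS = {
    "bold": "*",
    "italic": "/",
    "underline": "_",
    "strikethrough": "+",
    "code": "~",
    "verbatim": "=",
}


def marker_query(name, symbol):
    """Build one query that captures a marker symbol inside an (expr)."""
    return f'(expr "{symbol}" @{name})'


def build_queries(markers=MARKERS):
    """Join the per-marker queries into one query file body."""
    return "\n".join(marker_query(name, sym) for name, sym in markers.items())


print(build_queries())
```

A user who wants an extra marker would only add an entry to the table and regenerate the queries, with no change to the parser.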
Lastly, queries/algorithms are still needed for LaTeX fragments, subscript and superscript, and bracketed expressions for the *scripts.
Regex patterns
I still need to work through the list of changes I made, but one change I've made (and am still making) to the markup branch is replacing specific patterns as often as possible with some variant of `(expr)`. I think using queries to determine whether a proper plan name is used, for example, is much cleaner than throwing a parser error. This way it's also easier to change languages. I think there are limits though, and it might be nice to simply support different languages in the parser directly (for example, `END`, `PROPERTIES`, etc.). That could be read from a file, or simply left for others to re-compile to their own language.
Additionally, queries drive highlighting, so allowing more things to be specified as queries can be pretty cool. For example, I might only ever use `TITLE` and `FILETAGS` directives, so I could consider highlighting those nicely and any other patterns as an error.
A good example here is timestamp contents. Right now, I've hard-coded the possible regular expressions. Should that just be queryable? That could be nice if people want different formats. On the other hand, having fields and nodes directly in the tree is really nice, and I think in timestamps there are few enough formats that we can just support all of them in the parser.
Fields and nodes/aliases
When writing the parser I didn't add a lot of fields because it was constantly in so much flux. But they're useful even if they link directly to a named node with the same name, because access via a name can be a lot more convenient than named nodes, even if there is a small number of nodes.
I've added a large number of `(name)` nodes, and a few others, and many fields. I'll try to compile a list later today. When thinking about where I've added nodes, one of the things I was thinking about was incremental selection via nodes; I just want it to make sense. For example, tag -> taglist -> item -> headline -> section.
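To show what "making sense" for incremental selection means, here is a tiny sketch that walks from a node up through its parents, which is the order an editor would grow the selection in. `Node` and `selection_chain` are hypothetical stand-ins for a real syntax tree, not the tree-sitter API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Minimal stand-in for a syntax node: a type and a parent link."""
    type: str
    parent: Optional["Node"] = None


def selection_chain(node):
    """List node types from the starting node outward to the root."""
    chain = []
    while node is not None:
        chain.append(node.type)
        node = node.parent
    return chain


section = Node("section")
headline = Node("headline", section)
item = Node("item", headline)
taglist = Node("taglist", item)
tag = Node("tag", taglist)
print(" -> ".join(selection_chain(tag)))
```

Each extra named node adds one step to this chain, so the question for every new node is whether that step is a selection someone would actually want.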
Versioning
These changes were a lot more than I'd like to do in a single commit or merge in the future, but since I was writing queries as I went, I kept finding small changes that make using the parser a lot more straightforward. I don't work on coding projects that are public really, so I don't think much about this, so this is kind of a "Yeah, why didn't you do that sooner?" section.
So AFTER 1.0.0, I'll be using consistent conventional commits and an auto-updating semantic versioning system based on them. With `major.minor.patch` versioning: `fix` changes increment `patch`, `feat` changes increment `minor`, and any breaking change (appending a `!` after the type/scope) will increment `major`. I just want to make sure anything using the parser has a tag that it can link to, so changes can be made to `main` without breaking dependent projects.
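The bump rule above can be sketched in a few lines. This is only an illustration of the rule, not the tooling that will actually be used; `HEADER`, `bump_for`, and `next_version` are hypothetical names, and treating a `BREAKING CHANGE:` footer as a major bump follows the Conventional Commits convention rather than anything stated here.

```python
import re

# type, optional (scope), optional "!", then ": " as in "feat(markup)!: ..."
HEADER = re.compile(r"^(?P<type>\w+)(?:\([^)]*\))?(?P<breaking>!)?: ")
LEVELS = ["patch", "minor", "major"]  # ordered from smallest to largest bump


def bump_for(message):
    """Return 'major', 'minor', 'patch', or None for one commit message."""
    match = HEADER.match(message)
    if match is None:
        return None
    if match.group("breaking") or "BREAKING CHANGE:" in message:
        return "major"
    return {"feat": "minor", "fix": "patch"}.get(match.group("type"))


def next_version(version, messages):
    """Apply the largest bump found in the messages to major.minor.patch."""
    major, minor, patch = (int(part) for part in version.split("."))
    bumps = [b for b in map(bump_for, messages) if b is not None]
    if not bumps:
        return version
    level = max(bumps, key=LEVELS.index)
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("1.0.0", ["fix: headline test", "feat: add plan entries"]))
```

Only the largest bump in a batch of commits is applied, so a release containing both a `fix` and a `feat` moves the minor version once and resets patch.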
Specific questions
- Are there any missing/hidden aliases or fields that would be useful?
- Are the names of aliases and fields sensible?
- Should newlines be part of `(body)`? (Newlines before body in `(section)` and in `(document)`.)
  So, should an empty section have a body, basically, or should a body exist only if there is an element?
- If we have text that is a paragraph followed by a footnote after a new line, is that parsed in Emacs as a footnote reference or a definition?
- Should `(_element)` be available to `listitem`s and in drawer contents, or should those just be `expr`essions? I'm under the impression that whether or not elements are allowed in lists and drawers is a customizable option in orgmode, and I prefer to keep the parser as simple as possible. So it makes sense to me to inject an org parser in the `itemtext` if a user wants it.
- Should `(taglist)` be a named node?
1.0.0 checklist
I want to expand upon this list (some items should be multiple), but really quickly:
- [x] Revisit all tests - A lot are out of date
- ~~[ ] Write tests for queries~~
- [x] Fix some regexes to better query for: :block: is the whole name, the goal would be ':', (name), ':'. (easier highlighting ':'s)
- ~~[ ] Cleanup table precedences. Yikes. (I don't care anymore)~~
- ~~[ ] Add semantic versioning git hook script~~
- [x] Revisit the readme build instructions
- [x] Revisit the npm dependencies (could really use some help here, no idea what I'm doing there)
- [x] Add plan entries
- [x] Add newlines to `(contents)` and `(body)`
- [x] Fix failing headline test
Thoughts? Anything else I'm missing?