Creating a mixed parser with SQL inside python strings

yogevyuval · February 16, 2024, 6:44pm

Hi,
I’m trying to follow the CodeMirror Mixed-Language Parsing example to create mixed Python and SQL language support. My end goal is to use this in a Jupyter notebook in the following way:

some_query_str = "SELECT * FROM table where a > 1"

Here the line is Python and should be highlighted as Python, but the content inside the string quotes is an SQL query that should be highlighted as SQL.

Could you provide any guidance?

I took the HTML + JavaScript example, which seems very similar to what I need, but I ran into an issue:

When I switch from JavaScript inside HTML to JavaScript inside Python, the JavaScript highlighting stops working and the code is treated as a plain Python string. (I changed the check to node.name == "String".) For example:

some_query_str = "let a = 5"

Even though there is a “String” node, it stays colored red as a string and is not highlighted as the nested language.

NickTomlin · February 18, 2024, 2:10am

What have you tried so far? Can you reproduce in a sandbox?

Here is an example that mixes yaml with JavaScript that should be similar.

The SQL package does expose a number of different dialects, so it’s potentially an issue with how you are configuring the parser and the language support that is required for highlighting.

yogevyuval · February 18, 2024, 3:06pm

@NickTomlin
Check out this example:

I switched to JS inside Python and changed the searched node to “String”.

The string is shown in red, without JS highlighting.

NickTomlin · February 18, 2024, 4:06pm

Ah, thank you for that Playground example, very interesting!

I did some more playing around, and it looks like the mixed parser does parse the string as JS, but autocomplete and highlighting are disabled.

If I use a multiline string, autocomplete works for JS, but the syntax highlighting still does not.

I wonder if this is because it’s not enough to simply replace the String node. Perhaps it’s worth looking into overlay nesting, to replace only the content within the String node itself.

yogevyuval · February 18, 2024, 6:54pm

Yeah, the only thing that worked was using overlays, like in the example here.

Super hard to understand / debug this with the current docs

NickTomlin · February 18, 2024, 7:36pm

Glad you figured it out! Yes, it’s a bit tricky, I’ve also found myself cross-referencing a lot of different docs, examples, and threads here to solve for my own use case.

If you don’t mind, would you be able to share an example playground, snippet, or repo for future travelers (one of whom may be myself)?

marijn · February 18, 2024, 8:25pm

What you were doing was parsing the entire Python string, including the quotes, as JavaScript. That will give you string highlighting.

NickTomlin · February 20, 2024, 1:56am

“What you were doing was parsing the entire Python string, including the quotes, as JavaScript”

Could you provide some more details on the correct approach here? The way this is written, it sounds like what the OP (and I, to some extent) were trying to do.

Here’s a somewhat isolated playground that does the following:

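import {parseMixed} from "@lezer/common"
import {javascriptLanguage} from "@codemirror/lang-javascript"
import {pythonLanguage} from "@codemirror/lang-python"

// Assumption: the playground defines activateOnNodes as a Set of node names
const activateOnNodes = new Set(["String"])

// Parse every matching Python node, quotes included, as JavaScript
const wrap = parseMixed((node, input) => {
  return activateOnNodes.has(node.name)
    ? {parser: javascriptLanguage.parser}
    : null
});

const mixedParserLanguage = pythonLanguage.configure({
  wrap,
});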

The result is a parse tree that looks like this:

Script
  AssignStatement
    VariableName
    AssignOp
    Script
      ExpressionStatement
        String

Which does seem backwards (I’d expect this to be String > Script > etc.). Is that where the ranges in overlay come into play?

marijn · February 20, 2024, 8:02am

If you don’t want the quotes to be part of the inner language’s document, use an overlay.

Non-overlay nested parses replace the entire node with the parse tree. The Script(ExpressionStatement(String)) is the JavaScript parse tree. Nothing is backwards.
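
To make the difference concrete, here is a toy model in plain JavaScript of the text the inner parser ends up seeing for a String node, with and without overlay ranges. It does not call CodeMirror or Lezer, and the function name innerDocument is made up for illustration.

```javascript
// Toy model only: this is not how Lezer is implemented internally.
// `node` is a {from, to} range covering the Python String node, quotes included.
// `overlay` is an optional array of {from, to} ranges, as passed to parseMixed.
function innerDocument(doc, node, overlay) {
  if (!overlay) {
    // Non-overlay nesting: the whole node, quotes and all, goes to the inner parser.
    return doc.slice(node.from, node.to);
  }
  // Overlay nesting: only the listed ranges are handed to the inner parser.
  return overlay.map(r => doc.slice(r.from, r.to)).join("");
}

const doc = 'some_query_str = "let a = 5"';
const node = { from: doc.indexOf('"'), to: doc.length };

const whole = innerDocument(doc, node);
const inner = innerDocument(doc, node, [{ from: node.from + 1, to: node.to - 1 }]);

console.log(whole); // "let a = 5"  (a JavaScript string literal, so it is highlighted as a string)
console.log(inner); // let a = 5    (an actual JavaScript statement)
```

Without the overlay, the JavaScript parser receives a quoted literal, which is why the result is Script(ExpressionStatement(String)).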

NickTomlin · February 20, 2024, 8:08pm

Ah! I think I was getting confused by the fact that the Python and JavaScript grammars both have nodes named Script (the @top rule) and String; this makes sense now.

I think I’m still struggling with the appropriate way to use overlay in a situation like this, with a node like String.

const wrap = parseMixed((node, input) => {
  if (!activateOnNodes.has(node.name)) { return null }
  if (input) {
    console.log('name', node.type.isTop, node.name, 'input:', input.string.slice(node.from + 1, node.to - 1))
  }
  return {
    parser: javascriptLanguage.parser,
    // naive way of trying to overlay just the JS not the wrapping `"`
    // "const x = 1"
    // ==>
    // const x = 1
    overlay: {
      from: node.from + 1,
      to: node.to - 1
    }
  }
});

The console.log outputs the correct “slice” of text:

name String input: const x = 1

sandbox

marijn · February 20, 2024, 8:23pm

Is anything going wrong with the thing you’re doing in that code?

NickTomlin · February 21, 2024, 8:10pm

yes

I adapted this to match the original linked post and take into account the closing " and it works.

I’ve noticed that I need to use the array form of overlay, readonly {from: number, to: number}[].

E.g.

const overlay = [{
  from,
  to: node.to - (closingQuote ? 1 : 0)
}]

works, but the same information as a bare object does not:

const overlay = {
  from,
  to: node.to - (closingQuote ? 1 : 0)
}

Here’s a “full” example that copies the code from that post wholesale and is much more resilient to different forms of String.
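
For readers without access to that example, a rough sketch of the kind of range computation involved might look like the following. The helper name stringContentOverlay is hypothetical, and the handling is an assumption about common Python string forms (optional prefix letters, single or triple quotes, a possibly missing closing quote). It does not deal with escaped quotes at the end of an unterminated string or with f-string replacement fields.

```javascript
// Hypothetical helper, not part of CodeMirror or Lezer.
// `text` is the source text of a Python String node, `from` its start offset.
// Returns an array with one {from, to} range covering only the string content,
// or null when there is nothing to nest.
function stringContentOverlay(text, from) {
  // Optional prefix letters (r, b, u, f, ...) followed by the opening quote.
  const m = /^([A-Za-z]*)('''|"""|'|")/.exec(text);
  if (!m) return null;
  const [, prefix, quote] = m;
  const start = prefix.length + quote.length;
  // The closing quote may be missing while the user is still typing.
  const closed = text.length >= start + quote.length && text.endsWith(quote);
  const end = text.length - (closed ? quote.length : 0);
  if (end <= start) return null; // empty string content
  return [{ from: from + start, to: from + end }];
}

const line = 'some_query_str = "let a = 5"';
const quoteAt = line.indexOf('"');
const ranges = stringContentOverlay(line.slice(quoteAt), quoteAt);
console.log(ranges.map(r => line.slice(r.from, r.to))); // the content without its quotes
```

Inside a parseMixed callback, the node text could be obtained with input.read(node.from, node.to), and a non-null result returned as the overlay next to the inner parser.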

NickTomlin · February 21, 2024, 8:11pm

In general, overlays are tricky because I haven’t found a way to get information out about what is wrong. Is there diagnostic information in the tree that I could use to help surface issues like the range issue above?
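
Short of built-in diagnostics, one option is a small sanity check run on the overlay value before returning it from the parseMixed callback. The sketch below is a hypothetical helper, not part of Lezer or CodeMirror, and the rules it checks (array or function, non-empty ranges, inside the node, in ascending order) are assumptions about what a well-formed overlay looks like rather than anything taken from the library source.

```javascript
// Hypothetical debugging helper. Returns a list of human-readable problems;
// an empty list means nothing suspicious was found.
function overlayProblems(overlay, node) {
  if (typeof overlay === "function") return []; // function form: nothing to check up front
  if (!Array.isArray(overlay)) {
    return ["overlay should be an array of {from, to} ranges or a function, got " + typeof overlay];
  }
  const problems = [];
  let prevTo = -Infinity;
  overlay.forEach((r, i) => {
    if (!(r.from < r.to)) problems.push(`range ${i} is empty or inverted`);
    if (r.from < node.from || r.to > node.to) problems.push(`range ${i} falls outside the node`);
    if (r.from < prevTo) problems.push(`range ${i} overlaps or precedes the previous range`);
    prevTo = Math.max(prevTo, r.to);
  });
  return problems;
}

const stringNode = { from: 17, to: 28 };
console.log(overlayProblems({ from: 18, to: 27 }, stringNode));   // one problem: not an array
console.log(overlayProblems([{ from: 18, to: 27 }], stringNode)); // no problems
```

Logging the result during development would have flagged the bare-object overlay immediately.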

yogevyuval · February 23, 2024, 4:32pm

@marijn Thanks, I got it working with the overlays method.

I’m using this inside JupyterLab, and it seems that pretty much everything is highlighted with the same “keyword” color. Can I customize it so that builtins are highlighted differently? Judging from the DOM, builtins / types are not getting any class names. I tried using styleTags, but with no success.

For reference my mixed language looks like this:

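import {LRLanguage} from "@codemirror/language"
import {styleTags, tags as t} from "@lezer/highlight"

// mixedParser is assumed to be the Python parser wrapped with parseMixed (defined elsewhere)
const myMixedPythonLRLanguage = LRLanguage.define({
  parser: mixedParser.configure({
    props: [
      styleTags({
        Type: t.number,
        Identifier: t.number
      })
    ]
  })
})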

marijn · February 23, 2024, 4:48pm

No, ‘builtins’ aren’t parsed differently from regular identifiers, so you can’t highlight them differently.

yogevyuval · February 23, 2024, 5:00pm

Basically, I’m trying to override the styleTags defined in lang-sql. When I edit the source code, I can change the highlighting for Type, for example; I’m just trying to override it from the outside, without changing the source code.

marijn · February 23, 2024, 5:26pm

language.configure({props: [styleTags(...)]}) should return a new language with the given tags added.

yogevyuval · February 23, 2024, 5:50pm

This is exactly what I’m doing, but it’s not working. Could it be because I’m trying to override existing styleTags? Or because the tags I’m changing are inside the inner language?

yogevyuval · February 23, 2024, 6:08pm

From what I see, I need to configure the inner language of lang-sql, but since it’s readonly and the constructor is private, I’m not sure I can access it. Any workaround?