closed Could we add syntax diagrams?

+3 votes
asked Sep 21, 2022 in Wanted features by Todd Musheno (2,680 points)
closed Dec 5, 2022 by Todd Musheno

I would like to be able to generate syntax diagrams (also known as railroad diagrams):

https://en.wikipedia.org/wiki/Syntax_diagram

I would suggest starting with vanilla EBNF, as it's the most widely used, and for most cases this should be "good enough".

A good site that can do this well is:

https://bottlecaps.de/rr/ui

It might be worth discussing whether you should support a whole syntax tree, or just one rule at a time.

I would LIKE to see support for a full syntax, but I think a single rule at a time may be enough.

closed with the note: Good enough for beta
commented Nov 8, 2022 by Todd Musheno (2,680 points)

Outside documentation...

with https://forum.plantuml.net/16781/allow-special-sequence-management-special-sequence-symbol

I think we are now good for public "beta" release.

1 Answer

0 votes
answered Sep 21, 2022 by plantuml (294,960 points)

Sure, having support for https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form would be nice.

For example, we could have:

@startebnf
letter = "A" | "B" | "C" | "D" | "E" | "F" | "G"
       | "H" | "I" | "J" | "K" | "L" | "M" | "N"
       | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
       | "V" | "W" | "X" | "Y" | "Z" | "a" | "b"
       | "c" | "d" | "e" | "f" | "g" | "h" | "i"
       | "j" | "k" | "l" | "m" | "n" | "o" | "p"
       | "q" | "r" | "s" | "t" | "u" | "v" | "w"
       | "x" | "y" | "z" ;
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
symbol = "[" | "]" | "{" | "}" | "(" | ")" | "<" | ">"
       | "'" | '"' | "=" | "|" | "." | "," | ";" ;
character = letter | digit | symbol | "_" ;
 
identifier = letter , { letter | digit | "_" } ;
terminal = "'" , character , { character } , "'"
         | '"' , character , { character } , '"' ;
 
lhs = identifier ;
rhs = identifier
     | terminal
     | "[" , rhs , "]"
     | "{" , rhs , "}"
     | "(" , rhs , ")"
     | rhs , "|" , rhs
     | rhs , "," , rhs ;

rule = lhs , "=" , rhs , ";" ;
grammar = { rule } ;
@endebnf

The drawing part is easy.

The most complex part is parsing EBNF.
Unfortunately, we cannot find any free, easy-to-integrate Java EBNF parser, and we don't really have time to write such a parser.

However, if anyone with Java skills is willing to write such a parser, we would be glad to implement the drawing part!
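
To give a feel for the first step of such a parser, here is a minimal tokenizer sketch, written in Python rather than Java for brevity. It is not PlantUML's actual code: the `tokenize` function and the token names are hypothetical, and it only covers the subset of EBNF used in the example grammar above.

```python
import re

# Hypothetical token patterns covering only the EBNF subset used in the example above.
TOKEN_PATTERN = re.compile(r"""
    (?P<TERMINAL>"[^"]*"|'[^']*')          # quoted terminal string
  | (?P<IDENTIFIER>[A-Za-z][A-Za-z0-9_]*)  # rule name
  | (?P<SYMBOL>[=|,;\[\]{}()])             # EBNF punctuation
  | (?P<SPACE>\s+)                         # whitespace, skipped
""", re.VERBOSE)


def tokenize(text):
    """Split EBNF text into (kind, value) pairs; raise on unknown characters."""
    tokens = []
    position = 0
    while position < len(text):
        match = TOKEN_PATTERN.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character {text[position]!r} at {position}")
        if match.lastgroup != "SPACE":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


for kind, value in tokenize('character = letter | digit | symbol | "_" ;'):
    print(kind, value)
```

A real parser would still have to build a tree from these tokens (alternation, concatenation, grouping), which is where most of the work lies.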

commented Sep 22, 2022 by Todd Musheno (2,680 points)
Good news: I am a Java developer and can work on it over the weekends, if you could detail precisely what you need!

What are you using as a parser at the moment for other file types?

Any help would be... helpful ;-)
commented Sep 22, 2022 by Todd Musheno (2,680 points)

Also any reason not to do @startsyntax?

I would think EBNF would be a good starting place, as just about everyone in that space supports at least that...

There are a couple of exceptions from the COBOL days, but I would just force those people to convert their BNF/more restrictive diagrams to EBNF... good news: the exceptions are all less rich, just with a different syntax for their syntax documents (I hope that sentence makes sense)

There are some syntax types that may need more, but for now I would suggest that's out of scope, and they are all extensions of EBNF.

commented Sep 23, 2022 by plantuml (294,960 points)

This is really an alpha version, but we have given it a try.

@startebnf
character = letter | digit | symbol | "_" ;
@endebnf

If you are curious, you can have a look at the code.

Right now, only "alternation" is working.

Any thoughts?

commented Sep 23, 2022 by Todd Musheno (2,680 points)

Looks great!

You might want to provide some way to visually distinguish between terminal strings and identifiers on the rhs (right hand side), and I am not sure if you want to be able to do full grammars or just one rule at a time.

Also I am guessing people will eventually want full ebnf support at least, so you may want to figure out how to distinguish between types visually (terminals, identifiers, optionals, comments, etc...)

The stuff beyond ebnf will all be additional types in exactly this sense, but you are looking at a handful of types, so letters or something may suffice.

Also I would simply list each rule one at a time... it's common for syntaxes to be recursive, so trying to unroll horizontally... that way lies madness. (also, not sure if you want to be able to switch between horizontal and vertical display, but... I do not see why one could not do that).

https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form#Table_of_symbols

commented Sep 23, 2022 by Todd Musheno (2,680 points)
One minor point...

your diagram should at least indicate it's for the "character" rule given that input...

If you look at the example on wikipedia you will see a "header" for each rule, so if you only have one rule you will still need to know the "identifier" for the rule.
commented Sep 23, 2022 by Todd Musheno (2,680 points)

You also may want to add something like an incoming arrow ▶/⮞/etc... and outgoing circle ⚫/⚪/etc... to indicate direction... I don't want to give too much detail on the exact symbol/visual you use though... I have seen this done a million ways; I think being consistent with other diagrams is more important than whatever I think.

minor point

commented Sep 26, 2022 by Todd Musheno (2,680 points)
Looked over the code this weekend...

There is some stuff I don't 100% understand, but I think it's all UI related.

Outside that, it all seems reasonable to me.
commented Sep 27, 2022 by plantuml (294,960 points)

We have made some progress. :-)

I do have a question about repetition { ... } .

It seems that this repetition means "0 or more" (that is, the group may be empty).

However, I find that the graphical representation is confusing. It looks like "1 or more" :

Do you see what I mean?

commented Sep 28, 2022 by The-Lu (63,920 points)
edited Sep 28, 2022 by The-Lu

Hello PlantUML,

> I do have a question about repetition { ... } .

Instead of ["1 or more"]:

         ┏━━━━━━┓    
┣━━━━━╭━━┫ char ┣━━╮━━━━━━>
      ┃  ┗━━━━━━┛  ┃ 
      ╰━━━━━━<━━━━━╯

> However, I find that the graphical representation is confusing. It looks like "1 or more" :
> Do you see what I mean?

Yes, that is the reason why some other tools use instead ["0 or more"]:

┣━━━━╭━━━━━━━━━━━━━━╮━━━━>
     ┃   ┏━━━━━━┓   ┃
     ╰━━━┫ char ┣━━━╯
         ┗━━━━━━┛

         ┏━━━━━━┓
     ╭━━━┫ char ┣━━━╮ 
     ┃   ┗━━━━━━┛   ┃
┣━━━━╰━━━━━━━━━━━━━━╯━━━━>

   ╭━━━━━━━━>━━━━━━━━━╮ 
   ┃     ┏━━━━━━┓     ┃
┣━━╯━━╭━━┫ char ┣━━╮━━╰━━━> 
      ┃  ┗━━━━━━┛  ┃  
      ╰━━━━━━<━━━━━╯

      ╭━━━━━━━━<━━━━━━╮ 
      ┃   ┏━━━━━━┓    ┃
┣━━╮━━╰━━━┫ char ┣━━━━╯━━╭━━━> 
   ┃      ┗━━━━━━┛       ┃  
   ╰━━━━━━━━━>━━━━━━━━━━━╯

See example of `new-line` or `integer` on:

Or `Grammar` or `StringLiteral` on:

If that can help,
Regards,
Th.

commented Sep 28, 2022 by Todd Musheno (2,680 points)

Everything goes from left dot (start of rule) to right dot (end of rule)...

To read, put your finger on the left dot, then, without lifting it, trace your way to the right dot: any path you take is a valid use of the syntax. If you reach a rule, jump to that rule and follow the same process there, popping back to where you were when done (think subroutine).

So given the diagram you posted: you must pass through the left quote, then the "character", but then you have a choice you can:

  • continue to the right quote
  • OR follow the other line, and go back to character

I hope that makes sense on why that is a 1 or more...

To encode 0 or more characters, there should be a line between the 2 quotes that does not pass through the "character"... I hope that helps.

So yes, given the diagram you posted... that is a 1 or more.

0 or more would have a line going from left to right quote without going through character (hence 0 characters), so if you drop the "character" from the top line to the bottom, you would get a 0 or more.
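
The finger-tracing rule can be sketched as a walk over a small directed graph. This is only an illustration in Python: the node names (`start`, `before`, `after`, `end`) and the `trace_paths` helper are made up for this post and say nothing about how PlantUML draws or stores its diagrams.

```python
def trace_paths(edges, start, end, max_steps):
    """Return every label sequence found by walking from start to end.

    `edges` maps a node to a list of (label, next_node) pairs; a label of
    None is a plain line that passes through no box.
    """
    results = set()

    def walk(node, labels, steps):
        if node == end:
            results.add(tuple(labels))
        if steps == max_steps:
            return
        for label, next_node in edges.get(node, []):
            extended = labels if label is None else labels + [label]
            walk(next_node, extended, steps + 1)

    walk(start, [], 0)
    return results


# Loop only: every path must pass through the "char" box at least once.
one_or_more = {
    "start": [(None, "before")],
    "before": [("char", "after")],
    "after": [(None, "before"), (None, "end")],  # loop back, or continue
}

# Same diagram plus a bypass line that skips the "char" box.
zero_or_more = {
    "start": [(None, "before")],
    "before": [("char", "after"), (None, "after")],
    "after": [(None, "before"), (None, "end")],
}

print(() in trace_paths(one_or_more, "start", "end", 5))   # False: empty input is rejected
print(() in trace_paths(zero_or_more, "start", "end", 5))  # True: the bypass allows it
```

The `max_steps` limit is only there to stop the walk from looping forever.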

commented Sep 28, 2022 by Todd Musheno (2,680 points)
TH has that spot on... refer to his diagram!
commented Sep 28, 2022 by plantuml (294,960 points)

> I hope that makes sense on why that is a 1 or more...
>

Sure, 100% agree: the drawing means "1 or more"

My issue is that when you are writing:

identifier = letter , { letter | digit | "_" } ;

I'm pretty sure that the repetition group here means "0 or more". Otherwise, it would mean that an identifier cannot have a single letter.

Agree?

commented Sep 28, 2022 by Todd Musheno (2,680 points)

Disagree: that says a letter, followed by zero or more of any of { letter, digit, or a single underscore character }

so valid:

  • aaaaaa
  • a1b2
  • d3_4____5bbb

invalid:

  • empty string
  • 1234abc
  • a1_!

Hope that helps
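
For anyone who wants to check these cases mechanically, here is a small Python sketch. It assumes `letter` means A-Z/a-z and `digit` means 0-9, as in the grammar earlier in the thread; `is_identifier` is just a name picked for this example.

```python
import re

# identifier = letter , { letter | digit | "_" } ;
# One letter, then zero or more letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_identifier(text):
    """True when the whole string matches the identifier rule."""
    return IDENTIFIER.fullmatch(text) is not None


for candidate in ["aaaaaa", "a1b2", "d3_4____5bbb", "a", "", "1234abc", "a1_!"]:
    print(repr(candidate), is_identifier(candidate))
```

The `*` after the second character class is the "0 or more" of the curly braces, which is why a single letter is accepted.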

commented Sep 28, 2022 by Todd Musheno (2,680 points)
edited Sep 28, 2022 by Todd Musheno
The stuff in the curly braces is 0 or more... so a single letter is required.
commented Sep 28, 2022 by Todd Musheno (2,680 points)
And yes "a" would be a valid identifier!
commented Sep 28, 2022 by plantuml (294,960 points)

> And yes "a" would be a valid identifier!

So my drawing of:

@startebnf
identifier = letter , { letter | digit | "_" } ;
@endebnf

is invalid because "a" does follow this simple rule.

And my corresponding drawing is:  

This drawing does not work for "a".

commented Sep 28, 2022 by Todd Musheno (2,680 points)
No...

To match the letter "a" you would need a line going from the left letter directly to the end...

That is almost correct; the bottom line is needed to show you can have more than one of the choice.

You also need another line indicating you can skip the choice (where the line curves the other way)
commented Sep 28, 2022 by Todd Musheno (2,680 points)
Some differentiation between start and end would be helpful from a visual perspective as well.
commented Sep 28, 2022 by Todd Musheno (2,680 points)
The character vs identifier visual looks great IMHO btw.
commented Sep 28, 2022 by Todd Musheno (2,680 points)
The header also looks fine, not 100% sure you need the colon ":", but seen it both ways in the wild
commented Sep 28, 2022 by The-Lu (63,920 points)
edited Sep 29, 2022 by The-Lu

Hello all,

After some searching, I came across this nice comparison:

Then on ISO EBNF:

Code      Meaning
{...}     Repetition: 0 or more
{...}-    Repetition: 1 or more [due to: syntactic-term with an empty syntactic-exception.]
n*...     Repetition: n times

So '{...}' really is '0 or more'...
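
The three ISO forms differ only in how many repeats they allow. As a rough Python sketch (the helper names are invented for this example, and the repeated item is simplified to a fixed string):

```python
def repeat_matches(text, item, minimum, maximum=None):
    """True when text is `item` repeated between minimum and maximum times.

    maximum=None means there is no upper limit.
    """
    if not item:
        raise ValueError("item must not be empty")
    count, remainder = divmod(len(text), len(item))
    if remainder != 0 or item * count != text:
        return False
    return count >= minimum and (maximum is None or count <= maximum)


def zero_or_more(text, item):      # { item }
    return repeat_matches(text, item, 0)


def one_or_more(text, item):       # { item }-
    return repeat_matches(text, item, 1)


def exactly(text, item, n):        # n * item
    return repeat_matches(text, item, n, n)


print(zero_or_more("", "a"))     # True: the group may be empty
print(one_or_more("", "a"))      # False: at least one repeat is required
print(exactly("2222", "2", 4))   # True: exactly four repeats
```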

@PlantUML: is it planned to support 'n*...' in PlantUML?

Have a good night...
Regards.

commented Sep 28, 2022 by The-Lu (63,920 points)

Hello PlantUML team,

Here is a test (on V.1.2022.9):

@startebnf
fieldlist = field, {fieldsep, field}, [fieldsep];
@endebnf

  • Why are the order and grouping not respected?
Thanks for the improvement, if it is not already corrected in V10...

Then an open question... from BNF to EBNF or ABNF:

  • Could you allow a space (' ') as a concatenation symbol (like ',')?
    Or would you rather conform [strictly] to ISO EBNF, without the BNF inheritance?
Regards,
Th.
commented Sep 29, 2022 by Todd Musheno (2,680 points)
allowing ' ' seems reasonable to me, but not required.
commented Sep 29, 2022 by Todd Musheno (2,680 points)

That diagram seems odd to me, I do not think it's correct given the input.

I would START with ebnf... if you wish to add stuff (eg: n* style repetition), I would say that's fine. I am more concerned with being able to generate the diagrams, and most people in that area are familiar with EBNF (it's kinda the standard)

Having said that... yes, if you require 17 'a's, strict ebnf will be just awful, but outside ebnf everyone has their own views, so I would feel free to "augment" ebnf as you like; just be sure to document everything outside ebnf.

@startebnf

bunchoas = 'a', 'a', 'a', 'a', 'a', 'a', 'а', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a';

@endebnf

Can you catch the issue (look at the 7th 'a')
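
A quick way to catch this kind of thing is to scan the rule for non-ASCII characters. The Python sketch below assumes the odd one out is the Cyrillic letter U+0430, which looks identical to a Latin 'a' in most fonts; the rule is shortened and the character is written as an escape so it is visible. `find_non_ascii` is a name made up for this example.

```python
import unicodedata


def find_non_ascii(text):
    """Return (index, character, Unicode name) for each non-ASCII character."""
    return [
        (index, char, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if ord(char) > 127
    ]


# The 7th terminal uses U+0430 (written as an escape so it is visible here).
rule = "bunchoas = 'a', 'a', 'a', 'a', 'a', 'a', '\u0430', 'a';"
for index, char, name in find_non_ascii(rule):
    print(index, name)
```

With a repetition shortcut such as `n * 'a'` there would be only one terminal to get wrong.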

commented Sep 29, 2022 by Todd Musheno (2,680 points)
Be careful: ebnf is a lot different from bnf. ()s are grouping in ebnf, and repetition in bnf.

I would stick to ebnf; there exist grammars that are not representable in bnf but are in ebnf.
commented Sep 29, 2022 by plantuml (294,960 points)
  • Let's stick to ebnf right now, so no support for "space" as separator
  • ...but we are now supporting }- since it was very easy to implement
We have fixed several parsing bugs and improved display. I think we have something usable now :-)
@startebnf
string_maybe_empty = "'" , { character } , "'"
       | '"' , { character } , '"';
       
string_non_empty = "'" , { character }- , "'"
       | '"' , { character }- , '"';
       
       
@endebnf
As usual, all comments are welcome, this is not a final version.
(especially if you find issues)
Thanks
commented Sep 29, 2022 by Todd Musheno (2,680 points)
I think you are 90% there!
commented Sep 29, 2022 by Todd Musheno (2,680 points)
Love the start/end dots!
commented Sep 29, 2022 by The-Lu (63,920 points)

Good improvement... but I will have some remarks...

I have a little trouble with the double paths...

I will test and come back to you...

commented Sep 29, 2022 by The-Lu (63,920 points)
edited Sep 29, 2022 by The-Lu

Why:

> I have a little trouble with the double paths...

Here is an example:

@startebnf
OppositeDirection = {'a', 'z'};
@endebnf

With the new design:

  • "" matches if you go straight
  • "az" matches if you take the right-hand path once
  • "azaz..." matches if you keep taking the right-hand path around the loop
  • But "za" also matches if you go straight and then take the loop in the opposite direction!

How to make the bottom loop 'One Way'?

Perhaps the loop would have to go below the first straight line...

┣━━╮━━━━━━━>━━━━━━━╭━━━> 
   ┃               ┃
   ╭━━━━━━━<━━━━━━━╮
   ┃   ┏━━━━━━┓    ┃
   ╰━>━┫ char ┣━>━━╯
       ┗━━━━━━┛   
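
To make the 'One Way' point concrete, here is a small Python sketch that treats the drawing as a graph. The node names (`left`, `mid`, `right`) and the `accepted_strings` helper are hypothetical, and the graph is only one way to model the drawing discussed here, not PlantUML's internal layout.

```python
def accepted_strings(edges, start, end, max_steps):
    """Collect the strings spelled out by walks from start to end."""
    found = set()
    stack = [(start, "", 0)]
    while stack:
        node, text, steps = stack.pop()
        if node == end:
            found.add(text)
        if steps < max_steps:
            for label, next_node in edges.get(node, []):
                stack.append((next_node, text + label, steps + 1))
    return found


# Hypothetical node names: "left" and "right" are the two junctions of the loop.
one_way = {
    "start": [("", "left")],
    "left": [("", "right"), ("a", "mid")],   # straight line, or enter the loop
    "mid": [("z", "right")],
    "right": [("", "end"), ("", "left")],    # leave, or go back for another round
}

# Same picture without arrows: every line can be walked in both directions.
two_way = {node: list(targets) for node, targets in one_way.items()}
for node, targets in one_way.items():
    for label, next_node in targets:
        two_way.setdefault(next_node, []).append((label, node))

print("za" in accepted_strings(one_way, "start", "end", 7))  # False
print("za" in accepted_strings(two_way, "start", "end", 7))  # True
```

Without a direction on each line, the reader can spell "za"; with arrows, only "", "az", "azaz", ... remain.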

Thanks for the new improvements...

Regards,
Th.

commented Sep 29, 2022 by The-Lu (63,920 points)

Why not put the backward loop at the bottom,
below the data, as:

┣━━╮━━━━━━━>━━━━━━━━╭━━━> 
   ┃    ┏━━━━━━┓    ┃
   ╰╭━>━┫ char ┣━>━╮╯
    ┃   ┗━━━━━━┛   ┃
    ╰━━━━━━<━━━━━━━╯

Or why not use the following (recycling your work on the '['/']' option and just adding a loop below):

   ╭━━━━━━━━>━━━━━━━━━╮ 
   ┃     ┏━━━━━━┓     ┃
┣━━╯━━╭━━┫ char ┣━━╮━━╰━━━> 
      ┃  ┗━━━━━━┛  ┃  
      ╰━━━━━━<━━━━━╯

If that can help the debate with others...

Regards.

commented Sep 30, 2022 by Todd Musheno (2,680 points)
I think that's an exhaustive list of options... good work Th.

I like the last diagram best, I personally think it screams optional loop.
commented Sep 30, 2022 by Todd Musheno (2,680 points)
The more I think about it...

I would have optionals go one direction and loops the other.

Optionals on top, loops on bottom is common, but as long as it's consistent, I think you are fine.
commented Oct 1, 2022 by The-Lu (63,920 points)

Here is also an overlapping issue:

@startebnf
overlapping = [[ '+'|'-' ], {int}-];
@endebnf

Regards.

commented Oct 3, 2022 by plantuml (294,960 points)

We've made several improvements in the latest online version.

We've temporarily added a new pragma so that you can compare:

and

It's a question of taste but we find that there are too many lines in the second option and that the first drawing is nicer.

We let you all play with this so that you can tell us what you think about it :-)

commented Oct 3, 2022 by The-Lu (63,920 points)

> We let you all play with this so that you can tell us what you think about it :-)

Good improvement...

Here are some tests before the night...

1. Could you make a syntax error for this example:

@startebnf
test = [[[[[['a'|'b'];
@endebnf

2. Could you improve the SVG output (at the inflection point of the up path)? Compare:

First diagram:
@startebnf
test = ['a'];
@endebnf
Inflection point of the up path: OK

Second diagram:
@startebnf
test = ['a'|'b'];
@endebnf
Inflection point of the up path: KO (lines overlap at the two inflection points of the up path)

To see this minor issue, scale the SVG output by a factor of 400%.
See also the arrows (they seem doubled at 400%)...

3. Could you support the repetition `n*...` syntax (conforming to ISO EBNF)?

@startebnf
test = 4 * '2'
@endebnf

See `repetition-symbol` on §2 of convention on:

4. More and more: could you plan to add common elements to `ebnf` diagrams (scale, title, legend, style,...)?

I'm going back to testing....
Regards,
Th.

commented Oct 3, 2022 by Todd Musheno (2,680 points)

Looking good...

Although the arrows are uncommon, I think they add to the readability.

Here are a couple real world examples to test against:

I think if those come out ok, you are probably "good enough" for a first release!
commented Oct 3, 2022 by The-Lu (63,920 points)

[FYI: Test to improve/confirm the implementation]

Adapted from your second link:

Here is an attempt:

@startebnf
s_expression = atomic_symbol | "(", s_expression, ".", s_expression, ")" | list;
list = "(", s_expression, '<', s_expression, '>', ")";
atomic_symbol = letter, atom_part;
atom_part = empty | letter, atom_part | number, atom_part;
letter = "a" | "b" | "..." | "z";
number = "1" | "2" | "..." | "9";
empty = " ";
@endebnf

And the result:

There is a question about '<' & '>', but I don't know Lisp well...

commented Oct 4, 2022 by Todd Musheno (2,680 points)

Those are literal < (less than) and > (greater than) characters...

Your diagram seems spot on for vanilla lisp, as best I can tell... the "..." is badly documented on the website, but that's no fault of the diagram (gigo).

This brings us to a great example of things beyond ebnf:

"number" is really just 0-9... there is no consistent way to shorten this, and it is not part of any standard, so that would be a nice-to-have, but for a first cut I think we should stick to the standard... I have seen regexes allowed in place of characters in the wild a lot, but there seems to be no standard syntax: everyone does it their own way and explains their approach in their document.

I have seen at least the following styles:

  • number = [0-9];   // these first 2 I have seen also match multiple characters, same syntax, different documents and authors.
  • number = {0-9}
  • number = ^[0-9]$
  • numbers = ^[0-9]+$
  • number = /[0-9]/
  • numbers = /[0-9]+/
  • number = {{regex: [0-9]}}   // also seen where the number of curly braces can be any number of characters
  • number = ${regex: [0-9]}
  • number = {any decimal digit}      // and other similar styles where the actual character values are just described.

As you can tell, many ways to skin that cat.

You will probably get multiple requests similar to this kind of thing after 1st release, so be prepared.
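
Until one of these shorthands is supported, a range can always be spelled out in standard EBNF. A small Python sketch of that expansion (the `expand_range` helper is hypothetical, and it does not handle a range that contains the `"` character itself):

```python
def expand_range(rule_name, first, last):
    """Spell out a character range as a standard EBNF alternation."""
    if len(first) != 1 or len(last) != 1 or first > last:
        raise ValueError("expected two single characters in ascending order")
    terminals = [f'"{chr(code)}"' for code in range(ord(first), ord(last) + 1)]
    return f"{rule_name} = " + " | ".join(terminals) + " ;"


print(expand_range("number", "0", "9"))
# number = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
```

The output can be pasted between @startebnf and @endebnf as-is.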

commented Oct 4, 2022 by Todd Musheno (2,680 points)
Repetition shortcuts are another example of non-standard syntax that's just all over the place (again, tackle this later).

Basically it covers ranges (example: 5 to 50 times) with non-standard syntax and diagrams.
commented Oct 4, 2022 by Todd Musheno (2,680 points)
The last edge case I can think of off hand:

Comments: I would suggest you add some way to add notes to things, but again I would suggest doing this after a first cut.
commented Oct 4, 2022 by plantuml (294,960 points)
About ranges (5 to 50 times) and notes, it would be nice to open two new different questions/topics: this one is too long now :-)

And it would be nice if you could give examples with some short diagram texts (because the syntax is not clear to me) and some link to actual images (because we don't know how to draw this).

Thanks for your ideas!
commented Oct 4, 2022 by Todd Musheno (2,680 points)
I think you are golden, apart from the lines around terminals overlapping, which makes it hard to tell what's going out/in.

Basically make this diagram not have overlapping lines:
https://www.plantuml.com/plantuml/uml/SoWkIImgIKtAI-CgIYrGi5MeJgwrvd98pKi1YG40
commented Oct 6, 2022 by Todd Musheno (2,680 points)
Identifier and Terminal are still not readable.
commented Oct 6, 2022 by plantuml (294,960 points)

> Identifier and Terminal are not readable still.

Sorry, I don't understand what "not readable" means. Could you post an example? Thanks!

commented Oct 6, 2022 by Todd Musheno (2,680 points)
https://www.plantuml.com/plantuml/uml/SoWkIImgIKtAI-CgIYrGi5MeJgwrvd98pKi1YG40

You can't tell what is going in/out of c, as the loop line is over the cut-through line...

This should have a line that cuts through c, and one that loops back...

They should be distinct, so one can follow the lines.
commented Oct 6, 2022 by Todd Musheno (2,680 points)
I think your looping back line is going over your continue through line?
commented Oct 6, 2022 by The-Lu (63,920 points)

Hello PlantUML team, and all,

Good improvement: with style and commons management.

For the pragma and the different layouts...
That depends on taste. But I prefer the expanded form (more explicit, ... though I understand the collapsed form)

@startebnf
!pragma theo
rep = {"a", c , "a"};
@endebnf

- Why not to put the loop below of the label `a, c, a`?

I appreciate to have a pragma to my first name! But....

- Could you keep the `theo` pragma and rename it to:

  • ebnfexpanded or ebnf_expanded or ebnf expanded
  • ebnflegacy or ebnf_legacy or ebnf legacy
  • or other words... (Today, I have no other ideas)

- Does that make sense?


Then, for the other wanted features:
ISO EBNF:
- [ ] Allow repetition symbol "*..."
- [ ] Allow full restriction management except symbol "...-" [perhaps with a 'not' management]

Not ISO EBNF:
- [ ] Allow space as separator

PlantUML ecosystem:
- [x] Allow Commons (title, legend, ...) [management with the last version]
- [x] Allow Scale [done]
- [x] Allow `skinparam handwritten true` [done]
- [ ] Allow Creole (I don't know whether it is a good idea or not... so as not to interfere with ebnf syntax!)

@startebnf
rep = {"<color:red>c</color>"};
@endebnf


- [ ] Allow Sub-diagram

@startuml
component A [
{{
state a
}}
{{ebnf
rep = {c};
}}
]
@enduml


- [ ] Allow Style [partial management with the last version]

@startebnf
<style>
element {
  ebnf {
  Fontcolor blue
  Backgroundcolor palegreen
  }
}
</style>
rep = {"a", c , "a"};
@endebnf


How to color the string background?


Finally:

  • How could we help you?
  • Is it necessary to open new Wanted Feature Request?

Thanks for your work,
Regards,
Th.

commented Oct 6, 2022 by plantuml (294,960 points)

> I think your looping back line is going over your continue through line?

Yes, exactly.

The two following images (you can click on them) have the same meaning.

  means the same as 

I understand that the first one is more standard. However, people do interpret both images as zero or more 'c'.

The second image is more compact: there are fewer horizontal lines. So in complex constructions, it may be easier to read.

commented Oct 6, 2022 by plantuml (294,960 points)

> - Why not to put the loop below of the label `a, c, a`?

The issue with a "loop" below is that when you have a complex construction like in

it makes a very long back loop which is not very readable. (We can make another !pragma to show you if you want)

>I appreciate to have a pragma to my first name! But....

Yes, this will be renamed, but we are quite out of inspiration right now. Let's first find the default rendering. We'll see if we keep those pragmas later.

> How could we help you?

Maybe we can now work here http://alphadoc.plantuml.com/doc/dokuwiki/en/ebnf-discussion

This thread is really too long :-)

commented Oct 7, 2022 by Todd Musheno (2,680 points)
The second image is basically not readable, as the lines overlap IMHO.
commented Oct 7, 2022 by Todd Musheno (2,680 points)
I think we are golden for first draft if you make the theo thing the default.
commented Oct 11, 2022 by Todd Musheno (2,680 points)
With theo, and notes done, we are down to notations before documentation.

Although restrictions are a nice-to-have, for me it's not a requirement... others' mileage may vary.

I think notations will be an easy add.
asked Oct 11, 2022 in Wanted features by The-Lu (63,920 points)
edited Oct 12, 2022 by The-Lu
EBNF - Allow full repetition management with repetition-symbol "*"
asked Oct 16, 2022 in Wanted features by The-Lu (63,920 points) EBNF - Allow empty definition
asked Dec 4, 2022 in Wanted features by The-Lu (63,920 points) EBNF - Perserve the order of element
...