Hacker News
Cwerg: C-like language that can be implemented in 10kLOC (github.com/robertmuth)
68 points by nateb2022 on March 26, 2024 | 50 comments


Nice name :)

(For all the non-German speakers: "Zwerg" means "dwarf" and "c" is pronounced the same as "z".)


C-like and S-expressions? I guess my definition of a C-like language is a bit off.


They aren't incompatible.

My own toy language which is intended to be a C replacement for myself is prototyped entirely in s-expressions.

What else would you use to represent a syntax tree?


Also with some tweaking you can push s-expr pretty far WRT readability. Here is an example of a cwerg program using "enhanced" s-exprs:

https://github.com/robertmuth/Cwerg/blob/master/FrontEnd/Tes...

And here is the same program in the tentative concrete syntax:

https://github.com/robertmuth/Cwerg/blob/master/FrontEnd/Con...


I had skimmed through the readme and then went straight to the example folder so missed the fact that this is not supposed to be the final concrete syntax, makes much more sense now!


Honestly I'm doing pretty much the same thing; been through several iterations of AST without yet deciding on a syntax, other than "Easy for C Programmers" and "Doesn't look like someone ate alphabet soup and vomited it all over the screen".


> What else would you use to represent a syntax tree?

For unit tests I've used pretty-printed JSON. Text editors syntax highlight it plus you can leverage an off-the-shelf JSON library rather than writing your own s-expressions serializer (not that serializers are difficult to write or anything, but just as a convenience).
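A minimal sketch of that approach, assuming a made-up node shape (dicts with "op" and "args" keys; none of this comes from Cwerg). Sorting keys and fixing the indent keeps the dump stable, so it can be compared against a golden file:

```python
import json

# Hypothetical AST for `a[i] = i + j`; the node layout is an assumption
# chosen for illustration, not any particular compiler's format.
tree = {
    "op": "set",
    "args": [
        {"op": "aref", "args": ["a", "i"]},
        {"op": "+", "args": ["i", "j"]},
    ],
}


def dump_ast(node) -> str:
    """Serialize an AST to stable, pretty-printed JSON for golden-file tests."""
    return json.dumps(node, indent=2, sort_keys=True)


# Round-tripping through the off-the-shelf library gives back the same tree.
assert json.loads(dump_ast(tree)) == tree
print(dump_ast(tree))
```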


> What else would you use to represent a syntax tree?

Different syntax for different concepts: easier on the eye, quicker to understand and to troubleshoot.


myself i'm partial to rose trees

rather than

    (set (aref a i) (+ i j))
you have

    set(aref(a i) +(i j))
and no difference between i and i(), and only one type of node, with a tag and zero or more kids, instead of separate node types for having kids and having tags

the rose tree model feels like a slightly closer fit to the needs of abstract syntax trees, and although it isn't simpler than sexps, it isn't more complicated either
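a rough python sketch of that single node type (the names Node, show_rose and show_sexp are made up here), printing the same tree both ways:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """The only node type: a tag plus zero or more kids."""
    tag: str
    kids: list["Node"] = field(default_factory=list)


def show_rose(n: Node) -> str:
    # A leaf prints as its bare tag, so `i` and `i()` are the same node.
    if not n.kids:
        return n.tag
    return f"{n.tag}({' '.join(show_rose(k) for k in n.kids)})"


def show_sexp(n: Node) -> str:
    # Same tree, but the tag moves inside the parentheses.
    if not n.kids:
        return n.tag
    return "(" + " ".join([n.tag] + [show_sexp(k) for k in n.kids]) + ")"


tree = Node("set", [
    Node("aref", [Node("a"), Node("i")]),
    Node("+", [Node("i"), Node("j")]),
])
print(show_rose(tree))  # set(aref(a i) +(i j))
print(show_sexp(tree))  # (set (aref a i) (+ i j))
```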

another possibility is the ml approach where juxtaposition denotes function application but the functions are curried so they only ever take a single argument, which usually looks exactly the same as sexps but conceptually associates the other way

    set (aref a i) (+ i j)
the only difference is that this is equivalent, which i think is worse in this context:

    set ((aref a) i) ((+ i) j)
(note that ml only uses this approach for expressions to evaluate. for data, such as asts, it uses the rose tree approach)
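to make the "associates the other way" point concrete, here is a small python sketch (the helper names and the "app" tuple encoding are assumptions) that left-nests each application the way curried juxtaposition does, and prints the result fully parenthesized:

```python
from functools import reduce


def apply_curried(head, *args):
    """Left-nest applications: f a b becomes ((f a) b)."""
    return reduce(lambda fn, arg: ("app", fn, arg), args, head)


def show(term) -> str:
    # Atoms are plain strings; applications are ("app", fn, arg) tuples.
    if isinstance(term, str):
        return term
    _, fn, arg = term
    return f"({show(fn)} {show(arg)})"


expr = apply_curried(
    "set",
    apply_curried("aref", "a", "i"),
    apply_curried("+", "i", "j"),
)
print(show(expr))  # ((set ((aref a) i)) ((+ i) j))
```

every node here has exactly two kids, which is why it is an awkward fit for an ast where `set` really has two operands of equal standing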

regardless, the semantics are a lot more important than the syntax


Apparently the actual surface syntax is not yet decided and the S-expression abstract syntax is currently used as a stand-in.


That is correct. The concrete syntax will be Python-like. Some examples are here:

https://github.com/robertmuth/Cwerg/tree/master/FrontEnd/Con...


There is precedent for a Python-like but C-level language in Nim; will be interested to see how yours develops too. Good luck!


IIRC, wasn't this originally also the case for Lisp?


I wonder how hard it would be to keep the S-expression syntax when the new one is implemented. As a C and Lisp programmer I'm intrigued by the possibilities - C-family syntax being IMO more readable, but Lisp providing unparalleled (in my experience) metaprogramming possibilities.


I'm also a Lisp and C programmer (want to get better at Forth too).

I think Lisp and C (not C++) share an important characteristic that few other languages share: being low-level.

C is low-level due to minimal abstractions from the hardware.

Lisp is low-level due to minimal abstractions from the AST.

People wanting to program very close to hardware feel comfortable with C. People wanting to program very close to the compiler feel comfortable with Lisp.


I am planning to make the s-expr form the on disk format.


Does that mean you are planning a specific editor for it?


I was thinking more of some light weight tools that convert between the two representations and which can be hooked into existing editors.


I see. And this led me to a question: what can make a language expressed in S-expressions not a lisp?


Any language can be expressed in S-expressions. A translation of the syntax tree to sexprs is often used in parser debugging. And Lisp doesn’t have to be expressed in S-expressions (see Dylan, or M-expressions, a syntax for Lisp that predates S-expressions).

Lisp traditionally has these features, but they’re separate features:

* There is a rich syntax for literal data structures

* Code is directly exposed as data structures so it can be manipulated with code (macros)

* Code is written as literal data structures so it has no separate syntax
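As an illustration of the parser-debugging use mentioned above, here is a sketch that dumps a syntax tree as an S-expression. It uses Python's standard `ast` module as the parser purely for convenience; the `to_sexpr` helper and its output format are assumptions made up for this example:

```python
import ast


def to_sexpr(node: ast.AST) -> str:
    """Render a Python AST node as an S-expression (debugging aid only)."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.BinOp):
        op = type(node.op).__name__
        return f"({op} {to_sexpr(node.left)} {to_sexpr(node.right)})"
    # Fallback: the node's type name followed by its child nodes.
    kids = " ".join(to_sexpr(k) for k in ast.iter_child_nodes(node))
    name = type(node).__name__
    return f"({name} {kids})" if kids else f"({name})"


expr = ast.parse("1 + x * 2", mode="eval").body
print(to_sexpr(expr))  # (Add 1 (Mult x 2))
```

The language being parsed is still Python; only the dump is in S-expression form, which is the sense in which any language can be expressed that way.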


> Lisp traditionally has these features

Lisp traditionally has a two-stage syntax

1) S-expressions

2) Lisp

S-Expression level has a syntax to describe data: lists, conses, numbers, symbols, strings, arrays, ...

Lisp syntax is defined on top of that, as s-expressions: variables, function calls, macro calls, special forms (using quote, let, if, progn, setq, catch, throw, labels, flet, declare, ...), lambda expressions. Each macro also can implement syntax.
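A toy sketch of the two stages, written in Python rather than Lisp, with many simplifications (integers and symbols only, no strings, no error handling for unbalanced input). The function names and the small set of special forms are chosen for illustration:

```python
import re


def read(text: str):
    """Stage 1: turn text into nested lists of atoms; knows nothing about Lisp."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(parse())
            pos += 1  # consume the closing ")"
            return items
        if tok == ")":
            raise SyntaxError("unexpected )")
        try:
            return int(tok)
        except ValueError:
            return tok  # anything else is treated as a symbol

    return parse()


# Subset of the special forms listed above, for illustration.
SPECIAL_FORMS = {"quote", "let", "if", "progn", "setq"}


def classify(form) -> str:
    """Stage 2: interpret already-read data as Lisp syntax."""
    if isinstance(form, int) or form == []:
        return "literal"
    if isinstance(form, str):
        return "variable"
    if isinstance(form[0], str) and form[0] in SPECIAL_FORMS:
        return "special form"
    return "function or macro call"


data = read("(if (> x 0) x (- x))")
print(data)            # ['if', ['>', 'x', 0], 'x', ['-', 'x']]
print(classify(data))  # special form
```

Stage 1 would happily read `(berlin hamburg munich)` as data; only stage 2 cares whether a list is valid Lisp code.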


Thanks. Can you please elaborate a bit on what is meant by "rich syntax for literal data structures"? I'm not a Lisp pro, just started studying it.


That's nothing special to Lisp. Literal data structures means that one can write down data objects like lists, arrays, vectors, numbers (floats, integers, rational, complex, ...), symbols, strings, characters, records/structures, ...

Common Lisp for example:

    (berlin hamburg munich cologne)   ; a list of symbols
    #(berlin hamburg munich cologne)  ; a vector of symbols
    (1.3 1.3d6 13 #c(1 3) 1/3)        ; a list of numbers
    "Hello World!"                    ; a string
    #\H                               ; the character H
    #S(CITY :NAME BERLIN              ; a structure (aka record) of type CITY
            :COUNTRY GERMANY)
    #*1010101111                      ; a bitvector
    #2A((BERLIN GERMANY)              ; a 2d array
        (PARIS FRANCE))
Other programming languages have their own syntax for literal data (even for data structures like Unicode characters, hash tables, Unicode strings, decimal numbers, ...). Common Lisp, OTOH, has an extensible reader: one can add new syntax extensions for data structures using so-called reader macros. READ is the function to read s-expressions from text streams. It returns data objects.


> Any language can be expressed in S-expressions.

I'm not sure this is true for the C preprocessor, though, where macros can represent partial structure.


People have come up with S-expression syntaxes for C, basically to be able to write C macros in lisp. Here's one: https://voodoo-slide.blogspot.com/2010/01/amplifying-c.html


Couple of others:

c-mera: https://github.com/kiselgra/c-mera

cmacro: https://github.com/eudoxia0/cmacro

(This one is implemented in Common Lisp for its semantics, but doesn't use an S-exp surface syntax for the code.)

sxc: https://github.com/burtonsamograd/sxc

(incomplete)

MetaC: https://github.com/mcallester/MetaC

(References Lisp in readme; doesn't use it for implementation or notation, but references ideas. Source code seems to be a core of .c files, and the rest self-hosted in its own .mc language. Somehow provides a REPL.)


Ah, now I see. Thx.


My thoughts too. I was expecting something between Go and Rust in syntax, maybe more terse.


That was also my first thought after looking at the examples.


For what it's worth you can implement a C compiler in under 10kLOC. Chibicc is only a few thousand lines [1]. There is also Cake [2] and tinycc [3] which are both relatively small.

[1] https://github.com/rui314/chibicc

[2] https://github.com/thradams/cake

[3] https://bellard.org/tcc/


The title omits a crucial word.

Cwerg aims to be the best c-like language that can be implemented in 10kLOC. Obviously, best is highly subjective but I want to improve on C not just re-implement it.


Author here - happy to answer questions


> Author here - happy to answer questions

What's the goal with this? I mean, where are you going with this?

Research language? Scratching an itch? Learning exercise?

All good answers, IMHO.

But ... is there a real gap in the needs of programmers that Cwerg is attempting to address? If there is, can you explain a little more the actual gap being addressed?


Cwerg started off as a "Covid side project" with no particular goal other than seeing how far you can push something that is simple enough to be maintained by a single person. Oberon was an inspiration and I'd like to make Cwerg self hosted and maybe work on an OS written in it some day.

In terms of features, Cwerg is roughly in the same space as Odin, Zig, Hare, etc. (see https://github.com/robertmuth/awesome-low-level-programming-...) But I am much more willing to sacrifice compatibility for simplicity, e.g. no shared libs, no varargs, no linking with non-Cwerg code etc.

I also feel that compilation speed has not received as much attention in the compiler space as it deserves. Go-lang was one of the first to highlight this recently.


I too am curious about this. I've seen other languages with similar features, such as Odin and C3, but they used LLVM, and now Odin is trying to use TildeBackend (from the Cuik C compiler) as an alternative. What I'm really interested in is whether the RISC-like IR is an alternative to LLVM.


This section contains a list of features that are unlikely to be added to the Cwerg backend:

https://github.com/robertmuth/Cwerg?tab=readme-ov-file#inten...

If you can live without them, it can be a replacement for LLVM.


Looking over the Readme, this sounds a lot like Zig. What languages have you taken the most inspiration from?


Hard to say. I created this for inspiration:

https://github.com/robertmuth/awesome-low-level-programming-...


Thanks for sharing. Good work. It seems you've done a good amount of work on this, with the multiple backends.

From your other comments here it seems you're emphasising "understandability by one person", which, as you mentioned, Oberon was designed for.

It reminds me of Taylor Troesh's wigwams

https://taylor.town/pardon-2023#wigwams

I need to document my JIT compiler's design which is really straightforward.

I've been loosely reading qbe's sourcecode but I need to go through the bibliography to understand the code more. At the moment it's all unfamiliar and not understandable.


I love projects like this, but why S-expressions? Optimizing your frontend for the computer to parse (instead of the human) is the wrong tradeoff IMO unless you really really need an ultra-powerful macro language.

Also, I'd call it "C runtime" not "C-like" if you're not going to have C-like syntax

Just my 2c


The concrete syntax is WIP and will be Python-like. I posted some links early so you can see what it looks like. I do plan on having the s-expr be the on-disk format, though.

You are right in that the surface syntax will not be c-like but the features (or lack of them) will be.


This was submitted four days ago. But the comments and submission are less than six hours old? I also remember reading these comments. Got some major deja vu (spelling).


That's HN's "second chance" feature. The relative timestamps are faked to look new so the discussion doesn't look stale. The real time stamps are still displayed on hover / in the alt text. I'm not sure why this was second chanced four days later considering that it spent a fair amount of time on the front page. Maybe a misclick.

e: misremembered, it apparently wasn't on the front page: https://hnrankings.info/39786663/


Merely passing by to note an entire compiler, editor, IDE and OS implemented in under half that many LOC.

http://www.projectoberon.net/


Cwerg tries to hit a different sweet spot: It has an intentionally bigger LOC/complexity budget to allow for more features and better performance.

Instead of having the whole system be understandable by a single person, each major component should be.

In fact the 10kLOC applies to the frontend and each backend separately which I think is fair as most compiler writers use off the shelf backends like QBE, LLVM or even C.


All right.

I am no expert in compilers, but how does it compare to, say...

• TinyC – https://bellard.org/tcc/

• Small-C – https://en.wikipedia.org/wiki/Small-C

• Smaller C – https://github.com/alexfru/SmallerC

…?


The title omits a crucial word.

Cwerg aims to be the best c-like language that can be implemented in 10kLOC. Obviously, best is highly subjective but I want to improve on C not just re-implement it.


Or, for that matter, the original 70's K&R compiler, right?

It's probably at the right size for exploring some language ideas.


The modules on this site are indeed less than 10kSLOC (about 9070 SLOC), but we have to consider that in Oberon, in contrast to C, many statements are usually written on the same line; there seem to be at least two statements on each line, and three to four seem to be common. So you can multiply the SLOC by at least two when comparing with a common C program. Nor would I call it an IDE, since it lacks most features which we have associated with IDEs since the late nineties (e.g. no source-level debugger, no syntax coloring or cross-referencing, etc.).
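A crude way to check a statements-per-line estimate like this, sketched in Python. The semicolon heuristic and the two-line Oberon-style sample are assumptions for illustration, and comments and string literals are ignored:

```python
def count_lines_and_statements(source: str) -> tuple[int, int]:
    """Count non-blank lines and a rough number of statements.

    Assumption: a statement is approximated by a non-empty,
    semicolon-separated segment of a line.
    """
    lines = [ln for ln in source.splitlines() if ln.strip()]
    statements = sum(
        1 for ln in lines for seg in ln.split(";") if seg.strip()
    )
    return len(lines), statements


# Made-up sample in the dense one-line style described above.
sample = "i := 0; j := 1; k := 2;\nWHILE i < n DO INC(i); s := s + i END\n"
sloc, stmts = count_lines_and_statements(sample)
print(sloc, stmts)  # 2 5

# Applying the "at least two statements per line" factor to the quoted figure:
print(9070 * 2)  # 18140
```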


Fair. :-)

It was a somewhat facetious comment, TBH.



