Ideas on how lexing will work in Pepper3

I am trying to practice documentation-driven development in Pepper3, so every time I start on an area, I will write documentation explaining how it works, and include examples that are automatically verified during the build.

I’ve started work on lexing, since you can’t do much before you do that, but in fact, of course, I need to have a command line interface before I can verify any of the examples, so I’m working on that too.

Lexing is the process that takes a stream of characters (e.g. from a file) and turns it into a stream of “tokens” that are chunks of code like a variable name, a number or a string. (There is more on lexing in my mini programming language, Cell.)

My thoughts so far about lexing are in lexing.md, and current ideas about command line interface are at command_line.md. All very much subject to change.

Headlines:

  • Ordinary programmers can write their own lexing rules.
  • Operators (functions like “+” that find their arguments on their left and right, instead of between brackets like normal functions) are defined at the lexing phase, so any symbol (e.g. “in”) can be an operator if you want.
  • Anything you might want to do with a pepper program, including running it, compiling it, packaging it for an distribution system, should be available as a sub-command of the main pepper3 command line.
  • The command is “pepper3”, never “pepper”. If a new, incompatible version comes out, it will be called “pepper4”, and they will be parallel-installable, with no confusion.

Questions and answers about Pepper3

Series: Examples, Questions

My last post Examples of Pepper3 code was a reply to my friend’s email asking what it was all about. They replied with some questions, and I thought the questions and answers might shed some more light:

Questions!

Brilliant ones, thanks.

In general though you’ve said a lot about what Pepper can do without giving design decisions.

Yep, total brain dump.

Remind me again who this language is for :)

It’s a multi-paradigm (generic, functional, OO) language aimed at application programmers who want:

  • “native” performance on their chosen platform (definitely including actual native machine code). This is inspired by C++.
  • easy deployment (preferably a single binary containing everything, with an option to link most dependencies statically), including packaging of installers for major OSes. This is inspired by C++, and the pain of C++.
  • perfect flexibility for creating types – “meta-programming” is just programming. Things you would have done using code generation (e.g. generating a class hierarchy from an XSD) are done by running arbitrary code at compile time. The powerful type system is inspired by Haskell and the book “Modern C++ Design”, and the meta-programming is inspired by Lisp.
  • Simple memory management without GC through ownership. This is inspired by modern C++, and then Rust came along and implemented it before I could, thus proving it works. However, I would remove a lot of the functionality in Rust (lifetimes) to make it much simpler.
  • Strong support for functional programming if you want it. This is inspired by Haskell.
  • The simplest possible core language, with application programmers able to expand it by giving them the same tools as the language designers – e.g. “for” is just a function, so you can make your own. I am hoping I can even make “class” a function. This is inspired by Lisp, and oppositely-inspired by Java.
  • Separation between the idea of Interfaces, which I think I will call “type specifiers” (and will allow arbitrary code execution to determine whether a type satisfies the requirements) and structs/classes, allowing us to make new Interfaces and have old code satisfy them, meaning we can do generic stuff with e.g. ints even if
    no-one declared that “class Int : public Quaternion” or whatever.
  • Lots of “nudges” towards things that are good: by default things will be functional and immutable – you will have to explicitly say if you want to use more dangerous constructs like side effects and mutable values.
  • No implicit conversions, or really anything happening without you saying so.

Can you assign floats to ints or vice versa?

Yes, but you shouldn’t.

If you’re setting types in code at the start of a file, is this only available in the main file? Are there multiple files per program? Can
you have libraries? If so, do these decide the functionality of their types in the library or does this only happen in the main file?

I haven’t totally decided – either by being enforced, or as a matter of style, you will generally do this once at the beginning of the program (and choose on the compiler command line to do it e.g. the debug way or the release way) and it will affect all of your code.

Libraries will be packaged as Pepper3 source code, so choices you make of the type of Int etc. will be reflected through the whole dependency tree. Cool, huh?

This is inspired by Python.

Can you group variables together into structs or similar?

Yes – it will be especially easy to make “value types”, and lots of default methods will be provided, that you will be strongly encouraged to use – e.g. copy and move operations. This is inspired by Elm.

Why are variables immutable by default but mutable with a special syntax? It’s the opposite of C++ const, but why that way around?

This is one of the “nudges” – immutable stuff is much easier to think about, and makes parallel stuff easier, and allows optimisations and so on, so turning it on by default means you have to choose to take the bad path, and are inclined to take the virtuous one. This is inspired by Haskell and Rust.

Why only allow assignments, function calls and operators? I’m sure you have good reasons.

To be as simple as possible, so you only have those things to learn and the rest can be understood by just reading the code. This is inspired by Python.

I wrote more of my (earlier) thoughts in this 4-post series, which is better thought through: Goodness in Programming Languages

Examples of Pepper3 code

Series: Examples, Questions

I have restarted my effort to make a new programming language that fits the way I like things. I haven’t pushed any code yet, but I have made a lot of progress in my head to understand what I want.

Here are some random examples that might get across some of the ways I am thinking:


// You code using general types like "Int" but you can set what
// they really are in the code (usually at the beginning), so
// if you plan to use native ints in the production code, it's
// a good idea to use:
Int = CheckedNativeInt;
// while in dev, since it will crash at runtime if you overflow.

// Then, in production when you're sure you have no errors,
// switch to an unchecked one:
Int = NativeInt;

// But, if you prefer correctness over efficiency, you can use
// mathematical integer that never overflows:
Int = ArbitrarySizeInt;


// Variables are immutable by default, so:
Int x = 4;
x = 3;      // this is a compile error


// But this is OK
Mutable(Int) y = 6;
y = y + x;

// Notice that you can call functions that return types that you
// then use, like Mutable(Int) here.

// Generally, code can run at either compile time or run time.
// Code to do with types has to run at compile time.
// By default, other code runs at run time, but you can force
// it to run early if you want to.


// A main method looks like this - you get hold of e.g. stdout through
// a World instance - I try to avoid any global functions like print, or
// global variables like sys.stdout.

Auto main =
{:(World world)->Int
    //...
};

// (Although note that Int, String etc. actually are global variables,
// which is a bit annoying)

// I wish the main method were simpler-looking.  The only saving grace
// is that for simple examples you don't need a main method -
// Pepper3 just calculates the expression you provide in your file and
// prints it out.


// Expressions in curly brackets are lambda functions, so:

{3};

// is a function taking no arguments, returning 3, and:

{:(Int x)
    x * 2
};

// is a function that doubles a value.

Obviously, we can tie functions to names:

Auto dbl =
    {:(Int x)
        x * 2
    };

// Meaning we can call dbl like this:
dbl(4);

// Auto is a magic word to say ("use type inference"), so
// this is equivalent to the above:

fn([Int]->Int) dbl =
    {:(Int x)
        x * 2
    };


// Because {} makes an anon function, things like "for" can be
// functions instead of keywords.

for(range(3), {:(Int x)
    world.stdout.println(to(String)(x));
});


// As far as possible, Pepper3 will only contain assignment statements:
String s = "xx";

// and expressions containing function calls and operators:
dbl(3) + 6;


// This means we can make our own constructs like a different type of
// for loop, which would need a new keyword in some languages:

Auto parallel_for = import(multiprocess.parallel_for);

What is a string?

Most programming languages have some wrinkles around unicode and strings*. In my ficticious language Pepper, there are no wrinkles of any kind, and everything is perfect.

*E.g. JavaScript, Java, Haskell, Ruby, Python.

There are several key concepts. The most important are an interface AnyString and the variable** String which is what you should use when you are writing code with strings.

**String is a variable that refers to a type, so you just use it like a type and don’t worry about it.

interface AnyString
{
    def indexable(CodePoint) code_points( implements(AnyString) string )
}

In Pepper an interface can describe what free functions exist as well as what member function a class must have, and here we just require that a code_points function exists that gives us a collection of CodePoint objects that may be indexed (i.e. is random-access).

When your Pepper program starts, the String variable will refer to something that implements this interface, and probably some other interfaces too. Most Pepper programs will use a String that is implemented as an array of bytes representing a string in UTF-8, but the programmer doesn’t need to be aware of that, and in a situation where something different is needed (e.g. where we know lots of non-Latin characters will be used and UTF-16 will be more efficient) String can be set to something different in the configuration settings used by the compiler.

When you want to do something with a string, there will be functions that only rely on the AnyString interface and deal with CodePoints internally, but there will be other overloads that are potentially more efficient, for example there are two versions of the standard print function:

def void print( implements(AnyString) string )
def void print( NativeUtf8String string )

The NativeUtf8String class is implemented as a std::string in the C++ code emitted by the Pepper compiler, and the most efficient way to represent an array of bytes when compiling onto other platforms, so the version of print that uses it can be quite efficient.

Because all these types are known at compile time, the C++ code generated by the Pepper compiler can use the native types directly (and be efficient), even though the programmer is writing code using just the AnyString and String types, meaning their code can be adapted to other platforms by using a different configuration.

The Pepper environment exposes standard-out and standard-in as UTF-8 streams, and takes care of converting to the platform encoding for you (at runtime).