Announcing I-DUNNO 1.0 and web-i-dunno

It’s hard to believe it’s already a year since the release of RFC 8771 (The Internationalized Deliberately Unreadable Network NOtation), which for me at least made me think about IP addresses in a whole new way.

So, it seems fitting for the anniversary to be able to release proper support for this standard in the Rust universe, with Rust I-DUNNO version 1.0 released. You can find it on Rust’s crates.io at crates.io/crates/i-dunno and the API documentation is at docs.rs/i-dunno.

Also, because for a standard like this to receive the wide adoption it deserves, it’s important that young people have a chance to interact with it, playing with encodings to get a real feel for what it’s like to use in practice, I’m proud to announce the I-DUNNO Creator. On that page you can enter an IP address (IPv4 or IPv6) and see it transformed immediately into a candidate I-DUNNO, with live information about the Confusion Level of the I-DUNNO, as specified in the standard. You can find the source code for the I-DUNNO Creator in the web-i-dunno repo.

The I-DUNNO Creator is built on the Rust package, making use of Rust’s highly-developed WASM support to compile the code into a form that works naturally in a web browser.

I hope that by offering both systems programmers and the young people of today and their new-fangled web sites the opportunity to create I-DUNNOs, I can contribute a little to spreading the word about deliberately unreadable notations to new audiences.

Note: the current implementation is limited to generate only I-DUNNOs with no padding bits. As specified in the standard, I-DUNNOs may end with arbitrary padding, and adding this functionality to rust-i-dunno is left as an exercise for the reader: merge requests welcome!

Automatically filling in the UK COVID test results page with Selenium IDE

Lots of people are filling in the extremely detailed UK government COVID test result page twice every week.

It asks you to fill in a very large list of details, most of which are the same every time, but it doesn’t remember what you typed last time.

I didn’t want to write a Python script or similar to enter my results, because I wanted to check I’d done it right, and because there is a captcha at the end that is clearly intended to prevent automation like that.

However, with a Selenium IDE script, I can drive my browser, watching what it does and checking the input, and manually filling in the captcha and final double-check page.

In case it’s helpful, here is the script I created: report-covid-test.side.

You can create one for each child if you have several, filling in school name, NHS number, names, date of birth etc. in the script and re-using it, modifying it each time to enter the bar code number for the test itself.

To use it you’ll need the Selenium IDE plugin for firefox, or Selenium IDE plugin for another browser.

I’d recommend loading this script into the Selenium IDE plugin in Firefox, looking through it and editing the values that say “ENTER…HERE”, then clicking Run Script and watching it fill in values.

It doesn’t actually submit the result, so you can always check and modify it manually if something doesn’t work out, before clicking the last couple of buttons to submit.

Questions about RFC 8771

During my work on RFC 8771 The Internationalized Deliberately Unreadable Network NOtation (I-DUNNO) I have come across a number of questions. I am documenting them here so I can send them to the authors and try to improve my understanding of the intention.

This is an excellent RFC, and I thank the authors for their efforts in creating it.

1. Non-printable characters

In 4.2. Satisfactory Confusion Level, the RFC states that encodings may be deemed Satisfactory if they contain ‘At least one non-printable character’ (as well as one other condition in this section).

Both of the existing implementations that I know of ( and ) interpret “printable” to mean the same as Python’s isprintable() function: that is Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable.

However, the definition of this function may be rather Python-specific, since its intention appears to be related to language internals like the repr function.

It would be helpful to find out exactly what is meant by “non-printable character” in the RFC.

2. Are Modifier Symbols, Symbols?

Also in section 4.2, the RFC mentions ‘”A character from the “Symbol” category’.

The Python implementation excludes Modifier Symbols from its definition. I believe this is incorrect, and have logged an issue on the topic: Some Symbol characters not recognised as such.

It would be helpful to have clarification on this point.

3. What does “different directionalities” mean?

Unicode classifies characters into several Bidi_Classes (for example, U+CED6E is Left_To_Right). In section 4.3. Delightful Confusion Level, the RFC refers to ‘Characters from scripts with different directionalities’.

As far as I can see there are two possible interpretations of this phrase:

  1. The encoding should contain characters from at least two different Bidi_Classes, or
  2. The encoding should contain characters that are both left-to-right and right-to-left in direction, either weakly or strongly.

Both current implementations interpret this statement like number 1, but I suspect the intention may actually be something more like number 2.

If number 2 was meant, I think that would mean ignoring characters with Neutral directionality, and treating weakly directional characters as the same directionality as strongly directional ones.

4. What is a Confusable character?

Section 4.3 mentions ‘Character classified as “Confusables”‘. Both implementations interpret this quite loosely, as meaning something like ‘the encoding contains any character or substring which might be confused with any other character or substring’.

This means a lot of “normal” characters are included: all of the ASCII digits, and many of the Latin letters.

Was this is intention?

That’s all my questions. It has been great fun working on this RFC.

Announcing Rust I-DUNNO

At the ACCU Conference last week I learned about RFC 8771 The Internationalized Deliberately Unreadable Network NOtation (I-DUNNO) from Jim Hague, and thought it would be fun to knock up a Rust implementation.

The project is here: gitlab.com/andybalaam/rust-i-dunno and the docs are published at https://docs.rs/i-dunno.

It’s not done yet, but encoding an IP address as I-DUNNO appears to be working:

$ i-dunno 216.58.205.46
lYԮ

$ i-dunno 216.58.205.46 | hexdump -C
00000000  db 81 6b 1a 2e 0a                                 |..k...|

Decoding is still to be done.

The implementation is seriously slow at the moment, so I am looking forward to improving it.

I am hoping it is reasonably correct – I based it on the existing Python I-DUNNO implementation and in the process found several potential bugs in that, and created some merge requests to fix bugs and help with testability.

Speaking of testability, I am building up a collection of test cases that could be a potential resource for other implementors, and would welcome suggestions of how this could be shared between projects. The examples so far were generated using the Python implementation, and then manually corrected where I found bugs in that, so I do not have 100% confidence that they are correct.

Anyway, have a play, and send patches and feedback!

Limiting the number of open sockets in a tokio-based TCP listener

I learned quite a bit today about how to think about concurrency in Rust. I was trying to use a Semaphore to limit how many open sockets my TCP listener allowed, and I had real trouble making it work. It either didn’t actually work, allowing any number of clients to connect, or the compiler told me I couldn’t do what I wanted to do, because the lifetime of my Semaphore was not 'static. Here’s the journey I took towards working code that I think is correct (feedback welcome).

Motivation

In the tokio tutorial there is a short section entitled “Backpressure and bounded channels” (partway down the Channels page). It contains this statement:

…take care to ensure total amount of concurrency is bounded. For example, when writing a TCP accept loop, ensure that the total number of open sockets is bounded.

Obviously, when I started work on a TCP accept loop, I wanted to follow this advice.

Like many things in my journey with Rust, it was harder than I expected, and eventually enlightening.

The code

Here is a short Rust program that listens on a TCP port and accepts incoming connections.

Cargo.toml:

[package]
name = "tcp-listener-example"
version = "1.0.0"
edition = "2018"
include = ["src/"]

[dependencies]
tokio = { version = ">=1.0.1", features = ["full"] }

src/main.rs:

use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();

    loop {
        let (mut tcp_stream, _) = listener.accept().await.unwrap();
        tokio::spawn(async move {
            let mut buf: [u8; 1024] = [0; 1024];
            loop {
                let n = tcp_stream.read(&mut buf).await.unwrap();
                if n == 0 {
                    return;
                }
                print!("{}", String::from_utf8_lossy(&buf[0..n]));
            }
        });
    }
}

This program listens on port 8080, and every time a client connects, it spawns an asynchronous task to deal with it.

If I run it with:

cargo run

It starts, and I can connect to it from multiple other processes like this:

telnet 0.0.0.0 8080

Anything I type into the telnet terminal window gets printed out in the terminal where I ran cargo run. The program works: it listens on TCP port 8080 and prints out all the messages it receives.

So what’s the problem?

The problem is that this program can be overwhelmed: if lots of processes connect to it, it will accept all the connections, and eventually run out of sockets. This might prevent other things working right on the computer, or it might crash our program, or something else. We need some kind of sensible limit, as the tokio tutorial mentions.

So how do we limit the number of people allowed to connect at the same time?

Just use a semaphore, dummy

A semaphore does exactly what we need here – it keeps a count of how many people are doing something, and prevents that number getting too big. So all we need to do is restrict the number of clients that we allow to connect using a semaphore.

Here was my first attempt:

use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();
    let sem = Semaphore::new(2);

    loop {
        let (mut tcp_stream, _) = listener.accept().await.unwrap();
        // Don't copy this code: it doesn't work
        let aq = sem.try_acquire();
        if let Ok(_guard) = aq {
            tokio::spawn(async move {
                let mut buf: [u8; 1024] = [0; 1024];
                loop {
                    let n = tcp_stream.read(&mut buf).await.unwrap();
                    if n == 0 {
                        return;
                    }
                    print!("{}", String::from_utf8_lossy(&buf[0..n]));
                }
            });
        } else {
            println!("Rejecting client: too many open sockets");
        }
    }
}

This compiles fine, but it doesn’t do anything! Even though we called Semaphore::new with an argument of 2, intending to allow only 2 clients to connect, in fact I can still connect more times than that. It looks like our code changes had no effect at all.

What we were hoping to happen was that every time a client connected, we created _guard, which is a SemaphoreGuard, that occupies one of the slots in the semaphore. We were expecting that guard to live until the client disconnects, at which point the slot will be released.

Why doesn’t it work? It’s easy to understand when you think about what tokio::spawn does. It creates a task and asks for it to be executed in the future, but it doesn’t actually run it. So tokio::spawn returns immediately, and _guard is dropped, before the code that handles the request is executed. So, obviously, our change doesn’t actually restrict how many requests are being handled because the semaphore slot is freed up before the request is processed.

Just hold the guard for longer, dummy

So, let’s hold on to the SemaphoreGuard for longer:

use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();
    let sem = Semaphore::new(2);

    loop {
        let (mut tcp_stream, _) = listener.accept().await.unwrap();
        let aq = sem.try_acquire();
        if let Ok(guard) = aq {
            tokio::spawn(async move {
                let mut buf: [u8; 1024] = [0; 1024];
                loop {
                    let n = tcp_stream.read(&mut buf).await.unwrap();
                    if n == 0 {
                        drop(guard);
                        return;
                    }
                    print!("{}", String::from_utf8_lossy(&buf[0..n]));
                }
            });
        } else {
            println!("Rejecting client: too many open sockets");
        }
    }
}

The idea is to pass the SemaphoreGuard object into the code that actually deals with the client request. The way I’ve attempted that is by referring to guard somewhere within the async move closure. What I’ve actually done is tell it to drop guard when we are finished with the request, but actually any mention of that variable within the closure would have been enough to tell the compiler we want to move it in, and only drop it when we are done.

It all sounds reasonable, but actually this code doesn’t compile. Here’s the error I get:

error[E0597]: `sem` does not live long enough
  --> src/main.rs:12:18
   |
12 |         let aq = sem.try_acquire();
   |                  ^^^--------------
   |                  |
   |                  borrowed value does not live long enough
   |                  argument requires that `sem` is borrowed for `'static`
...
29 | }
   | - `sem` dropped here while still borrowed

What the compiler is saying is that our SemaphoreGuard is referring to sem (the Semaphore object), but that the guard might live longer than the semaphore.

Why? Surely sem is held within a scope that includes the whole of the client-handling code, so it should live long enough?

No. Actually, the async move closure that we are passing to tokio::spawn is being added to a list of tasks to run in the future, so it could live much longer. The fact that we are inside an infinite loop confused me further here, but the principle still remains: whenever we make a closure like this and pass something into it, the closure must own it, or if we are borrowing it, it must live forever (which is what a 'static lifetime means).

The code above passes ownership of guard to the closure, but guard itself is referring to (borrowing) sem. This is why the compiler says that “sem is borrowed for 'static“.

Wrong things I tried

Because I didn’t understand what I was doing, I tried various other things like making sem an Arc, making guard an Arc, creating guard inside the closure, and even trying to make sem actually have 'static storage by making it a constant. (That last one didn’t work because only very simple types like numbers and strings can be constants.)

Solution: Share the Semaphore in an Arc

After what felt like too much thrashing around, I found what I think is the right answer:

use std::sync::Arc;
use tokio::io::AsyncReadExt;
use tokio::net::TcpListener;
use tokio::sync::Semaphore;

#[tokio::main]
async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();
    let sem = Arc::new(Semaphore::new(2));

    loop {
        let (mut tcp_stream, _) = listener.accept().await.unwrap();
        let sem_clone = Arc::clone(&sem);
        tokio::spawn(async move {
            let aq = sem_clone.try_acquire();
            if let Ok(_guard) = aq {
                let mut buf: [u8; 1024] = [0; 1024];
                loop {
                    let n = tcp_stream.read(&mut buf).await.unwrap();
                    if n == 0 {
                        return;
                    }
                    print!("{}", String::from_utf8_lossy(&buf[0..n]));
                }
            } else {
                println!("Rejecting client: too many open sockets");
            }
        });
    }
}

This code:

  • Creates a Semaphore and stores it inside an Arc, which is a reference-counting pointer that can be shared between tasks. This means it will live as long as someone holds a reference to it.
  • Clones the Arc so we have a copy that can be safely moved into the async move closure. We can’t move sem in to the closure because it’s going to get used again the next time around the loop. We can move sem_clone in to the closure because it’s not used anywhere else. sem and sem_clone both refer to the same Semaphore object, so they agree on the count of clients that are connected, but they are different Arc instances, so one can be moved into the closure.
  • Only aquires the SemaphoreGuard once we’re inside the closure. This way we’re not doing something difficult like borrowing a reference to something that lives outside the closure. Instead, we’re borrowing a reference via sem_clone, which is owned by the closure which we are inside, so we know it will live long enough.

It actually works! After two clients are connected, listener.accept actually opens a socket to any new client, but because we return almost immediately from the closure, we only hold it open very briefly before dropping it. This seemed preferable to refusing to open it at all, which I thought would probably leave clients hanging, waiting for a connection that might never come.

Lifetimes are cool, and tricky

Once again, I have learned a lot about what my code is really doing from the Rust compiler. I find this stuff really confusing, but hopefully by writing down my understanding in this post I have helped my current and future selves, and maybe even you, be clearer about how to share a semaphore between multiple asynchronous tasks.

It’s really fun and empowering to write code that I am reasonably confident is correct, and also works. The sense that “the compiler has my back” is strong, and I like it.