Setting the text selection in a browser: just use setBaseAndExtent

The Selection API is confusing and weird.

But, here’s what I’ve discovered: just use setBaseAndExtent, and when (rarely) needed, extend.

Summary

Every selection in a browser consists of:

  • an “anchor” – the beginning, where you started dragging, and
  • a “focus” – the end, where you stopped dragging, and where your cursor is

To select some text in a browser, find the from and to DOM nodes you want, and how far through them you want to be, and set the selection like this:

const sel = document.getSelection();
sel.setBaseAndExtent(from_node, from_offset, to_node, to_offset);

If you want just a cursor with nothing selected, use the same node and offset for both from and to.
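For example, a minimal sketch of placing a bare (collapsed) cursor, assuming node and offset are the position you want:

const sel = document.getSelection();
sel.setBaseAndExtent(node, offset, node, offset);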

The Internet has a lot of recipes that involve creating Ranges and adding them to the Selection, but there is no need to do that.

Finding the from and to nodes

When the locations are inside a text node in the DOM, it’s easy to find the from and to nodes – they are just the text nodes, and the offsets are how many UTF-16 code units through the text you want to be. [Not sure what a code unit is? See my video Interesting Characters].

“But isn’t your cursor always in a text node?”

No – for example, I might have two <br>s next to each other and want the cursor in between them (so it’s on the blank line). In this case there is no text node, and browsers handle it weirdly.

In Firefox, when my cursor is between two br nodes, the node is set to the tag (e.g. a div) that contains the brs, and the offset is the index of the second br node in that element’s list of children.

Other browsers may well do it differently.

So do expect that the nodes may not be text nodes, and where that is the case, the offset will be the index of one of their children.

To select the blank line (between two brs) you can actually specify the br node directly, and give an index number of 0, and it works too. But, it does not match exactly what the browser sets when you click on the blank line, so I can’t guarantee it works exactly the same.
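Here’s a sketch of that, assuming second_br is the second of the two <br> nodes (as noted above, this may not match exactly what the browser records when you click on the blank line):

const sel = document.getSelection();
sel.setBaseAndExtent(second_br, 0, second_br, 0);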

Backwards selections

The fly in the ointment is backwards selections: when you click and drag the mouse from right to left, the selection that is created in the browser is backwards: the anchor is on the right and the focus is on the left.

However, if you call setBaseAndExtent attempting to set up a selection like that, the focus and anchor will be swapped so that the anchor is on the left, and the focus is on the right.

But never fear! We can use extend to force the selection to be what we wanted:

const sel = document.getSelection();
sel.setBaseAndExtent(from_node, from_offset, from_node, from_offset);
sel.extend(to_node, to_offset);

Job done.

Go deeper

For more info, check out my demo page: Browser Selections Inventory.

Overview of broad US data on IT job hiring/firing and quitting

Software developers are employed by organizations and people change jobs, either voluntarily or not; every year a new batch of people join the workforce, e.g., new graduates. Governments track employment activities for a variety of reasons, e.g., tax collection, and monitoring labour supply and demand (for the purposes of planning).

The US Bureau of Labor Statistics publishes a monthly summary of their Job Openings and Labor Turnover Survey. What can be learned about software development employment from this data (description)?

The data starts in December 2000, with each row containing a monthly count of Job Openings, Hires, Quits, Layoffs and Discharges, and totals, along with one of 21 major non-farm industry codes or one of the 5 government codes (the counts are broken out by State). I’m guessing that software developers are assigned the Information code (i.e., 510000), but who is to say that some have not been classified with the code for, say, Construction or Education and health services. The Information code will cover a lot more than just software developers; I’m trading off broad IT coverage for monthly details on employment turnover (software developer specific information is available, but it comes without the turnover information). The Bureau of Labor Statistics makes available a huge quantity of information, and understanding how it all fits together would probably require me to spend several months learning my way around (I have already spent a week or two over the years), so I’m sticking with a prebuilt dataset.

The plot below shows the aggregated monthly counts (i.e., all states) of Job Openings, Hires, Quits, Layoffs and Discharges for the Information industry code (code+data):

Aggregated monthly counts of Job Openings, Hires, Quits, Layoffs and Discharges for the Information industry code, from 2000 to 2022.

The general trend follows the ups and downs of the economy; there is a huge spike in layoffs in early 2020 (the start of COVID), and Job Openings often exceed Hires (which I did not expect).

These counts have the form of a time-series, which leads to questions about repeating patterns in the sequence of values. The plot below shows the autocorrelation of the four employment counts (code+data):

Autocorrelation of Job Openings, Hires, Quits, Layoffs time series for the Information code.

The spike in Hires at 12 months is too large to just be new graduates entering the workforce; perhaps large IT employers have annual reviews for all employees at the same time every year, causing some people to quit and obtain new jobs (Quits has a slightly larger spike at 12 months). Why is there a regular 3-month cycle for Job Openings? The negative correlation in Layoffs at one & two months is explained by companies laying off a batch of workers one month, followed by layoffs in the following two months being lower than usual.

I don’t know much about employment practices, so I won’t speculate any more. Comments welcome.

Are there any interesting cross-correlations between the pairs of time-series?

The plot below shows four pairs of cross correlations (code+data):

Cross correlation between the pairs of time series Hires/Layoffs, Quits/Layoffs, Job Openings/Hires, and Hires/Quits.

Hires & Layoffs shows a scattered pattern of Hires preceding Layoffs (to be expected), and the bottom left shows a pattern of Quits preceding Layoffs (are people searching for steadier employment when layoffs loom?). Top right shows a pattern of Job Openings following Hires (I’m clutching at straws for this; is Hires a proxy for Quits? The cross correlation of Job Openings & Quits does have Job Openings leading), and the bottom right shows the pattern of Hires leading Quits.

Nothing in this analysis surprised me, but then it is rather basic and broad brush. These results are the start of an analysis of the IT employment ecosystem; one that probably won’t progress far because of a lack of data and interest on my part.

ARD & Winterfyleth at the Bread Shed

ARD

Following the nightmare which is parking in Manchester and getting a meal on a Saturday night without booking, we walked into the Bread Shed just as ARD were getting going, minus my vinyl and CDs for signing! The masterpiece which is Take Up My Bones was instantly recognisable, as were composer and multi-instrumentalist Mark Deeks and fellow Winterfyleth band mate Chris Naughton, both on guitar. The latter was centre stage, where surely Deeks should have been?

From the off the band, who were put together to perform an album which was never intended to be performed live, were a little loose, with the drums too prominent and the guitars not clear enough. There appeared to be a lot of retuning necessary, especially from Chris and the lead guitarist, who appeared hidden away a lot of the time. This didn’t really detract from enjoyment of the incredible compositions from the album. By the time the final 10 minutes, consisting of Only Three Shall Know, came along, something had changed: the band was as tight as anything and I wished they could have started again from the beginning. 45 minutes had flown by and I’ll certainly go and see them again.

Winterfyleth


I think I’ve seen Winterfyleth four times now, including the set which became their live album recorded at Bloodstock and earlier this year supporting Emperor at Incineration Fest. They never disappoint.

Winterfyleth are one of those bands that are so consistent with their music, without being boring or repetitive, that it doesn’t matter what they play or how familiar I am with the songs, it’s just incredible to listen to. Having said that, disappointingly, they didn’t play A Valley Thick With Oaks, which is my favourite. Who can resist singing along “In the heart of every Englishman…”? However, I did come away with a new favourite in Green Cathedral!

We only got an hour, but at least they didn’t bugger about going off and coming back for an encore. There were old songs, new songs and never before played live songs. Loved every second of it and, for the first time for me, the final song wasn’t preceded with “Sadly time is short and our songs are long, so this is our last one.” Until next time!

Optimal sizing of a product backlog

Developers working on the implementation of a software system will have a list of work that needs to be done, a to-do list, known as the product backlog in Agile.

The Agile development process differs from the Waterfall process in that the list of work items is intentionally incomplete when coding starts (discovery of new work items is an integral part of the Agile process). In a Waterfall process, it is intended that all work items are known before coding starts (as work progresses, new items are invariably discovered).

Complaints are sometimes expressed about the size of a team’s backlog, measured in number of items waiting to be implemented. Are these complaints just grumblings about the amount of work outstanding, or is there an economic cost that increases with the size of the backlog?

If the number of items in the backlog is too low, developers may be left twiddling their expensive thumbs because they have run out of work items to implement.

A parallel is sometimes drawn between items waiting to be implemented in a product backlog and hardware items in a manufacturer’s store waiting to be checked-out for the production line. Hardware occupies space on a shelf, a cost in that the manufacturer has to pay for the building to hold it; another cost is the interest on the money spent to purchase the items sitting in the store.

For over 100 years, people have been analyzing the problem of the optimum number of stock items to order, and at what stock level to place an order. The economic order quantity gives the optimum number of items to reorder, Q (the derivation assumes that the average quantity in stock is Q/2); it is given by:

Q=sqrt{{2DK}/h}, where D is the quantity consumed per year, K is the fixed cost per order (e.g., cost of ordering, shipping and handling; not the actual cost of the goods), h is the annual holding cost per item.

What is the likely range of these values for software?

  • D is around 1,000 per year for a team of ten’ish people working on multiple (related) projects; based on one dataset,
  • K is the cost associated with the time taken to gather the requirements, i.e., the items to add to the backlog. If we assume that the time taken to gather an item is less than the time taken to implement it (the estimated time taken to implement varies from hours to days), then the average should be less than an hour or two,
  • h: While the cost of a post-it note on a board, or an entry in an online issue tracking system, is effectively zero, there is the time cost of deciding which backlog items should be implemented next, or added to the next Sprint.

    If the backlog starts with n items, and it takes t seconds to decide whether a given item should be implemented next, and f is the fraction of items scanned before one is selected: the average decision time per item is: avDecideTime={f*n*(f*n+1)/2}*t seconds. For example, if n=50, pulling some numbers out of the air, f=0.5, and t=10, then avDecideTime=325, or 5.4 minutes.

    The Scrum approach of selecting a subset of backlog items to completely implement in a Sprint has a much lower overhead than the one-at-a-time approach.

If we assume that K/h==1, then Q=sqrt{2*1000}=44.7.
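Purely as a back-of-the-envelope sketch (the values of D, K and h are the assumptions discussed above, not measured quantities), the calculation is:

// Economic order quantity: Q = sqrt(2*D*K/h)
const D = 1000;  // work items consumed per year (ten'ish person team)
const K = 1;     // fixed cost per 'order' of requirements gathering
const h = 1;     // annual holding cost per item, so K/h == 1
const Q = Math.sqrt((2 * D * K) / h);
console.log(Q.toFixed(1)); // 44.7 work items per 'order'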

An ‘order’ for 45 work items might make sense when dealing with clients who have formal processes in place and are not able to be as proactive as an Agile developer might like, e.g., meetings have to be scheduled in advance, with minutes circulated for agreement.

In a more informal environment, with close client contacts, work items are more likely to trickle in or appear in small batches. The SiP dataset came from such an environment. The plot below shows the number of tasks in the backlog of the SiP dataset, for each day (blue/green) and seven-day rolling average (red) (code+data):

Tasks waiting to be implemented, per day, over duration of SiP projects.

Visual Lint 8.0.10.355 has been released

Visual Lint 8.0.10.355 has now been released.

This is a maintenance update for Visual Lint 8.0, and includes the following changes:

  • The VisualLintConsole command line parser now accepts spaces between the name and value of a command line switch. As a result, switches of the form /switchname = <value> are now accepted, whereas previously only /switchname=<value> would work.

  • Entries in the file list within manually created custom project (.vlproj) files can now use wildcards. Note however that if the project file is subsequently edited by VisualLintGui, the file list will be expanded to explicitly reflect the files actually found. As such, this feature is intended for use with hand-created .vlproj files (which are really just .ini files, and as such lend themselves nicely to hand-editing).

  • Fixed a bug opening analysis tool manual PDFs located on UNC drives.

  • Fixed a bug in VisualLintGui which could cause some file save operations to fail.

  • Updated the PC-lint Plus compiler indirect file co-rb-vs2022.lnt to support Visual Studio 2022 v17.2.6.

  • Updated the PC-lint Plus compiler indirect file co-rb-vs2019.lnt to support Visual Studio 2019 v16.11.17.

  • Updated the PC-lint Plus Boost library indirect file lib-rb-boost.lnt.

  • Removed nonfunctional "Print" menu commands from VisualLintGui.

  • Updates to various help topics.

Download Visual Lint 8.0.10.355

Tips for contenteditables

I’ve been working a bit with contenteditable tags in my HTML, and learnt a couple of things, so here they are.

Why can’t I see the cursor inside an empty contenteditable?

If you make an editable div like this:

<div contenteditable="true">
</div>

and then try to focus it, then sometimes, in some browsers, you won’t see a cursor.

You can fix it by adding a <br /> tag:

<div contenteditable="true">
<br />
</div>

Now you should get a cursor and be able to edit text inside.

Programmatically selecting text inside a contenteditable

It’s quite tricky to get the browser to select anything. Here’s a quick recipe for that:

<div id="ce" contenteditable="true">
Some text here
</div>
<script>
const ce = document.getElementById("ce");
const range = document.createRange();
range.setStart(ce.firstChild, 6);
range.setEnd(ce.lastChild, 10);
const sel = document.getSelection();
sel.removeAllRanges();
sel.addRange(range);
</script>

This selects characters 6 up to (but not including) 10, i.e. the word “text”. To select more complicated stuff inside tags etc. you need to find the actual DOM nodes to pass in to setStart and setEnd, which is quite tricky; there is a sketch of one approach below.
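Here’s a rough sketch (my own illustration, not battle-tested) of mapping a character offset within the contenteditable to a (node, offset) pair by walking its text nodes:

function findNodeAndOffset(root, target) {
  // Walk the text nodes under root, counting UTF-16 code units,
  // until we reach the node containing the target offset.
  const walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
  let remaining = target;
  let node;
  while ((node = walker.nextNode())) {
    if (remaining <= node.length) {
      return [node, remaining];
    }
    remaining -= node.length;
  }
  return null; // target is past the end of the text
}

// e.g. range.setStart(...findNodeAndOffset(ce, 6));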

Whenever you setHTML on a contenteditable, add a BR tag

If you set the HTML of a contenteditable (e.g. via innerHTML) you should always append a <br /> on the end. It doesn’t appear in any way, and it prevents weird problems.

Most notably, if you want to have an empty line at the end of your text, you need two <br /> tags, like this:

<div id="ce" contenteditable="true">
Some text here
</div>
<script>
const ce = document.getElementById("ce");
ce.innerHTML = "a<br /><br />"
</script>

If you only include one br tag, there will be no empty line at the end.

Selecting the end of a contenteditable

It’s surprisingly tricky to put the cursor at the end of a contenteditable div but here is a recipe that works:

const range = document.createRange();
range.selectNodeContents(ce);
range.collapse();
const sel = document.getSelection();
sel.removeAllRanges();
sel.addRange(range);

(Where ce is our contenteditable div.)

More tips?

Any more tips? Drop them in the comments and I’ll include them.

Career progression: an invisible issue in software development

Career progression is an important issue in the development of some software systems, but its impact is rarely discussed, let alone researched. A common consequence of career progression is that a project loses a member of staff, e.g., they move to work on a different project, or leave the company. Hiring staff and promoting staff are related neglected research areas.

Understanding the initial and ongoing development of non-trivial software systems requires an understanding of the career progression, and expectations of progression, of the people working on the system.

Effectively working on a software system requires some amount of knowledge of how it operates, or is intended to operate. The loss of a person with working knowledge of a system reduces the rate at which a project can be further developed. It takes time to find a suitable replacement, and for that person’s knowledge of the behavior of the existing system to reach a workable level.

We know that most software is short-lived, but know almost nothing about the involvement-lifetime of those who work on software systems.

There has been some research studying the durations over which people have been involved with individual Open source projects. However, I don’t believe the findings from this research, because I think that non-paid involvement on an Open source project has very different duration and motivation characteristics than a paying job (there are also data cleaning issues around the same person using multiple email addresses, and people working in small groups with one person submitting code).

Detailed employment data, in bulk, has commercial value, and so is rarely freely available. It is possible to scrape data from the adverts of job websites, but this only provides information about the kinds of jobs available, not the people employed.

LinkedIn contains lots of detailed employment history, and the US courts have ruled that it is not illegal to scrape this data. It’s on my list of things to do, and I keep an eye out for others making such data available.

The National Longitudinal Survey of Youth has followed the lives of 10k+ people since 1979 (people were asked to detail their lives in periodic surveys). Using this data, Joseph, Boh, Ang, and Slaughter investigated the job categories within the career paths of 500 people who had worked in a technical IT role. The plot below shows the career paths of people who had spent at least five years working in an IT role (code+data):

The job categories contained within the seven career paths in which people spent at least five years working in a technical IT role.

Employment history provides an upper bound for the time that a person is likely to have worked on a project (being employed to work on an Open source project while, over time, working at multiple companies is an edge case).

A company may have employees simultaneously working on multiple projects, spending a percentage of their time on each. How big a resource impact is the loss of such a person? Were they simply the same kind of cog in multiple projects, or did they play an important synchronization role across projects? Details on all the projects a person worked on would help answer some questions.

Building a software system involves a lot more than writing the code. Technical managers work on high level, broad brush issues. The project knowledge that technical managers have contributes to ongoing work, and the impact of losing a technical manager is probably more of a longer term issue than losing a coding-developer.

There are systems that are developed and maintained by essentially one person over many years. These get written about and celebrated, but are comparatively rare.

One of the more reliable ways of estimating developer productivity is to measure the impact of them leaving a project.

Glory! Hammer!

Glory! Hammer! Were fantastic!

They played well and were lots of fun as you’d expect.  I mean who doesn’t like a gig to start with a cardboard cutout of Tom Jones and Delilah playing on the PA. The band were all dressed up - it must have been very hot - and playing their parts.

I did find that some of the gaps between songs, and the interplay with the audience, felt a little too Steel Panther. It was too frequent and superfluous, and added time to a set which could have been shorter.

Sozos Michael is a phenomenal singer and makes it seem effortless and perfect. I’m a big fan of widdly guitar, and it doesn’t stand out as much on record as it did live, which was a really nice surprise.

It’ll be great to see them again when the promised new album is out and they tour again.

Programming Languages: History and Fundamentals

Programming Languages: History and Fundamentals by Jean E. Sammet is often cited in discussions of language history, but very rarely read (I appreciate that many oft cited books have not been read by those citing them, but age further reduces the likelihood that anybody has read this book; it was published in 1969). I read this book as an undergraduate, but did not think much of it. For around five years it has been on my list of books to buy, should a second-hand copy become available below £10 (I buy anything vaguely interesting below this price, with most ending up left on trains or the book table of coffee shops).

Thanks to Adam Gashlin the Internet Archive now contains a downloadable copy.

The list of 120 languages covered contains a handful of the 28 languages covered in an article from 1957. Sammet says that of the 120, 20 are already dead or on obsolete computers (i.e., it is unlikely that another compiler will be written), and that about 15 are widely used/implemented.

Today, the book is no longer a discussion of the recent past, but a window into the Cambrian explosion of programming languages that happened in the 1960s (almost everything since then has been a variation on a theme); languages from the 1950s are also included.

How does the material appear to me from a 2022 vantage-point?

The organization of the book reminded me that programming languages were once categorized by application domain, i.e., scientific/engineering users, business users, and string & list processing (i.e., academic users). This division reflected the market segmentation for computer hardware (back then, personal computers were still in the realm of science fiction). Modern programming language books (e.g., Scott’s “Programming Language Pragmatics”) often organize material based on implementation details, e.g., lexical analysis, and scoping rules.

The overview of programming languages given in the first three chapters covers nearly all the basic issues that beginners are taught today, but the emphasis is different (plus typographical differences, such as keyword spelt ‘key word’).

Two major language constructs are missing: Dynamic storage allocation is not discussed: Wirth’s book Algorithms + Data Structures = Programs is seven years in the future, and Kernighan and Ritchie’s The C Programming Language nine years; Simula gets a paragraph, but no mention of the object-oriented concepts it introduced.

What is a programming language, and what are the distinguishing features that make some of them high-level programming languages?

These questions may sound pointless or obvious today, but people used to spend lots of time arguing over what was, or was not, a high-level language.

Sammet says: “… the first characteristic of a programming language is that the user can write a program without knowing much—if anything—about the physical characteristics of the machine on which the program is to be run.”, and goes on to infer: “… a major characteristic of a programming language is that there must be a reasonable potential of having a source program written in that language run on two computers with different machine codes without rewriting the source program. … In most programming languages, some—but often very little—rewriting of the source program is necessary.”

The reason that some rewriting of the source was likely to be needed is that there were often a lot of small variations between compilers for the same language. Compilers tended to be bespoke, i.e., the Fortran compiler for the X cpu running OS Y was written specifically for that combination. Retargetting an existing compiler to a new cpu or OS was much talked about, but it was more fun to write a new compiler (and anyway, support for new features was needed, and it was simpler to start from scratch; page 149 lists differences in Fortran compilers across IBM machines). It didn’t help that there was also a lot of variation in fundamental quantities such as word length, e.g., 16, 18, 20, 24, 32, 36, 40, 48, 60 bit words; see page 18 of Dictionary of Computer Languages.

Sammet makes the distinction: “One of the prime differences between assembly and higher level languages is that to date the latter do not have the capability of modifying themselves at execution time.”

Sammet then goes on to list the advantages and disadvantages of what she calls higher level languages. Most of the claimed advantages will be familiar to readers: “Ease of Learning”, “Ease of Coding and Understanding”, “Ease of Debugging”, and “Ease of Maintaining and Documenting”. The disadvantages included: “Time Required for Compiling” (the issue here is converting assembler source to object code is much faster than compiling a high-level language), “Inefficient Object Code” (the translation process was often a one-to-one mapping of what was written, e.g., little reuse of register contents), “Difficulties in Debugging Without Learning Machine Language” (symbolic debuggers are still in the future).

Sammet’s observation: “In spite of the fact that higher level languages have been with us for over 10 years, there has been relatively little quantitative or qualitative analysis of their advantages and disadvantages.” is still true 50 years later.

If you enjoy learning about lots of different languages, you will like this book. The discussion of specific languages contains copious examples, which for me brought things to life.

Sites such as the Internet Archive and Bitsavers make the book’s references accessible (there are a few I had not seen before), and offer readers a path to pre-Cambrian times.

Saul Rosen’s 1967 book “Programming Systems and Languages” is sometimes cited in discussions of programming language history. This book is a collection of papers that discuss a variety of languages and the operating systems that support them. Fewer languages are covered, but in more depth, along with lots of implementation details. Again, lots of interesting references.

A review of React Cookbook: Recipes for Mastering the React Framework

React Cookbook: Recipes for Mastering the React Framework

by David Griffiths and Dawn Griffiths
ISBN: 978-1492085843

This is a book of about 100 recipes across 11 sections. The sections range from the basics, such as creating React apps, routing and managing state to the more involved topics such as security, accessibility and performance.

I was especially pleased to see that the section on creating apps looked at create-react-app, nextjs and a number of other getting started tools and libraries, rather than just sticking with create-react-app.

I instantly liked the way each recipe laid out the problem it was solving, the solution and then had a discussion on different aspects of the solution. It immediately felt a bit like a patterns book. For example, after describing how to use create-react-app, the discussion section explains in more depth what it really is, how it works, how to use it to maintain your app and how to get rid of it.

Like a lot of React developers, the vast majority of the work I do is maintaining existing applications, rather than creating new ones from scratch. I frequently forget how to set up things like routing from scratch and would usually reach for Google. However, with a book like this I can see myself reaching for the easy-to-find recipes again and again.

Task backlog waiting times are power laws

Once it has been agreed to implement new functionality, how long do the associated tasks have to wait in the to-do queue?

An analysis of the SiP task data finds that waiting time has a power law distribution, i.e., numTasks approx waitingTime^{-1}, where numTasks is the number of tasks waiting a given amount of time; the LSST:DM Sprint/Story-point/Story has the same distribution. Is this a coincidence, or does task waiting time always have this form?

Queueing theory analyses the properties of systems involving the arrival of tasks, one or more queues, and limited implementation resources.

A basic result of queueing theory is that task waiting time has an exponential distribution, i.e., not a power law. What software task implementation behavior is sufficiently different from basic queueing theory to cause its waiting time to have a power law?

As always, my first line of attack was to find data from other domains, hopefully with an accompanying analysis modelling the behavior. It’s possible that my two samples are just way outside the norm.

Eventually I found an analysis of the letter writing response time of Darwin, Einstein and Freud (my email asking for the data has not yet received a reply). Somebody writes to a famous scientist (the scientist has to be famous enough for people to want to create a collection of their papers and letters), the scientist decides to add this letter to the pile (i.e., queue) of letters to reply to, eventually a reply is written. What is the distribution of waiting times for replies? Yes, it’s a power law, but with an exponent of -1.5, rather than -1.

The change made to the basic queueing model is to assign priorities to tasks, and then choose the task with the highest priority (rather than a random task, or the one that has been waiting the longest). Provided the queue never becomes empty (i.e., there are always waiting tasks), the waiting time is a power law with exponent -1.5; this behavior is independent of queue length and distribution of priorities (simulations confirm this behavior).

However, the exponent for my software data, and other data, is not -1.5, it is -1. A 2008 paper by Albert-László Barabási (detailed analysis) showed how a modification to the task selection process produces the desired exponent of -1. Each of the tasks currently in the queue is assigned a probability of selection that is proportional to the priority of the corresponding task (i.e., the sum of the priorities/probabilities of all the tasks in the queue is assumed to be constant); task selection is weighted by this probability.
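As a rough sketch (my own illustration of the selection process just described, not the paper’s code), a simulation might look like this; a histogram of the returned waits is what the power-law claim above refers to:

// Fixed-length queue: each step, select one task with probability
// proportional to its priority, record how long it waited, and
// replace it with a newly arrived task of random priority.
function simulateWaits(queueLength, steps) {
  const queue = Array.from({ length: queueLength },
    () => ({ priority: Math.random(), arrived: 0 }));
  const waits = [];
  for (let t = 1; t <= steps; t++) {
    const total = queue.reduce((sum, task) => sum + task.priority, 0);
    let pick = Math.random() * total;
    let i = 0;
    while (i < queueLength - 1 && (pick -= queue[i].priority) > 0) i++;
    waits.push(t - queue[i].arrived);                    // waiting time of the selected task
    queue[i] = { priority: Math.random(), arrived: t };  // replace it with a new arrival
  }
  return waits;
}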

So we have a queueing model whose task waiting time is a power law with an exponent of -1. How well does this model map to software task selection behavior?

One apparent difference between the queueing model and waiting software tasks is that software tasks are assigned to a small number of priorities (e.g., Critical, Major, Minor), while each task in the model queue has a unique priority (otherwise a tie-break rule would have to be specified). In practice, I think that the developers involved do assign unique priorities to tasks.

Why wouldn’t a developer simply select what they consider to be the highest priority task to work on next?

Perhaps each developer does select what they consider to be the highest priority task, but different developers have different opinions about which task has the highest priority. The priority assigned to a task by different developers will have some probability distribution. If task priority assignment by developers is correlated, then the behavior is effectively the same as the queueing model, i.e., the probability component is supplied by different developers having different opinions and the correlation provides a clustering of priorities assigned to each task (i.e., not a uniform distribution).

If this mapping is correct, the task waiting time for a system implemented by one developer should have a power law exponent of -1.5, just like letter writing data.

The number of sprints that a story is assigned to, before being completely implemented, is a power law whose exponent varies around -3. An explanation of this behavior based on priority queues looks possible; we shall see…

The queueing models discussed above are a subset of the field known as bursty dynamics; see the review paper Bursty Human Dynamics for human behavior related aspects.

2-day More Concurrent Thinking class at CppCon 2022

I am excited to be going to CppCon again this year, where I will be running a 2-day class: More Concurrent Thinking in C++: Beyond the Basics.

The class is onsite at the conference venue in Aurora, Colorado, USA, on Saturday 10th September 2022 and Sunday 11th September 2022, immediately before the main conference.

I'll be teaching you to think about the synchronization properties of the constructs you use, and how to work with the low level facilities provided by the C++20 standard to build high level abstractions. We'll look at actors, thread pools and executors, and how to design code that works with those abstractions.

We'll look at how to use low-level atomic operations to build lock-free data structures, and the issues that can arise when doing so. We'll also look at how scalability concerns can impact your design choices.

Finally, we'll look at the important issue of testing multithreaded code, and the tools we can use to help us find the causes of problems.

If this sounds interesting, sign up for the class when you register for the main conference.

I'll also be doing a presentation on Tuesday: An Introduction to Multithreading in C++20. This is a whistle-stop tour of the multithreading features of C++20, briefly covering each feature and what it is used for, and which you should choose by default.

Whether you come to my class and presentation or not, I hope to see you there!

Posted by Anthony Williams

Outreachy August 2022 update

I had the pleasure of being a mentor this summer for an Outreachy internship for the Matrix organisation. Outreachy provides internships to people subject to systemic bias and impacted by underrepresentation in the technical industry where they are living.

Outreachy is a fantastic organisation doing a brilliant job to try and make our sometimes terrible industry a little bit better.

Mentoring was great fun, mainly because it was such a pleasure working with my awesome intern Usman. There is lots of support available for interns and mentors through Outreachy’s Zulip chat (when will we persuade them to use Matrix? ;-)), so you always have somewhere to turn if you have questions.

If you want to read more about the internship from Usman’s point of view, check out his blog posts:

  • Outreachy Blog – Introducing Myself
  • Wrap-up: Summary of my journey to being an outreachy intern at Element

We talked every day on video calls, and really enjoyed working together. Some days we would just chat, sometimes I would give pointers for things to try in the code, or people to talk to. Some days we worked through some code together, and that was the most fun. Usman is incredibly enthusiastic and bright, so it was very satisfying making suggestions and seeing him put them into practice.

Success!

The work went very well, and Usman succeeded in creating a prototype that will help us design the Favourite Messages feature.

Note: the feature isn’t ready to be fully released because it needs to be implemented on mobile platforms, as well as changing where it stores its information: currently we use the browser’s local storage, but we plan to store things in Matrix, meaning it is automatically synchronised between devices.

Things that went well

  • Meeting every day: we talked on a short video call every morning. This meant if we misunderstood each other it was quickly resolved, without lots of time being wasted.
  • Having a clear list of tasks: we kept a tracking issue on Github. This meant we were clear about what Usman was supposed to be doing now, and what was coming next.
  • Being flexible: we worked together to change the list of tasks every week or so. This meant we were being realistic about what could be achieved, and able to change in response to things we found out, or feedback from others.
  • Getting design input: we talked to Element’s designers several times during the project, showing them prototypes and early implementations. This meant we didn’t waste time implementing things that would need redesign later.
  • Support for me: I was able to work with Thib, who is our Outreachy Matrix community organiser, especially during the selection process. This meant I was not making decisions in isolation, and had support if anything tricky came up.
  • The Element Web community: Usman got loads of support from our community. Special thanks to Šimon, Olivier, Shay and t3chguy for your help!
  • Element the company: Element paid for this internship, and gave great support to Usman, integrating him into all our systems, inviting him to introductory meetings etc. He had every opportunity to see what working at Element is like, and to make an impression on everyone here. Element did a great job here.

Things I would do differently

  • Managing the contribution period: before the project began, applicants are invited to contribute to the projects, allowing us to choose an intern based on those contributions. I felt slightly disorganised at this stage, and there was a lot of activity in issues and pull requests in the project from applicants. I think I should have warned our community and explained what was going to happen up-front, and maybe enlisted help from people willing to triage the contributions a little. Contributions varied in quality and understanding level, so having some volunteers who were primed to spend a little more time explaining and helping contributors get started would have prevented this impinging on the time of the team as a whole. Nevertheless, our team responded really well, and we got some useful contributions, and I hope the contributors had a good experience too.
  • Integrating Usman into the team: we chose a project that was independent from what other team members were doing, meaning he mostly interacted with others when he needed help. While it is sensible to make sure interns are decoupled from the main work (because it’s hard to predict how much progress they will make, and they are going to stop after their internship), I do also wish we could have found a project that gave more opportunity to work with other people, not just “stealing” their time to help out, but actually working together on shared pieces of work. This is a tricky one to figure out, but food for thought.

Conclusions

The experience of being a mentor was really fun, and I would recommend it to anyone working on an open source project.

I would emphasise, though, that you need to put aside enough time: the internship will not be successful if you don’t make time to work with your intern, get to know them, and introduce them to your community. Since interns may be new to the world of work, or shy about taking your time, as a mentor, you need to take responsibility for giving them enough support.

Final note: as a mentor, you are NOT responsible for the work going well! Your responsibility is to help and support your intern, and give them everything they need to be successful (including feedback about things that are not working well), but it is up to the intern themselves to do the work, and how much work gets done is going to be the combination of a number of factors, including the intern’s experience and abilities. Don’t worry if you don’t get as far as you expected – after all, that happens in nearly all software projects…

Patterns in the LSST:DM Sprint/Story-point/Story ‘done’ issues

Projects that use Scrum as their project management framework estimate tasks (known as a user story, or just story) in units of Story-points. A collection of User stories are grouped together to be implemented during a Sprint (a time-boxed interval, often lasting 2-weeks).

What are Story-points, and how do they map to time (in hours and minutes)? For this post, let’s ignore these questions, simply assuming that the people who assign a story-point value to a story have some mapping in their head.

What is the average number of story-points in a story, and how does this average vary across teams? What is the distribution of number of stories estimated per sprint, how many are actually implemented, and how does this vary across teams?

The data required to answer these questions has not been publicly available, or rather public data is not known to me. Until this week, I had only known of a few public Jira repos where story-points were given for at most a few hundred stories.

The LSST Corporation, a not-for-profit involved in astronomy and physics research, has a Data Management (DM) project. The Jira repo for this project contains 26,671 ‘Done’ issues (as of Aug 2022), of which 11,082 (41.5%) have assigned story-points; there have been 469 sprints, which involved 33% of the issues. The start/end implementation date/time for stories is mostly rather granular, and not fine enough to be used to attempt to correlate individual stories with hours. I found this repo, and a couple of others, via the paper Story points changes in agile iterative development, and downloaded all available issues.

What patterns are present in the story-point and sprint data?

Story points are commonly thought of as being integer valued, but 28% of the values are non-integer. If any developers are using the Fibonacci scale, there are not enough to have a noticeable impact. The plot below shows the number of stories estimated to involve a given number of story-points (black pluses are non-integer values, which have been rounded to fit the regression model). The green curved line is a fitted biexponential (sum of two exponentials), with the two straight lines being the two component exponentials (code+data):

Number of stories estimated to involve a given number of story-points.

One exponential is dominant for stories assigned up to 10 story-points, and the second exponential for higher story-point values.

The development team decides to implement a story and allocates it to a sprint. A story may be reallocated to another sprint before the start of the original sprint, or after the sprint is finished when its implementation is incomplete or not yet started (the data does not allow for these cases to be distinguished). How many sprints is a story allocated to, before the story implementation is complete?

The plot below shows the number of stories allocated to a given number of sprints, with a fitted regression line of the form Stories approx Sprints^{-2.8} (code+data):

Number of stories assigned to a given number of distinct sprints.

So around 14% of stories are allocated to two sprints, 5% to three and 2% to four.

How many stories are assigned to a sprint? The plot below shows the number of sprints having a given number of stories assigned to them, and the number of sprints implementing a given number of stories; lines are fitted loess models (code+data):

Number of sprints assigned a given number of stories, and implementing a given number of stories.

Are the Story/Story-point/Sprint patterns found in the DM project likely to occur in other projects using Scrum?

I don’t know, but I hope so. Developing theories of software development processes requires that there be consistent patterns of behavior.

Not knowing which stories were assigned to a sprint at the start of the sprint, rather than being assigned earlier and then moved to another sprint, potentially undermines the sprint patterns. We will have to wait and see.

If anybody knows of any public Jira repos where a high percentage (say 40%) of the issues have been assigned story-points, please let me know (all the ones I know of on the Atlassian site contain a tiny percentage of story-points).

Impact of number of files on number of review comments

Code review is often discussed from the perspective of changes to a single file. In practice, code review often involves multiple files (or at least pull-based reviews do), which begs the question: Do people invest less effort reviewing files appearing later?

TLDR: The number of review comments decreases for successive files in the pull request; by around 16% per file.

The paper First Come First Served: The Impact of File Position on Code Review extracted and analysed 219,476 pull requests from 138 Java projects on Github. They also ran an experiment which asked subjects to review two files, each containing a seeded coding mistake. The paper is relatively short and omits a lot of details; I’m guessing this is due to the page limit of a conference paper.

The plot below shows the number of pull requests containing a given number of files. The colored lines indicate the total number of code review comments associated with a given pull request, with the red dots showing the 69% of pull requests that did not receive any review comments (code+data):

Number of pull requests containing a given number of files, for all pull requests, and those receiving at least 1, 2, 5, and 10 comments.

Many factors could influence the number of comments associated with a pull request; for instance, the number of people commenting, the amount of changed code, whether the code is a test case, and the number of files already reviewed (all items which happen to be present in the available data).

One factor for which information is not present in the data is social loafing, where people exert less effort when they are part of a larger group; or at least I did not find a way of easily estimating this factor.

The best model I could fit to all pull requests containing less than 10 files, and having a total of at least one comment, explained 36% of the variance present, which is not great, but something to talk about. There was a 16% decline in comments for successive files reviewed, test cases had 50% fewer comments, and there was some percentage increase with lines added; the number of comments increased by a factor of 2.4 per additional commenter (is this due to the importance of the file being reviewed, with importance being a metric not present in the data?).

The model does not include information available in the data, such as file contents (e.g., Java, C++, configuration file, etc), and there may be correlated effects I have not taken into account. Consequently, I view the model as a rough guide.

Is the impact of file order on number of comments a side effect of some unrelated process? One way of showing a causal connection is to run an experiment.

The experiment run by the authors involved two files, each containing one seeded coding mistake. The 102 subjects were asked to review the two files, with file order randomly selected. The experiment looks well-structured and thought through (many are not), but the analysis of the results is confused.

The good news is that the seeded coding mistake in the first file was much more likely to be detected than the mistake in the second file, and years of Java programming experience also had an impact (appearing first had the same impact as three years of Java experience). The bad news is that the model (a random effect model using a logistic equation) explains almost none of the variance in the data, i.e., these effects are tiny compared to whatever other factors are involved; see code+data.

What other factors might be involved?

Most experiments show a learning effect, in that subject performance improves as they perform more tasks. Having subjects review many pairs of files would enable this effect to be taken into account. Also, reviewing multiple pairs would reduce the impact of random goings-on during the review process.

The identity of the seeded mistake did not have a significant impact on the model.

Review comments are an important issue which is amenable to practical experimental investigation. I hope that the researchers run more experiments on this issue.

Analysis of a subset of the Linux Counter data

The Linux Counter project was started in 1993, with the aim of tracking the growth of Linux users (the kernel was first released two years earlier). Anybody could register any of their machines running Linux; a user ran a script that gathered basic information about a machine, and the output was emailed to the project. Once registered, users received an annual reminder to update information in their entry (despite using Linux since before the 1.0 release, user #46406 didn’t register until 2001).

When it closed (reopened/closed/coming back) it had 120K+ registered users. That’s a lot of information about computers, which unfortunately is not publicly available. I have not had any replies to my emails to those involved, asking for a copy that could be released in anonymized form.

This week I found 15,906 rows of what looks like a subset of the Linux counter data; most entries are post-2005. What did I learn from this data?

An obvious pattern to check is changes over time. While the data does not include any explicit date, it does include the Kernel version, from which the earliest date can be inferred.

An earlier post used SPEC data to estimate the growth in installed memory over time; it has been doubling every 840 days, give or take. That data contains one data point per distinct vendor computer; the Linux counter data contains one entry per computer in use. There are around thirty pairs of entries for updated systems, i.e., a user updated the entry for an existing system.

The plot below shows memory installed in each registered computer, over time, for servers, laptops and workstations, with fitted regression lines. The memory size doubling times are: servers 4,000 days, laptops 2,000 days, and workstations 1,300 days (code+data):

Memory reported for system owned by Linux counter users, from 1995 to 2015.

A regression model using dates is a good fit in the statistical sense, but explains very little of the variance in the data. The actual date on which the memory size was selected may have been earlier (because the kernel has been updated to a later release), or later (because memory was added, but the kernel was not updated).

Why is the memory doubling time so long?

Has memory size now reached the big-enough boundary? Do Linux counter users keep the same system for many years without upgrading? Are Linux counter systems retired Windows boxes that have been repurposed (data on installed memory in Windows boxes would answer this point)?

Suggestions welcome.

When memory capacity is limited, it may be useful to swap least recently used memory contents to disc; Linux setup includes the specification of a swap partition. What is the optimal size of the swap partition? A common recommendation is: if memory is less than 2G, swap size is twice memory; if between 2-8G, swap size is the same as memory; and for greater than 8G, half of memory size (a code sketch of this rule appears after the table below). The table below shows the percentage of particular system classes having a given swap/memory ratio (rounding the list of ratios to contain one decimal digit produces a list of over 100 ratio values).

swap/memory   Server  Workstation   Laptop 
   1.0         15.2       19.9       25.9
   2.0         10.3        9.6        8.6
   0.5          9.5        7.7        8.4
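As a quick sketch (purely illustrative, not from the analysis; sizes in GB), the common recommendation quoted above boils down to:

function recommendedSwapGB(memoryGB) {
  if (memoryGB < 2) return 2 * memoryGB; // small systems: twice the memory
  if (memoryGB <= 8) return memoryGB;    // mid-range: same as memory
  return memoryGB / 2;                   // large systems: half the memory
}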

The plot below shows memory against swap partition size, for the system classes laptop, server and workstation, with fitted regression line (code+data):

Memory and swap size reported for system owned by Linux counter users.

The available disk space also has a (small) impact on swap partition size; the following model explains 46% of the variance in the data: swapSize approx memory^{0.65}diskSpace^{0.08}.

I was hoping to confirm the rate of installed memory growth suggested by the SPEC data, with installed systems lagging a few years behind the latest releases. This Linux counter data tells a very different growth story. Perhaps pre-2005 data will tell another story (I just need to find it).

I’m not sure if the swap/memory ratio analysis is of any use to systems people. It was something of a fishing expedition on my part.

Other counting projects have included the Ubuntu counter project, and Hardware for Linux which is still active and goes back to August 2014.

I’m interested in hearing about the availability of any other Linux counter data, or data from other computer counting projects.

Transcoding video files for playback in a browser

ffmpeg -i original.mkv -c:v libx264 -c:a aac -ac 2 -ab 384000 -ar 48000 new.mp4

(Short answer: use the above ffmpeg command line. Read on for how I did this in Tdarr.)

I recently discovered Jellyfin, which gives me a Netflix-like UI for viewing my own videos, and seems great.

The only problem I had was that some videos were in formats that can’t be played natively in a web browser. Jellyfin heroically tries to transcode them on the fly, but my server is very lightweight, and there’s no way it can do that.

So, I needed to transcode those videos to a more suitable format.

Tdarr allows transcoding large numbers of files, and with a little head-scratching I worked out how to get it running, but I still needed the right ffmpeg options to make the videos work well in Firefox, without needing transcoding of video, audio or the container.

Here are the Tdarr settings that I found worked well:

Output file container: .mp4
Encoder: ffmpeg
CLI arguments: -c:v libx264 -c:a aac -ac 2 -ab 384000 -ar 48000
Only transcode videos in these codecs: hevc

Explanation:

  • Output file container: .mp4 – wraps the video up in an MP4 container – surprisingly, Firefox doesn’t seem to support MKV.
  • -c:v libx264 – re-encodes the video as H.264. Firefox can’t do H.265, and H.264 is reasonably compatible with lots of browsers. If you don’t care about Safari or various Microsoft browsers, you might want to think about VP9 as it’s natively supported on Firefox, so should work on weird architectures etc.
  • -c:a aac -ac 2 -ab 384000 -ar 48000 – re-encodes the audio as AAC with the right bitrates etc. Jellyfin was still transcoding the audio when I just specified -c:a aac, and it took me a while to work out that you need those other options too.
  • Only transcode videos in these codecs: hevc – hevc means H.265 encoding, and the videos I had problems with were in that encoding, but you might have different problems. If in doubt, you can choose “Don’t transcode videos in these codecs:” and uncheck all the encodings, meaning all your videos will be re-encoded.

If you are not using Tdarr, here is the plain command line to use with ffmpeg:

ffmpeg -i original.mkv -c:v libx264 -c:a aac -ac 2 -ab 384000 -ar 48000 new.mp4

Estimation accuracy in the (building|road) construction industry

Lots of people complain about software development taking longer than estimated. Are estimates in other industries more accurate, and do they contain patterns similar to those seen in software task estimates?

Readers will probably not be surprised to learn that obtaining estimate/actual data is as hard for other industries as it is for software.

Software engineering sometimes gets compared with building construction, in the sense that building construction is perceived as being straightforward and predictable. My tiny experience with building construction is that it is not as straightforward and predictable as outsiders think, a view echoed by the few people in the building industry I have spoken to.

I have found two building datasets, the supplementary material from: Forecasting the Project Duration Average and Standard Deviation from Deterministic Schedule Information (the 101 rows also include some service projects), and Ballesteros-Pérez kindly sent me the data for Duration and Cost Variability of Construction Activities: An Empirical Study which included 746 rows of road construction estimate/actual data from an unknown source. This data is for large projects, where those involved had to bid to get the work.

The following plot reminds us of what effort vs. actual often looks like for short software tasks; it includes a fitted regression model and prediction intervals at one standard deviation (68.3%) and two standard deviations (95%); the faint grey line shows Estimate == Actual (post discussing the analysis and linking to code+data):

Fitted regression model and prediction intervals at 68.3% and 95%.

The data in the above plot is for small tasks, which did not involve bidding for the work.

The following plot shows estimated vs actual duration for 101 construction projects. The red line has the form: Actual=1.09*Estimate, i.e., average estimate is 9% lower than actual duration (blue line shows Actual=Estimate; code+data).

Planned and actual duration of 101 building construction projects, with fitted regression (red) and estimate==actual (blue).
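
A minimal sketch of fitting this kind of multiplicative model in R (hypothetical columns Estimate and Actual in a data frame projects; not the original analysis, which is in the linked code+data):

# Hedged sketch: fit Actual = a*Estimate^b via a log-log regression, to compare
# the exponent b against 1 (parallel red/blue lines) and read off the multiplier a.
fit=glm(log(Actual) ~ log(Estimate), data=projects)
a=exp(coef(fit)[1])   # multiplicative factor
b=coef(fit)[2]        # exponent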

The obvious differences are that the fitted line shows consistent underestimation (hardly surprising when bidding for work; 16% of estimates are greater than the actual), that the variance of project estimate/actual about the line is much smaller for building construction, and that the red/blue lines are essentially parallel (the exponent for software tasks is consistently around 0.85, rather than 1).

The following plot shows estimated vs actual for 746 road construction projects. The red line has the form: Actual=1.24*Estimate, i.e., average estimate is 24% lower than actual duration (blue line shows Actual=Estimate; code+data):

Planned and actual duration of 746 road construction projects, with fitted regression (red) and estimate==actual (blue).

Again there is a consistent average underestimate (project bidding was via an auction process), the red/blue lines are essentially parallel, and while the estimate/actual variance is larger than for building construction, only 1.5% of estimates are greater than the actual.

Consistent underestimating is not surprising for external projects awarded via a bidding process.

The unpredicted differences are the much smaller estimate/actual variance (compared to software), and the fitted line running parallel to Actual=Estimate.

Evolution of the DORA metrics

There is a growing buzz around the DORA metrics. Where did the DORA metrics come from, what are they, and are they useful?

The company DevOps Research and Assessment LLC (DORA) was founded by Nicole Forsgren, Jez Humble, and Gene Kim in 2016, and acquired by Google in 2018. DevOps is a role that combines software development (Dev) and IT operations (Ops).

The original ideas behind the DORA metrics are described in the 2015 paper DevOps: Profiles in ITSM Performance and Contributing Factors, by Forsgren and Humble. The more well known Accelerate book, published in 2018, is an evangelistic reworking of the material, plus some business platitudes extolling the benefits of using a lean process.

The 2015 paper approaches the metric selection process from the perspective of reducing business costs, and uses a data driven approach. This is how metric selection should be done, and for the first seven or eight pages I was cheering the authors on. The validity of a data driven approach depends on the reliability of the data and its applicability to the questions being addressed. I don’t think that the reliability of the data used is sufficient to support the conclusions being drawn from it. The data used is the survey results behind the Puppet Labs 2015 State of DevOps Report; the 2018 book included data from the 2016 and 2017 State of DevOps reports.

Between 2015 and 2018, DORA was more a way of doing DevOps than a collection of metrics to calculate. The theory is based on ideas from the Economic Order Quantity model; this model is used in inventory management to calculate the number of items that should be held in stock, to meet production demand, such that stock holding costs plus item reordering costs are minimised (when the number of items in stock falls below some value, there is an optimum number of items to reorder to replenish stocks).

The DORA mapping of the Economic Order Quantity model to DevOps employs a rather liberal interpretation of the concepts involved. There are three fundamental variables:

  • Batch size: the quantity of additions, modifications and deletions of anything that could have an effect on IT services, e.g., changes to code or configuration files,
  • Holding cost: the lost opportunity cost of not deploying work that has been done, e.g., lost business because a feature is not available or waste because an efficiency improvement is not used. Cognitive capitalism also has the lost opportunity cost of not learning about the impact of an update on the ecosystem,
  • Transaction cost: the cost of building, testing and deploying to production a completed batch.

The aim is to minimise TotalCost=HoldingCost+TransactionCost.
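
As a reference point, here is a minimal sketch of the classic Economic Order Quantity calculation (my own illustration with made-up values; not anything DORA computes):

# Classic EOQ: demand D items per period, transaction (ordering) cost K per order,
# holding cost h per item per period; total cost is ordering cost plus holding cost.
eoq=function(D, K, h) sqrt(2*D*K/h)
total_cost=function(Q, D, K, h) (D/Q)*K + (Q/2)*h

Q_star=eoq(D=1000, K=50, h=2)              # assumed example values
c(Q_star, total_cost(Q_star, 1000, 50, 2)) # optimal batch size and its total cost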

So far, so good and reasonable.

Now the details; how do we measure batch size, holding cost and transaction cost?

DORA does not measure these quantities (the paper points out that deployment frequency could be treated as a proxy for batch size, in that as deployment frequency goes to infinity batch size goes to zero). The terms holding cost and transaction cost do not appear in the 2018 book.

Having mapped Economic Order Quantity variables to software, the 2015 paper pivots and maps these variables to a Lean manufacturing process (the 2018 book focuses on Lean). Batch size is now deployment frequency, and higher is better.

Ok, let’s follow the pivoted analysis of Lean ideas applied to software. The 2015 paper uses cluster analysis to find patterns in the 2015 State of DevOps survey data. I have not seen any of the data, or even the questions asked; the description of the analysis is rather sketchy (I imagine it is similar to that used by Forsgren in her PhD thesis on a different dataset). The report published by Puppet Labs analyses the data using linear regression and partial least squares.

Three IT performance profiles are characterized (High, Medium and Low). Why three and not, say, four or five? The paper simply says that three ’emerged’.

The analysis of the Puppet Labs 2015 survey data (6k+ responses) essentially takes the form of listing differences in values of various characteristics between High/Medium/Low teams; responses came from “technical professionals of all specialities involved in DevOps”. The analysis in the 2018 book discussed some of the between year differences.

My experience of asking hundreds of people for data is that most don’t have any. I suspect this is true of those who answered the Puppet Labs surveys, and that answers are guestimates.

The fact that the accuracy of analysis of the survey data is poor does not really matter, because DORA pivots again.

This pivot switches to organizational metrics (from team metrics), becomes purely production focused (very appropriate for DevOps), introduces an Elite profile, and focuses on four key metrics; the following is adapted from Google:

  • Deployment Frequency: How often an organization successfully releases to production,
  • Lead Time for Changes: The amount of time it takes a commit to get into production,
  • Change Failure Rate: The percentage of deployments causing a failure in production,
  • Mean time to repair (MTTR): How long it takes an organization to recover from a failure in production.

Are these four metrics useful?

To somebody with zero DevOps experience (i.e., me) they look useful. The few DevOps people I have spoken to are talking about them but not using them (not least because they don’t have the data required).

The characteristics of the Elite/High/Medium/Low profiles reflect Google’s DevOps business interests. Companies offering an online service at a national scale want to quickly respond to customer demand, continuously deploy, and quickly recover from service outages.

There are companies where it makes business sense for DevOps deployments to occur much less frequently than at Google. I also know companies who would love to have deployment rates within an order of magnitude of Google’s, but cannot even get close without a significant restructuring of their build and deployment infrastructure.

Matrix is a Distributed Real-time Database Video

Curious to know a bit more about Matrix? This video goes into the details of what kinds of requests you need to send to write a Matrix client, and why it’s interesting to write a Matrix server.

Slides: Matrix is a Distributed Real-time Database Slides

Really excited that since I started my job working on Matrix, I have become more enthusiastic about it, rather than less.

Extracting numbers from a stacked density plot

A month or so ago, I found a graph showing the percentage of PCs having a given range of memory installed, between March 2000 and April 2020, on a TechTalk page of PC Matic; it had the form of a stacked density plot. This kind of installed memory data is rare; how could I get the underlying values (a previous post covers extracting data from a heatmap)?

The plot below is the image on PC Matic’s site:

Percentage of PC having a given amount of installed memory, from 2000 to 2020.

The change of colors creates a distinct boundary between different memory capacity ranges, and it ought to be possible to find the y-axis location of each color change, for a given x-axis location (with location measured in pixels).

The image was a png file; I loaded R’s png package, and a call to readPNG created the required 2-D array of pixel information.

library("png")
img=readPNG("../rc_mem_memrange_all.png")

Next, the horizontal and vertical pixel boundaries of the colored data needed to be found. The rectangle of data is surrounded by white pixels. The number of white pixels (actually all ones corresponding to the RGB values) along each horizontal and vertical line dramatically drops at the data image boundary. The following code counts the number of pixels having color col in each horizontal line (used to find the y-axis bounds):

horizontal_line=function(a_img, col)
{
n_lines=dim(a_img)[1]   # number of horizontal pixel lines in the image
lines_col=sapply(1:n_lines, function(X) sum((a_img[X, , 1]==col[1]) &
                                            (a_img[X, , 2]==col[2]) &
                                            (a_img[X, , 3]==col[3]))
                )

return(lines_col)
}

white=c(1, 1, 1)
n_cols=dim(img)[2]

# Find where fraction of white points on a line changes dramatically
white_horiz=horizontal_line(img, white)

# handle when upper boundary is missing
ylim=c(0, which(abs(diff(white_horiz/n_cols)) > 0.5))
ylim=ylim[2:3]
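
The same counting, transposed to columns, can be used to find the x-axis bounds; a minimal sketch (my own addition, not part of the original code):

vertical_line=function(a_img, col)
{
n_cols=dim(a_img)[2]   # number of vertical pixel columns in the image
cols_col=sapply(1:n_cols, function(X) sum((a_img[ , X, 1]==col[1]) &
                                          (a_img[ , X, 2]==col[2]) &
                                          (a_img[ , X, 3]==col[3]))
               )

return(cols_col)
}

white_vert=vertical_line(img, white)
xlim=which(abs(diff(white_vert/dim(img)[1])) > 0.5)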

Next, for each vertical column of pixels, at each x-axis pixel location, the sought after y value occurs at the change of color boundary in the corresponding vertical column. This boundary includes a 1-pixel wide separation color, which creates a run of 2 or 3 consecutive pixel color changes.

The color change is easily found using the duplicated function.

# Return y position of vertical color changes at x_pos
y_col_change=function(x_pos)
{
# Good enough technique to generate a unique value per RGB color
col_change=which(!duplicated(img[y_range, x_pos, 1]+
                          10*img[y_range, x_pos, 2]+
                         100*img[y_range, x_pos, 3]))

# Handle a 1-pixel separation line between colors. 
# Diff is used to find these consecutive sequences.
y_change=c(1, col_change[which(diff(col_change) > 1)+1])

# Always return a vector containing max_vals elements.
return(c(y_change, rep(NA, max_vals-length(y_change))))
}

Next, we need to group together the sequence of points that delimit a particular boundary. The points along the same boundary are all associated with the same two colors, i.e., the ones below/above the boundary (plus a possible boundary color).

The plot below shows all the detected boundary points, in black, overwritten by colors denoting the points associated with the same below/above colors (code):

Colored points showing detected area color boundaries.

The visible black pluses show that the algorithm is not perfect. The few points here and there can be ignored, but the two blocks at the top of the original image have thrown a spanner in the works for some range of points (this could be fixed manually, or perhaps it is possible to tweak the color extraction formula to work around them).

How well does this approach work with other stacked density plots? No idea, but I am on the lookout for other interesting examples.

Multi-state survival modeling of a Jira issues snapshot

Work items in a formal development process progress through a series of stages, e.g., starting at Open, perhaps moving to Withdrawn or Merged with another item, eventually reaching Development, and finishing at Done (with a few being Reopened, i.e., moving back to the start of the process).

This process can be modelled as a Markov chain, provided data on each stage of the process is available, for each work item; allowing values such as average time spent in each state and transition probabilities to be calculated.

The Jira issue/task/bug/etc tracking system has an option to generate a snapshot of the current status of work items in the system. The snapshot information on each item includes: start-date, current-state, time-in-state, date-of-snapshot.

If we assume that all work items pass through the same sequence of states, from Open to Done, then the snapshot contains enough information to build a multi-state survival model.

The key information is time-in-state, which can be used to calculate the date/time when an item transitioned from its previous state to its current state, providing a required link between all states.

How is a multi-state survival model better than creating a distinct survival model for each state?

The calculation of each state in a multi-state model takes into account information from the succeeding state, i.e., the time-in-state value in the succeeding state provides timing (from the Start state) on when a work item transitioned from its previous state. While this information could be added to each of the distinct models, it’s simpler to bundle everything together in one model.

A data analysis article by Robert Krasinski linked to the data used 🙂 The data does not include a description of the columns, but most of the names appear self-explanatory (I have no idea what key might be). Each of the 3,761 rows includes a story-point estimate, team-id, and a tag name for the work item.

Building a multi-state model provides a means for estimating the impact of team-id and story-points on time-in-state. I would expect items with higher story-point estimates to spend longer in Development, but I’m not sure how much difference there will be on other states.

I pruned the 22 states present in the data down to the following sequence of 13. Items might be Withdrawn or Merged with other items at any time, but I’m keeping things simple. These two states should also be absorbing, in that there is no exit from them; I faked this by adding a transition to Done.

           Open
           Withdrawn
           Merged
           Backlog
           In Analysis
           In Refinement
           Ready for Development
           In Development
           Code Review
           Ready for Test
           In Testing
           Ready for Signoff
           Done

I’m familiar with building survival models, but have only ever built a couple of multi-state survival models. R supports several packages; which is the best one to use for this minimalist multi-state dataset?

The msm package is very much into state transition probabilities, or at least that is the impression I got from reading its manual. flexsurv and mstate are other packages I looked at. I decided to stay with the survival package, the default for simpler problems; the manuals contained lots of examples and some of them appeared similar to my problem.

Each row of work item information in the Jira snapshot looks something like the following:

 X daysInStatus      start         status    obsdate
 1         0.53 2020-05-12 In Development 2020-05-18

This work item transitioned from state Ready for Development at time obsdate-start-daysInStatus to state In Development at time obsdate-start-daysInStatus+10^{-3}, and was still in state In Development at time obsdate-start (when the snapshot was taken); the 10^{-3} is a small interval used to separate the states.
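
A minimal sketch of this timing calculation for a single snapshot row (column names taken from the example above; the full model-building code is in the code+data linked below):

# Hedged sketch: derive when the current state was entered, and the censored
# interval spent in it, for the example row shown above.
snap=data.frame(daysInStatus=0.53,
                start=as.Date("2020-05-12"),
                status="In Development",
                obsdate=as.Date("2020-05-18"))

days_in_system=as.numeric(snap$obsdate-snap$start)  # time from start to snapshot
entered_current=days_in_system-snap$daysInStatus    # transition into current state

interval=data.frame(tstart=entered_current+1e-3,    # small offset separating states
                    tstop=days_in_system,
                    state=snap$status)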

As is often the case with R packages, most of the work went into figuring out how to call the library functions with the data formatted just so, plus of course my own misunderstandings. Once the data was cleaned and processed, the analysis was one line of code plus one to print the results; for instance, to estimate the mean time in each state by story-point value (code+data):

  sp_fit=survfit(Surv(tstop-tstart, state) ~ sp, data=merged_status)
  print(sp_fit)

Given the uncertainties in this model building process, I’m not going to discuss the results. This post is a proof of concept, which others can apply when the sequence of states is known with some degree of confidence, and good reasons for noise in the data are available.

Building cross-platform Rust for Web, Android and iOS – a minimal example

One of the advantages of writing code in Rust is that it can be re-used in other places. Both iOS and Android allow using native libraries within your apps, and Rust compiles to native. Web pages can now use WebAssembly (WASM), and Rust can compile to WASM.

So, it should be easy, right?

Well, in practice it seems a little tricky, so I created a small example project to explain it to myself; maybe it’s helpful to you too.

The full code is at gitlab.com/andybalaam/example-rust-bindings, but here is the general idea:

crates/example-rust-bindings - the real Rust code
bindings/ffi - uniffi code to build shared objects for Android and iOS
bindings/wasm - wasm_bindgen code to build WASM for Web
examples/example-android - an Android app that generates a Kotlin wrapper, and runs the code in the shared object
examples/example-web - a web page that imports the WASM and runs it

Steps for WASM

Proof that I did this on Web - Firefox showing "This string is from Rust!"

Variation: if you modify the build script in package.json to call wasm-pack with --target node instead of --target web you can generate code suitable for using from a NodeJS module.

Steps for Android

Proof that I did this on Android: Android emulator showing a label "This string is from Rust!"

Steps for iOS

I am working on this and will fill it in later.

Books to be written

I’m entering the final stretch of writing for my new book so it is time to think hard about the name.

“How I write books” was only ever a working title; now I think I’ve hit on the permanent name “Books to be written”. What do you think?

Let me know – you can check out the (nearly finished) book on LeanPub right now. Almost all the chapters are now drafted.

The post Books to be written appeared first on Allan Kelly.

Estimating quantities from several hundred to several thousand

How much influence do anchoring and financial incentives have on estimation accuracy?

Anchoring is a cognitive bias which occurs when a decision is influenced by irrelevant information. For instance, a study by John Horton asked 196 subjects to estimate the number of dots in a displayed image, but before providing their estimate subjects had to specify whether they thought the number of dots was higher/lower than a number also displayed on-screen (this was randomly generated for each subject).

How many dots do you estimate appear in the plot below?

Image containing 500 dots.

Estimates are often round numbers, and 46% of dot estimates had the form of a round number. The plot below shows the anchor value seen by each subject and their corresponding estimate of the number of dots (the image always contained five hundred dots, like the one above), with round number estimates in the same color rows (e.g., 250, 300, 500, 600; code+data):

Anchor value seen by a subject and corresponding estimate of number of dots.

How much influence does the anchor value have on the estimated number of dots?

One way of measuring the anchor’s influence is to model the estimate based on the anchor value. The fitted regression equation Estimate=54*Anchor^{0.33} explains 11% of the variance in the data. If the higher/lower choice is included in the model, 44% of the variance is explained; the higher equation is: Estimate=169+1.1*Anchor and the lower equation is: Estimate=169+0.36*Anchor (a multiplicative model has a similar goodness of fit), i.e., the anchor has three times the impact when it is thought to be an underestimate.
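
A minimal sketch of fitting these two models (hypothetical column names Estimate, Anchor and Choice; the actual analysis is in the linked code+data):

# Hedged sketch: power-law model of Estimate on Anchor, plus an additive model
# with a common intercept and a separate Anchor slope for the higher/lower choice.
power_fit=glm(log(Estimate) ~ log(Anchor), data=dots)
split_fit=glm(Estimate ~ Anchor:Choice, data=dots)   # Choice is a higher/lower factor
summary(split_fit)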

How much would estimation accuracy improve if subjects were given the option of being rewarded for more accurate answers, and no anchor is present?

A second experiment offered subjects the choice of either an unconditional payment of $2.50 or a payment of $5.00 if their answer was in the top 50% of estimates made (labelled as the risk condition).

The 196 subjects saw up to seven images (65 only saw one), with the number of dots varying from 310 to 8,200. The plot below shows actual number of dots against estimated dots, for all subjects; blue/green line shows Estimate == Actual, and red line shows the fitted regression model Estimate approx Actual^{0.9} (code+data):

Actual and estimated number of dots in image seen by subjects.

The variance in the estimated number of dots is very high and increases with increasing actual dot count; however, this behavior is consistent with the increasing variance seen for images containing under 100 dots.

Estimates were not more accurate in those cases where subjects chose the risk payment option. This is not surprising, performance improvements require feedback, and subjects were not given any feedback on the accuracy of their estimates.

Of the 86 subjects estimating dots in three or more images, 44% always estimated low and 16% always high. Subjects always estimating low/high also occurs in software task estimates.

Estimation patterns previously discussed on this blog have involved estimated values below 100. This post has investigated patterns in estimates ranging from several hundred to several thousand. Patterns seen include extensive use of round numbers and increasing estimate variance with increasing actual value; all seen in previous posts.

Most percentages are more than half

Most developers think …

Most editors …

Most programs …

Linguistically most is a quantifier (it’s a proportional quantifier); a word-phrase used to convey information about the number of something, e.g., all, any, lots of, more than half, most, some.

Studies of most have often compared and contrasted it with the phrase more than half; findings include: most has an upper bound (i.e., not all), and more than half has a lower bound (but no upper bound).

A corpus analysis of most (432,830 occurrences) and more than half (4,857 occurrences) found noticeable usage differences. Perhaps the study’s most interesting finding, from a software engineering perspective, was that most tended to be applied to vague and uncountable domains (i.e., there was no expectation that the population of items could be counted), while uses of more than half almost always had a ‘survey results’ interpretation (e.g., supporting data cited as corroboration for 80% of occurrences; uses of most cited data for 19% of occurrences).

Readers will be familiar with software related claims containing the most qualifier, which are actually opinions that are not grounded in substantive numeric data.

When most is used in a numeric based context, what percentage (of a population) is considered to be most (of the population)?

When deciding how to describe a proportion, a writer has the choice of using more than half, most, or another qualifier. Corpus based studies find that the distribution of most has a higher average percentage value than more than half (both are left skewed, with most peaking around 80-85%).

When asked to decide whether a phrase using a qualifier is true/false, with respect to background information (e.g., Given that 55% of the birlers are enciad, is it true that: Most of the birlers are enciad?), do people treat most and more than half as being equivalent?

A study by Denić and Szymanik addressed this question. Subjects (200 took part, with results from 30 excluded for various reasons) saw a statement involving a made-up object and verb, such as: “55% of the birlers are enciad.” They then saw a sentence containing either most or more than half, that was either upward-entailing (e.g., “More than half of the birlers are enciad.”), or downward-entailing (e.g., “It is not the case that more than half of the birlers are enciad.”); most/more than half and upward/downward entailing creates four possible kinds of sentence. Subjects were asked to respond true/false.

The percentage appearing in the first sentence of the two seen by subjects varied, e.g., “44% of the tiklets are hullaw.”, “12% of the puggles are entand.”, “68% of the plipers are sesare.” The percentage boundary where each subject’s true/false answer switched was calculated (i.e., the mean of the percentages in the questions on each side of the true/false boundary; often these values were 46% and 52%, whose average is 49; this is an artefact of the question wording).

The plot below shows the number of subjects whose true/false boundary occurred at a given percentage (code+data):

Number of subjects whose true/false boundary occurred at a given percentage.

When asked, the majority of subjects had a 50% boundary for both most and more than half, whether the sentence was upward or downward entailing. A downward entailment causes some subjects to lower their 50% boundary.

So now we know (subject to replication). Most people are likely to agree that 50% is the boundary for most/more than half, but some people think that the boundary percentage is higher for most.

When asked to write a sentence, percentages above 50% attract more mosts than more than halfs.

Most is preferred when discussing vague and uncountable domains; more than half is used when data is involved.

Over/under estimation factor for ‘most estimates’

When asked to estimate the time taken to perform a software development related task, people regularly over or under estimate. What range of over/under estimation falls within the bounds of the term ‘most estimates’, i.e., the upper/lower bounds of the ratio Actual/Estimate (an overestimate occurs when Actual/Estimate < 1, an underestimate when 1 < Actual/Estimate)?

On Twitter, I have been citing a factor of two for over/under time estimates. This factor of two involves some assumptions and a personal interpretation.

The following analysis is based on the two major software task effort estimation datasets: SiP and CESAW. The tasks in both datasets are for internal projects (i.e., no tendering against competitors), and require at most a few hours work.

The following analysis is based on the SiP data.

Of the 8,252 unique tasks in the SiP data, 30% are underestimates, 37% exact, and 33% overestimates.

How do we go about calculating bounds for the over/under factor of most estimates (a previous post discussed calculating an accuracy metric over all estimates)?

A simplistic approach is to average over each of the overestimated and underestimated tasks. The plot below shows hours estimated against the ratio actual/estimated, for each task (code+data):

Actual/estimate ratio for SiP tasks having a given Estimate value.

Averaging the over/under estimated tasks separately (using the geometric mean) gives 0.47 and 1.9 respectively, i.e., tasks are over/under estimated by a factor of two.
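
A minimal sketch of that calculation (hypothetical column names in a data frame sip; the actual code+data are linked above):

# Hedged sketch: geometric mean of the Actual/Estimate ratio, calculated
# separately for the over- and under-estimated tasks.
geo_mean=function(x) exp(mean(log(x)))
ratio=sip$Actual/sip$Estimate
geo_mean(ratio[ratio < 1])   # overestimated tasks
geo_mean(ratio[ratio > 1])   # underestimated tasks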

This approach fails to take into account the number of estimates that are over/under/equal, i.e., it ignores likelihood information.

A regression model takes into account the distribution of values, and we could adopt the fitted model’s prediction interval as the over/under confidence intervals. The prediction interval is the interval within which other observations are expected to fall, with some probability (R’s predict function uses one standard deviation).

The plot below shows a fitted regression model and prediction intervals at one standard deviation (68.3%) and two standard deviations (95%); the faint grey line shows Estimate == Actual (code+data):

Fitted regression model and prediction intervals at 68.3% and 95%.

The fitted model has a shallower slope than the Estimate == Actual line; consequently, the over/under estimation factor depends on the size of the estimate. The table below lists the over/under estimation factor for low/high estimates at one & two standard deviations (68.3 and 95% probability).

People like simple answers (i.e., single values) and the mean value is a commonly used technique of summarising many values. The task estimate values are unevenly distributed and weighting the mean by the distribution of estimated values is more representative than, say, an evenly distributed set of estimates. The 5th and 6th columns in the table below list the weighted means at one/two standard deviations (the CESAW columns are the values for all projects in the CESAW data).

          1 sd           2 sd        Weighted mean       CESAW
       Low    High   Low    High     1 sd    2 sd     1 sd   2 sd
Over   0.56   0.24   0.27   0.11     0.46    0.25     0.29   0.1
Under  2.4    1.0    4.9    2.0      2.00    4.1      2.4    6.5

The weighted means for over/under estimates are close to a factor of two of the actual (divide/multiply) within one standard deviation (68.3%), and a factor of four within two standard deviations (95%).

Why choose to give the one standard deviation factor, rather than the two? People talk of “most estimates”, but what percentage range does ‘most’ map to? I have tended to think of ‘most’ as more than two-thirds, e.g., at least one standard deviation (a recent study suggests that ‘most’ usage peaks at 80-85%), and I think of two standard deviations as ‘nearly all’ (i.e., 95%; there are probably people who call this ‘most’).

Perhaps a factor of between two and four is a more appropriate answer (particularly since the bounds are wider for the CESAW data). Suggestions welcome.

How fresh is your backlog?

Do you struggle with an overwhelming backlog?

Do you count the number of product backlog items in your backlog in tens? hundreds? or thousands?

Does your backlog contain many stories which have been there for months, if not years, and yet never rise to the top of the backlog?

Is your success judged on your ability to do the backlog?

Backlogs were a good idea when they were introduced a bit over 20 years ago, but today many teams are slaves to the backlog – see my posts on the Tyranny of the backlog and Purpose over backlog. One of the benefits I’ve called out for OKRs is the ability to move away from backlog driven development (BLDD).

In Succeeding with OKRs in Agile I suggest you either need to prioritise your backlog over OKRs (in which case OKRs are derived from the backlog you intend to do) or OKRs over backlog (in which case OKRs are derived from strategy and the backlog plays a supporting role). In my podcast with Jenny Herald earlier this year I even say “Let OKRs drive… nuke the backlog.”

Filipe Albero Pomar recently shared his backlog freshness blog which I think is great. Freshness is a great way to think about the state of the backlog that separates the size of the backlog from the relevance of the backlog.

Filipe’s idea is simply to talk about the backlog in terms of freshness – you have a fresh backlog if your backlog items are fresh: written recently, relate to current opportunities, problems and things people currently want.

And of course, the opposite of fresh: stale – stories that have been sitting in the backlog for months, even years, stories that relate to yesterday’s problems and projects, stories which people wanted last year. The existence of a big backlog of stale stories means the team is seen to be not delivering, and the end-date is far off because people still expect all the work to be done.

Filipe suggests backlog freshness can be measured:

1. Set a cut-off date

2. Categorise stories as fresh or stale: fresh stories have been written since the cut-off date, those which are older are stale

3. Calculate freshness as the percentage of fresh stories in the total: if 25 stories out of 60 have been written in the last month then the backlog is about 42% fresh, and 58% stale
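
A back-of-the-envelope sketch of that calculation (a spreadsheet works just as well; story_created here is an assumed vector of story creation dates):

cutoff=Sys.Date()-30                        # 1. set a cut-off date, e.g., one month ago
fresh=sum(story_created >= cutoff)          # 2. stories written since the cut-off are fresh
round(100*fresh/length(story_created))      # 3. freshness as a percentage of the total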

That’s a useful metric, but I think we can do better – look at the graphic above. I grouped backlog items into age groups and graphed them, and for completeness added a line to indicate average story age. Clearly this backlog is not fresh – nearly half the stories are over a year old.

I like the idea of graphing backlog freshness because it is easy to understand and makes an impact. Whether this is the best way to demonstrate backlog freshness I’m not sure – I’m playing with a histogram and quartile ranges.

With some clients I’ve thought of the backlog like a mortgage. There is the principal (the existing backlog), the interest rate (the growth rate of the backlog) and the monthly repayments (stories reaching done). Unfortunately, when you do this you sometimes find the mortgage will not be paid off for many years, and perhaps never. (Don’t worry about estimating the size of stories; for this sort of analysis the number of stories will get you started, and if your backlog is measured in hundreds of items the small will offset the large.)

I’d love to talk more about this and experiment with some ideas; I think it could be a very useful way of thinking about the backlog.


Subscribe to my blog newsletter and get Continuous Digital for free

The post How fresh is your backlog? appeared first on Allan Kelly.

A Wikihouse hackathon

I was at the Wikihouse hackathon on Wednesday. Wikihouse is an open-source project involving prefabricated house designs and building processes.

Why is a software guy attending what looks like a very non-software event? The event organizers listed software developer as one of the attendee skill sets. Also, I have been following the blog Construction Physics, where Brian Potter has been trying to work out why the efficiency of building construction has not significantly improved over many decades; the approach is wide-ranging, data driven and has parallels with my analysis of software engineering. I counted four software people at the event, out of 30’ish attendees; Sidd I knew from previous hacks.

Building construction shares some characteristics with software development. In particular, projects are bespoke, but constructed using subcomponents that are variations on those used in most other projects of the same kind of building, e.g., walls and window frames, floors/ceilings.

The Wikihouse design and build process is based around a collection of standardised, prefabricated subcomponents, called blocks; these are made from plywood, slotted together, and held in place using butterfly/bow-tie joints (wood has a negative carbon footprint). A library of blocks is available, with the page for each block including a DXF cutting file, assembly manual, 3-D model, and costing; there is a design kit for building a house, including a spreadsheet for costing, and a variety of How-Tos. All this is available under an open source license. The Open Systems Lab is implementing building design software and turning planning codes into code.

Not knowing anything about building construction, I have no way of judging the claims made during the hackathon introductory presentations, e.g., cost savings, speed of build, strength of building, expected lifetime, etc.

Constructing lots of buildings using Wikihouse blocks could produce an interesting dataset (provided those doing the construction take the time to record things). Questions such as: how does construction time vary by team composition (self-build is possible) and experience, and by number of rooms and their size spring to mind (no construction time/team data was recorded during the construction of the ‘beta test’ buildings).

The morning was taken up with what was essentially a product pitch, then we got shown around the ‘beta test’ buildings (they feel bigger on the inside), lunch and finally a few hours hacking. The help they wanted from software people was in connecting together some of the data/tools they had created, but with only a few hours available there was little that could be done (my input was some suggestions on construction learning curves and a few people/groups I knew doing construction data analysis).

Will an open source approach enable the Wikihouse project to succeed with its prefabricated approach to building construction, where closed source companies have failed when using this approach, e.g., Katerra?

Part of the reason that open source software succeeded was that it provided good enough functionality to startup projects/companies who could not afford to pay for software (in some cases the open source tools provided superior functionality). Some of these companies grew to be significant players, convincing others that open source was viable for production work. Source code availability allows developers to use it without needing to involve management, and plenty of managers have been surprised to find out how embedded open source software is within their group/organization.

Buildings are not like software, lots of people with some kind of power notice when a new building appears. Buildings need to be connected to services such as water, gas and electricity, and they have a rateable value which the local council is keen to collect. Land is needed to build on, and there are a whole host of permissions and certificates that need to be obtained before starting to build and eventually moving in. Doing it, and telling people later is not an option, at least in the UK.

Deporting desperate people from the UK

Letter to my MP on deporting refugees to Rwanda, 2022-06-06.

Dear Ben Spencer,

Please do what you can to reverse the policy of sending asylum seekers to Rwanda.

We are breaking our proud tradition of commitment to refugees.

This policy seems to have the intention of preventing people from drowning while attempting to enter the UK. Instead of ruling people’s claims “inadmissible” because they were desperate enough to enter by a dangerous route, we should provide safe routes for people escaping war and harm.

I am sure this policy was drafted with good intent, but as soon as it started we saw disproportionate numbers of Sudanese people being deported to Rwanda [1] versus other nationalities. Even within the parameters of its own flawed morality, this policy is unfair in practice. It should be stopped immediately.

[1] https://www.theguardian.com/uk-news/2022/jun/06/home-office-offers-asylum-seekers-choice-between-war-zones-they-fled-and-rwanda

Yours sincerely,

Andy Balaam

If you want to write a similar letter, feel free to use any of the above if it’s helpful. I used WriteToThem.com as a very easy way to find your MP and send a message.

Cognitive effort, whatever it might be

Software developers spend a lot of time acquiring knowledge and understanding of the software system they are working on. This mental activity fits within the field of Cognition, which covers all aspects of intellectual functions and processes. Human cognition as it relates to software development is covered in chapter 2 of my book Evidence-based software engineering; a reading list.

Cognitive effort (e.g., thinking) is hard work, or at least mental effort feels like hard work. It has become fashionable for those extolling the virtues of some development technique/process to claim that one of its benefits is a reduction in cognitive effort; sometimes the term cognitive load is used, but I suspect this is not a reference to cognitive load theory (which is working memory based).

A study by Arai, with herself as the subject, measured the time taken to mentally multiply two four-digit values (e.g., 2,645 times 5,784). Over 2-weeks, Arai practiced on four days, on each day multiplying over 20 four-digit value pairs. A week later Arai multiplied 40 four-digit value pairs (starting at 1:45pm, finishing at 6:31pm), had dinner between 6:31-7:41 pm, and then, multiplied 20 four-digit value pairs (starting at 7:41, finishing at 10:07). The plot below shows the time taken for each mental multiplication sequence, with fitted regression lines (code+data):

Time taken for two sequences of mental multiplication, before/after dinner, with fitted regression lines.
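
A minimal sketch of fitting one regression line per session (hypothetical column names; the actual code+data are linked above):

# Hedged sketch: time taken for each multiplication, regressed against its
# position in the sequence, fitted separately for the two sessions.
before_fit=glm(minutes ~ seq_num, data=subset(arai, session == "before dinner"))
after_fit=glm(minutes ~ seq_num, data=subset(arai, session == "after dinner"))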

Over the course of the first, 5-hour session, the average time taken per multiplication increased from four to eight minutes. The slope of the regression fit for the second session is poor, although the fit for the start value (6 minutes) is good.

The average increase in time taken is assumed to be driven by a reduction in mental effort, caused by the mental fatigue experienced during an extended period of continuous mental work.

What do we know about cognitive effort?

TL;DR Many theories and little evidence.

Cognitive psychologists are still at the stage of figuring out what exactly cognitive effort is. For instance, what is going on when we try harder (or decide to give up), and what is being conserved when we conserve our mental resources? The major theories include:

  • Cognitive control: Mental processes form a continuum, from those that can be performed automatically with little or no effort, to those requiring concentrated conscious effort. Here, cognitive control is viewed as the force through which cognitive effort is exerted. The idea is that mental effort regulates the engagement of cognitive control in the same way as physical effort regulates the engagement of muscles.
  • Metabolic constraints: Mental processes consume energy (glucose is the brain’s primary energy source), and the feeling of mental effort is caused by reduced levels of glucose. The extent to which mental effort is constrained by glucose levels is an ongoing debate.
  • Capacity constraints: Working memory has a limited capacity (i.e., the oft quoted 7±2 limit), and tasks that fill this capacity do feel effortful. Cognitive load theory is based around this idea. A capacity limited working memory, as a basis of cognitive effort, suffers from the problem that people become mentally tired in the sense that later tasks feel like they require more effort. A capacity constrained model does not predict this behavior. Neither does a constraints model predict that increasing rewards can result in people exerting more cognitive effort.

How might cognitive effort be measured?

TL;DR It’s all relative or not at all.

To date, experiments have compared relative expenditure of effort between different tasks (some comparing cognitive with physical effort, others purely cognitive). For instance, showing that subjects are willing to perform a task requiring more cognitive effort when the expected reward is higher.

As always with human experiments, people can have very different behavioral characteristics. In particular, people differ in what is known as need for cognition, i.e., their willingness to invest cognitive effort.

While a lot of research has investigated the characteristics of working memory, the only real metric studied has been capacity, e.g., the longest sequence of digits that can be remembered/recalled, or span tasks involving having to remember words while performing simple arithmetic operations.

Experimental research on cognitive effort seems to be picking up, but don’t hold your breath for reliable answers. Research of human characteristics can start out looking straightforward, but tends to quickly disappear down multiple, inconclusive rabbit holes.

Scrum or OKRs, which comes first?

“Hello Allan , I followed a few of your seminars on OKRs and I am reading your book.  I do have one question. I will start a transformation soon. The company has the ambition to move to Scrum and also OKRs; Do you have any advise on where to start? OKRs first or SCRUM first.  Thank you a lot !”

This question appeared in my mailbox – or rather my LinkedIn box – a few weeks ago and I thought other readers might be interested in my answer.

My answer is: do both together.

That will sound ambitious if you see Scrum and OKRs as distinct things, but I don’t; to me they are synergistic and can be viewed as one change. In the process I see the opportunity to create a simpler form of agile, or a simpler version of Scrum if you prefer. Let me describe.

The Scrum Master role remains pretty much the same as regular Scrum. The main change is that in addition to facilitating planning meetings and sprint retrospectives the Scrum Master will additionally facilitate OKR setting sessions and OKR retrospectives at the end of the cycle. They will want to ensure the OKRs are known by everyone and people reference to them during sprints.

Similarly the Product Owner role remains pretty much the same with the additional need for the PO to lead thinking on what OKRs to set. This means the Product Owner needs to do more horizon scanning, talk to more stakeholders and prepare to set objectives that will last several sprints rather than just one at a time.

Simplification comes in that there is no need to build up and administer a product backlog. Set the OKRs, then go straight into a planning meeting and ask “How do we make progress against this OKR?”. Write the stories you need to do that. Use the OKRs as a just-in-time story generator. If you generate stories you cannot do this sprint then put them in a product backlog for the next planning meeting.

In the next planning meeting review progress and ask yourself again: “what do we need to do to make progress on these OKRs?”. It might be that you have some stories from last time to work with, if not write some more. OKRs in effect become the Sprint Goal and you generate the stories from there. (If you are using a Product Goal as well then reference that when setting the OKRs at the start of the cycle.)

Estimation is reduced because you have fewer stories in play and in the backlog. If you want you can use story points and velocity to determine if the sprint is full or you can just ask the team “Is that enough work to keep us busy?” and “Do we have space for more?”

If you want a more sophisticated system then get the stories out on the table, put them in the order they are likely to be done, then go from top to bottom asking “On a scale of 1 to 10, what are the chances we will get this story done in this sprint? – where 1 is unlikely and 10 is almost certain.” If the score is below 8 you probably take action: break the story down, reorder the work, or just accept that you probably won’t do everything you would like to.

In the longer term you can count the stories done to build up a record of capacity.

I would aim to do end of sprint demonstrations and better still release to live. The OKR may not be complete but the stories in the sprint can start delivering value early. As usual keep quality high, aim for zero bugs and automate everything that needs testing.

If you are new to both Scrum and OKRs then I would probably run one week sprints and a six or eight-week OKR cycle the first time. Do a retrospective at the end of the OKR cycle, after that you might move to two-week sprints and 12 week OKR cycles but keep an open mind and decide what works for you.

All the way through keep delivering benefit to customers, hitting OKRs is icing on the cake, and there are no prizes for doing perfect Scrum.

I’d like to think I outlined this recipe in Succeeding with OKRs in Agile but I suspect I wasn’t as clear as I could have been. Increasingly I’m seeing OKRs as a means of stripping agile back and simplifying it.


Subscribe to my blog newsletter and download Continuous Digital for free

The post Scrum or OKRs, which comes first? appeared first on Allan Kelly.

Rounding and heaping in non-software estimates

Round numbers are often preferred in software task estimation times, e.g., 1, 5, 7 (hours in one working day), and 14. This human preference for round numbers is not specific to software, or to estimating. Round numbers can act as goals, as clustering points, may be used more often as uncertainty increases, or be the result of satisficing, etc.

Rounding can occur in response to any question involving a numeric value, e.g., a government census or survey asking citizens about their financial situation or health. Rounding introduces error in the analysis of data. The Whipple index, described in 1919, was the first attempt to quantify the amount of error; calculated as: “per cent which the number reported as multiples of 5 forms of one-fifth of the total number between ages 23 to 62 years inclusive.” for errors of reported age. Other metrics for this error have been proposed, and packages to calculate them are available.
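
A minimal sketch of the calculation, as defined in the quote above (ages is an assumed vector of reported ages):

# Hedged sketch of the Whipple index: reported ages 23-62 that are multiples
# of 5, as a percentage of one-fifth of all reported ages in that range.
whipple=function(ages)
{
in_range=ages[ages >= 23 & ages <= 62]
100*sum(in_range %% 5 == 0)/(length(in_range)/5)
}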

At some point (the evidence suggests a 1940 paper) the term heaping effect was introduced. These days, heaping is more often used than rounding to name the process, e.g., heaping of values; ‘heaping’ papers do use the term rounding, but I have not seen ‘rounding’ papers use heaping.

The choice of rounding values depends on the unit of measurement. For instance, reported travel arrival/departure times are rounded to intervals of 5, 15, 30 and 60 minutes; based on reported/actual travel times it is possible to estimate the probability that particular rounding intervals have been used.

The Whipple index fails when all the values are large (e.g., multiple thousands), or take a small range of values (e.g., between one and twenty).

One technique for handling rounding of large values is to define roundedness in terms of the fraction of value digits that are trailing zeroes. The plot below shows the number of households having a given estimated balance on their first mortgage in the 2013 Survey of Consumer Finances (in red), and the distribution of actual balances reported by the New York Federal Reserve (in blue/green; data extracted from plot in a paper and scaled to equalize total mortgage values; code+data):

Households estimated outstanding mortgage (red) and actual outstanding mortgages in New York (blue).

The relatively high number of distinct round numbers swamps any underlying distribution of actual values. While some values having some degree of roundness occur more often than non-round values, they still appear less often than expected from the known distribution. It is possible that homeowners have mortgages at round values because of bank lending limits, or for reasons other than rounding when answering the survey.
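
A minimal sketch of the trailing-zero measure of roundedness mentioned above (my own illustration of the idea, not the cited technique's exact definition):

# Hedged sketch: fraction of a value's digits that are trailing zeroes,
# e.g., 150000 -> 4/6, 123456 -> 0/6.
trailing_zero_fraction=function(v)
{
s=format(v, scientific=FALSE, trim=TRUE)
(nchar(s)-nchar(sub("0+$", "", s)))/nchar(s)
}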

The plot below shows the number of people reporting having a given number of friends, plus number of cigarettes smoked per day, from the 2015 survey of Objective and Subjective Quality of Life in Poland (code+data):

Number of people reporting having a given number of friends, plus number of cigarettes smoked per day.

The narrow range of a person’s number of friends prevents the Whipple index from effectively detecting rounding/heaping.

The dominance of round numbers in the cigarettes smoked per day may be caused by the number of cigarettes contained in a packet, i.e., people may be accurately reporting that they smoke the contents of a packet, rather than estimating a rounded number.

Simple techniques are available for correcting the mean/variance when values are always rounded to specified boundaries. When the probability of rounding is not 100%, the calculation is more complicated.

Rounded/Heaped data contains multiple distributions, i.e., the non-rounded values and the rounded values; various mixture models have been proposed to fit such data. Alternatively, the data can be ‘deheaped’, and various deheaping techniques have been proposed.

Given the prevalence of significant amounts of rounding/heaping, it’s surprising how few people know about it.

Visual Lint 8.0.9.353 has been released

Visual Lint 8.0.9.353 has now been released.

This is a maintenance update for Visual Lint 8.0, and includes the following changes:

  • Added support for Codegear C++ Builder 11 Alexandria.

  • Fixed a crash in the Visual Studio plugin when a file was opened in Visual Studio 2022 v17.2.0 Preview. The crash was apparently caused by a change or regression in the VS2022 implementation of EnvDTE::DocumentEvents::DocumentOpening().

  • When analysing Visual Studio projects the appropriate -std and -m32 or -m64 switches are now automatically added to generated Clang-Tidy command lines.

  • Fixed a bug in the Clang-Tidy analysis results parser which caused issues containing multiple square brackets to be parsed incorrectly.

  • Updated the PC-lint Plus compiler indirect file co-rb-vs2022.lnt to support Visual Studio 2022 v17.1.6.

  • Updated the PC-lint Plus compiler indirect file co-rb-vs2019.lnt to support Visual Studio 2019 v16.11.13.

  • Updated the PC-lint Plus compiler indirect file co-rb-vs2017.lnt to support Visual Studio 2017 v15.9.47.

  • Updated the PC-lint Plus compiler indirect files co-rb-vs2017.lnt, co-rb-vs2019.lnt and co-rb-vs2022.lnt to limit the value of _MSVC_LANG to 201703L. This is necessary as the version of the Clang front-end used by PC-lint Plus 1.4.x and earlier does not include C++ 20 compiler intrinsics.

  • Updated the installer to terminate any instances of mspdbsrv.exe (the Visual Studio program database symbol server) before installing the Visual Studio plugin to VS2017/19/22.

  • Updated the text on the "Analysis Tools" page of the installer.

  • Updated the readme and help documentation to reflect the fact that Windows 11 is now supported.

Download Visual Lint 8.0.9.353

Complex software makes economic sense

Economic incentives motivate complexity as the common case for software systems.

When building or maintaining existing software, often the quickest/cheapest approach is to focus on the features/functionality being added, ignoring the existing code as much as possible. Yes, the new code may have some impact on the behavior of the existing code, and as new features/functionality are added it becomes harder and harder to predict the impact of the new code on the behavior of the existing code; in particular, whether the existing behavior is unchanged.

Software is said to have an attribute known as complexity; what is complexity? Many definitions have been proposed, and it’s not unusual for people to use multiple definitions in a discussion. The widely used measures of software complexity all involve counting various attributes of the source code contained within individual functions/methods (e.g., McCabe cyclomatic complexity, and Halstead); they are all highly correlated with lines of code. For the purpose of this post, the technical details of a definition are glossed over.

Complexity is often given as the reason that software is difficult to understand; difficult in the sense that lots of effort is required to figure out what is going on. Other causes of complexity, such as the domain problem being solved, or the design of the system, usually go unmentioned.

The fact that complexity, as a cause of requiring more effort to understand, has economic benefits is rarely mentioned, e.g., the effort needed to actively use a codebase is a barrier to entry which allows those already familiar with the code to charge higher prices or increases the demand for training courses.

One technique for reducing the complexity of a system is to redesign/rework its implementation, from a system/major component perspective; known as refactoring in the software world.

What benefit is expected to be obtained by investing in refactoring? The expected benefit of investing in redesign/rework is that a reduction in the complexity of a system will reduce the subsequent costs incurred, when adding new features/functionality.

What conditions need to be met to make it worthwhile making an investment, I, to reduce the complexity, C, of a software system?

Let’s assume that complexity increases the cost of adding a feature by some multiple (greater than one). The total cost of adding n features is:

K=C_1*F_1+C_2*F_2 ...+C_n*F_n

where: C_i is the system complexity when feature i is added, and F_i is the cost of adding this feature if no complexity is present.

C_2=C_B+C_1, C_3=C_B+C_1+C_2, … C_n=C_B+sum{i=1}{n}{C_i}

where: C_B is the base complexity before adding any new features.

Let’s assume that an investment, I, is made to reduce the complexity from C_B+C_N (with C_N=sum{i=1}{n}{C_i}) to C_B+C_N-C_R, where C_R is the reduction in the complexity achieved. The minimum condition for this investment to be worthwhile is that:

I + K_{r2} < K_{r1}, or equivalently I < K_{r1} - K_{r2}

where: K_{r2} is the total cost of adding new features to the source code after the investment, and K_{r1} is the total cost of adding the same new features to the source code as it existed immediately prior to the investment.

Resetting the feature count back to 1, we have:

K_{r1}=(C_B+C_N+C_1)*F_1+(C_B+C_N+C_2)*F_2+...+(C_B+C_N+C_m)*F_m
and
K_{r2}=(C_B+C_N-C_R+C_1)*F_1+(C_B+C_N-C_R+C_2)*F_2+...+(C_B+C_N-C_R+C_m)*F_m

and the above condition becomes:

I < ((C_B+C_N+C_1)-(C_B+C_N-C_R+C_1))*F_1+...+((C_B+C_N+C_m)-(C_B+C_N-C_R+C_m))*F_m

I < C_R*F_1 + … + C_R*F_m

I < C_R*sum_{i=1}^{m} F_i

The decision on whether to invest in refactoring boils down to estimating the reduction in complexity likely to be achieved (as measured by effort), and the expected cost of future additions to the system.
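
As a concrete illustration, here is a minimal Python sketch of the break-even check; the investment, complexity reduction and per-feature base costs are made-up numbers, used only to show the calculation:

# Break-even check for a refactoring investment: worthwhile when
# I < C_R * sum(F_i), i.e., when the investment costs less than the
# complexity reduction multiplied by the summed base cost of the
# features planned after the refactoring.
def refactoring_worthwhile(investment, complexity_reduction, future_feature_costs):
    expected_saving = complexity_reduction * sum(future_feature_costs)
    return investment < expected_saving, expected_saving

# Made-up numbers: invest 40 units to reduce complexity by 0.5, with five
# planned features whose base costs are 10..30 units.
worthwhile, saving = refactoring_worthwhile(40, 0.5, [10, 15, 20, 25, 30])
print(f"expected saving={saving}, refactor={'yes' if worthwhile else 'no'}")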

Software systems eventually stop being used. If it looks like the software will continue to be used for years to come (software that is actively used will have users who want new features), it may be cost-effective to refactor the code to return it to a less complex state; rinse and repeat for as long as it appears cost-effective.

Investing in software that is unlikely to be modified again is a waste of money (unless the code is intended to be admired in a book or course notes).

Chasm City

Chasm City

Alastair Reynolds
ISBN-13: 978-0575083158

Following the announcement of the release of Inhibitor Phase and then Elysium Fire I’ve been rereading some of the previous Revelation Space novels to pick up the thread. First time around I found Chasm City a dark story and it was no different the second time, but I got so much more out of it. I also remember losing the thread towards the end the first time, but not this time!

As with most of the series, the thread of the main story is inconsequential to the main Revelation Space arc. It’s the other aspects of the story which tie up with other Revelation Space events which make this such a fantastic book. By the time I read the last page I knew that Sky's Edge was named after the edge Sky Haussmann had over the other ships in the flotilla which settled the planet. I knew that the war had started between the ships of the flotilla and what they were fighting about. I knew how the Melding Plague had got to Chasm City and how it was discovered and spread. I knew that Sky had met Khouri, who is an important character in the main trilogy. And more, much more! I wish I knew how the Melding Plague came to be though.

I read the second half of the book in about two weeks. I just couldn’t put it down!

Pathfinder – baron m.

Welcome Sir R-----! Pray shed your overcoat and come dry yourself by the fire. I am told that these spring showers are of inestimable benefit to farming folk, but I fail to grasp why they can’t show the good manners to desist until noblemen have made their way indoors.

Will you join me in a warming measure and perchance a small wager?

I had no doubt sir!

I’ve a mind for a game oft played by the tribesmen of Borneo upon the cobbled floors of their homes as a means of practice for their legendary talent in forging paths through the dense forests in which they dwell.

The Excess Strategic WIP problem

Try pouring a bottle of milk into a glass with milk already in it. You have a choice: stop or tidy up the mess afterwards. That is my work-in-progress (WIP) analogy: if you try to do too much – no matter how much you want to do it – you will end up with a mess.

Agile folk are well versed in the problems created by too much WIP and how to deal with it – check out the Stockless Production video if you want to see more. In the last six months I’ve been seeing a particular variant on this problem which I’ve come to label the Excess Strategic WIP problem.

In the latest report the manager told me how the team completed a great quarter with 3 priorities – set via OKRs. The senior management team were so impressed they asked for 19 priorities in the next period.

Right now I don’t have a “paint by numbers” solution; fixing this problem is more involved. I’m starting to understand it and I’ll make some suggestions later. Ultimately this is a failure of leadership to say “No”. That failure is itself rooted in a failure of leadership to appreciate what is happening on the ground and what those doing the work are up against: the number of strategic initiatives or the amount of business as usual.

In a team it is easy to spot excess WIP using a visual system like a whiteboard or card system. When you track work like this you see an awful lot of work sitting in the “in progress” column. Typically there is more work than people on the team and the work isn’t moving. Some of the work may be actively marked as blocked but more likely most of it is nominally being worked on.

It can’t all be worked on at the same time because, despite having two eyes, two hands, and two sides of a brain, humans can only really do one thing at a time – even new parents. I label this work “WHIP” – work hopefully in progress. While it is, in theory, being worked on, most of it is just sitting there waiting for a multi-tasking (i.e. time slicing) person to come back to it.

The good news is when you can see the WHIP you are half way to solving the problem: there are well known solutions. Accept less work, impose work in progress limits, sequence work by adding queues to the board, educate people to work on one thing until it is done, etc. Once you can see the work you have a feedback mechanism, you can take action and, thanks to the feedback mechanism, watch it reduce.

But Strategic WIP is more difficult. Strategic WIP is the stuff the organization decides is really important, the stuff the most senior leadership decides should be happening. The Excess Strategic WIP Problem occurs when those strategic priorities are greater than the organisation’s ability to deliver on them.

While some of this work may be transactional (“Build a Mega Widget”) much of it is transformational (“Adopt Agile working”, “Increase diversity”, “Tighten security”). Such things involve changing the way other work is done: changing the processes, changing the criteria, increasing awareness. The feedback cycle is long but to get started people need to devote time: attend training, arrange kick-off meetings, discuss approaches with consultants/coaches, etc.

In one case I saw this year the organization has, for years, been asked to do more than it is resourced to do. While a few months of such a mismatch might have been manageable, the cumulative effect has been to create organizational debt and demoralised staff.

The second case I saw isn’t a result of under resourcing; if anything that organization has more – money, people and perhaps equipment – than they know what to do with. But because their management model resembles a sponge it is impossible to know when the organization has taken on too much. While some pockets were overworked I’m sure other pockets were idle.

In both these cases “lean” was a dirty word. Both organizations had been subject to lean programmes that had stripped out “waste” but, from what I could see, removing that waste had created both organizational debt and demoralised staff. Having removed “spare” staff those left were juggling. Staff didn’t have time for “agile” and their diaries had no space for daily essentials let alone another change programme.

The third case came to light when discussing OKRs. The company had a successful quarter where they focused on a few objectives; that success seems to have bred the next failure, when the organization requested many objectives.

Actually, this case brought it home: taking on too many, or being given too many, OKRs. I’ve heard of teams tackling too many OKRs so many times in the last year. (In fact, I should name this the “Excess OKRs Problem”. This occurs whenever you have more OKRs than you can count on one hand. Don’t tell me your team is big enough to do so, check out Focus is not divisible so limit your OKRs.)

In a way, the Excess OKRs Problem is better than the Excess Strategic WIP Problem because you can see it. One can say “OK company, you have asked us to deliver 15 OKRs here, we need to sit down and talk about this.” Like visualising your work in progress, excess OKRs are something you can identify and address; they are the reason to talk.

(Note: unlike User Stories which should always be small, OKRs should never be small. OKRs should be big, meaningful and preferably strategic. No team can take on 16 meaningful OKRs, even 5 is too many.)

One of the problems with excess Strategic WIP is that it can be difficult to see: there are different teams involved, in different places. Conflicting priorities are hidden. People on the ground may see problems but the senior people – the people creating the WIP – are too far removed. Those people may not want to hear people saying “You are asking too much”. They may have too much riding on getting multiple work streams done. They may be deaf to the cry of pain when people say “too much.” They may have too big an ego to accept that what they are asking for is a problem. And it may be politically unacceptable.

“Political” is an important word here: several of the cases I’ve heard of, and some of those examples above, are Government agencies.

Because excess strategic WIP is difficult to see it is difficult to build a feedback loop and difficult to take action. How do you know when the problem is too much WHIP and when people are “crying wolf”? – which in itself implies a problem of trust and maybe a belief that “everyone is lazy, we need to push harder.”

By its nature “strategy” is big, which means that the feedback cycles are long and the problems of excess strategic WIP take time to play out. What is really a WIP problem looks like just another failed strategy.

While I would like to think OKRs can help with this situation – because they force teams and organizations to take stock of what they are working on – they may be making things worse because OKRs include an ambition agenda. Teams are encouraged to “shoot for the moon” and build “10x solutions”. There is good logic here: if one aims for a “10x solution” (i.e. a solution 10 times better than the status quo) and falls short the “failure” may still be “better” (e.g. “5x”) than if one had aimed for a “2x solution” and succeeded.

Ambition with OKRs should not be about doing more OKRs, rather ambition is within the OKRs, a challenge that makes you approach it differently. One can draw a line here between having one OKR which aims for “10x” and having 10 OKRs but I suspect the subtlety will be lost on someone asking for “more than you think you can do.”

So what is the solution?

I’m not sure there is a silver bullet but I would want to … make the problem visible, perhaps a portfolio level kanban board. I would want to build a feedback loop so I could measure change. I would want to show the strategic people the visualisation.

Management education has a part to play too. I can believe many senior managers would benefit from understanding what WIP is, the problems excess WIP can cause, the way excess WIP plays out on a day-to-day basis and how it affects people’s working lives. And perhaps most of all, address trust and the belief that “we just need to push harder.”

That might also mean some of the Lean waste lessons and OKR ambition lessons need to be revisited.

A new career in software development: advice for non-youngsters

Lately I have been encountering non-young people looking to switch careers into software development. My suggestions have centered around the ageism culture and how they can take advantage of fashions in software ecosystems to improve their job prospects.

I start by telling them the good news: the demand for software developers outstrips supply, followed by the bad news that software development culture is ageist.

One consequence of the preponderance of the young is that people are heavily influenced by fads and fashions, which come and go over less than a decade.

The perception of technology progresses through the stages of fashionable, established and legacy (management-speak for unfashionable).

Non-youngsters can leverage fashion’s impact on job applicants by focusing on what is unfashionable; the more unfashionable, the less likely that youngsters will apply, e.g., maintaining Cobol and Fortran code (both seriously unfashionable).

The benefits of applying to work with unfashionable technology include more than a smaller job applicant pool:

  • new technology (fashion is about the new) often experiences a period of rapid change, and keeping up with change requires time and effort. Does somebody with a family, or outside interests, really want to spend time keeping up with constant change at work? I suspect not,
  • systems depending on unfashionable technology have been around long enough to prove their worth, the sunk cost has been paid, and they will continue to be used until something a lot more cost-effective turns up, i.e., there is more job security compared to systems based on fashionable technology that has yet to prove its worth.

There is lots of unfashionable software technology out there. Software can be considered unfashionable simply because of the language in which it is written; some of the more well known of such languages include: Fortran, Cobol, Pascal, and Basic (in a multitude of forms), with less well known languages including MUMPS, and almost any mainframe related language.

Unless you want to be competing for a job with hordes of keen/cheaper youngsters, don’t touch Rust, Go, or anything being touted as the latest language.

Databases also have a fashion status. The unfashionable include: dBase, Clarion, and a whole host of 4GL systems.

Be careful with any database that is NoSQL related; it may be fashionable, or an established product being marketed using the latest buzzwords.

Testing and QA have always been very unsexy areas to work in. These areas provide the opportunity for the mature applicants to shine by highlighting their stability and reliability; what company would want to entrust some young kid with deciding whether the software is ready to be released to paying customers?

More suggestions for non-young people looking to get into software development welcome.

Return of the sprint goal? (Infographic)

Most of the sprint goals I’ve ever seen are rubbish. Pretty much “do the random collection of stuff we’ve decided already.” Such goals are meaningless – save time by skipping them.

If a team adopts OKRs then I really hope they move towards more goal based working, in which case the sprint goal starts to have meaning again – although maybe the OKR replaces the sprint goal?

Either way, I see an opportunity to move away from backlog driven development (BLDD) and towards a more purposeful style of working. So it was interesting a few weeks ago when Gareth Davies from Parabol (online meeting exercises and such) sent me over this infographic and a link to a blog on Sprint goals. Food for thought.

How I write books (my new book)

Regular readers will know I write books – quite a few by now, it gets embarrassing.

Being an author is a great conversation starter, when people hear you’ve written a book or two they want to know more – everyone seems to have a dream of writing their own book. It also means that people seek me out to ask my advice about a book they are writing, or thinking about writing.

So, I’ve started to write down all the advice I give to people in a new book – How I write books.

I’m following my usual pattern so you can buy early versions on LeanPub now – I released it last week and it immediately sold a few copies. As usual at this stage everything is in a state of flux; spelling, punctuation, grammar and all that jazz will be fixed later. Of course, anyone buying now will get free updates as they become available all the way to the final version.

If you do buy, then please let me know what you think.

Evaluating estimation performance

What is the best way to evaluate the accuracy of an estimation technique, given that the actual values are known?

Estimates are often given as point values, and accuracy scoring functions (for a sequence of estimates) have the form S = (1/n) * sum_{i=1}^{n} S(E_i, A_i), where n is the number of estimated values, E_i the estimates, and A_i the actual values; smaller S is better.

Commonly used scoring functions include:

  • S(E, A) = (E-A)^2, known as squared error (SE)
  • S(E, A) = |E-A|, known as absolute error (AE)
  • S(E, A) = |E-A|/A, known as absolute percentage error (APE)
  • S(E, A) = |E-A|/E, known as relative error (RE)

APE and RE are special cases of: S(E, A) = |1-(A/E)^beta|, with beta=-1 and beta=1 respectively.
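
For reference, here is a minimal Python sketch of these scoring functions, averaged over a sequence of estimate/actual pairs (not the code used for the analysis below):

# The four scoring functions discussed above, plus the general form;
# each is averaged over a sequence of (estimate, actual) pairs, and
# smaller values are better.
def SE(e, a):  return (e - a) ** 2          # squared error
def AE(e, a):  return abs(e - a)            # absolute error
def APE(e, a): return abs(e - a) / a        # absolute percentage error
def RE(e, a):  return abs(e - a) / e        # relative error

def general(e, a, beta):                    # APE is beta=-1, RE is beta=1
    return abs(1 - (a / e) ** beta)

def mean_score(score, estimates, actuals):
    return sum(score(e, a) for e, a in zip(estimates, actuals)) / len(actuals)

print(mean_score(APE, [10, 12], [8, 15]))   # 0.225 for these two made-up pairs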

Let’s compare three techniques for estimating the time needed to implement some tasks, using these four functions.

Assume that the mean time taken to implement previous project tasks is known, E_m. When asked to implement a new task, an optimist might estimate 20% lower than the mean, E_o=E_m*0.8, while a pessimist might estimate 20% higher than the mean, E_p=E_m*1.2. Data shows that the distribution of the number of tasks taking a given amount of time to implement is skewed, looking something like one of the lines in the plot below (code):

Two example distributions of number of tasks taking a given amount of time to implement.

We can simulate task implementation time by randomly drawing values from a distribution having this shape, e.g., zero-truncated Negative binomial or zero-truncated Weibull. The values of E_o and E_p are calculated from the mean, E_m, of the distribution used (see code for details). Below is each estimator’s score for each of the scoring functions (the best performing estimator for each scoring function in bold; 10,000 values were used to reduce small sample effects):

    SE   AE   APE   RE
E_o 2.73 1.29 0.51 0.56
E_m 2.31 1.23 0.39 0.68
E_p 2.70 1.37 0.36 0.86
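
The following Python sketch reproduces the shape of this experiment; the zero-truncated Negative binomial parameters are made-up (chosen only to give a skewed distribution), so the numbers will not match the table above exactly:

# Score the optimist/mean/pessimist point estimates against actual times
# drawn from a zero-truncated Negative binomial distribution.
import numpy as np

rng = np.random.default_rng(1)
actuals = rng.negative_binomial(3, 0.3, 20000)   # made-up shape parameters
actuals = actuals[actuals > 0][:10000]           # zero-truncate, keep 10,000

E_m = actuals.mean()                             # mean-based estimate
estimators = {"E_o": 0.8 * E_m, "E_m": E_m, "E_p": 1.2 * E_m}
scores = {"SE":  lambda e, a: (e - a) ** 2,
          "AE":  lambda e, a: np.abs(e - a),
          "APE": lambda e, a: np.abs(e - a) / a,
          "RE":  lambda e, a: np.abs(e - a) / e}

for name, est in estimators.items():
    row = {s: round(float(np.mean(f(est, actuals))), 2) for s, f in scores.items()}
    print(name, row)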

Surprisingly, the identity of the best performing estimator (i.e., optimist, mean, or pessimist) depends on the scoring function used. What is going on?

The analysis of scoring functions is very new. A 2010 paper by Gneiting showed that it does not make sense to select the scoring function after the estimates have been made (he uses the term forecasts). The scoring function needs to be known in advance, to allow an estimator to tune their responses to minimise the value that will be calculated to evaluate performance.

The mathematics involves Bregman functions (new to me), which provide a measure of distance between two points, where the points are interpreted as probability distributions.

Which, if any, of these scoring functions should be used to evaluate the accuracy of software estimates?

In software estimation, perhaps the two most commonly used scoring functions are APE and RE. If management selects one or the other as the scoring function to rate developer estimation performance, what estimation technique should employees use to deliver the best performance?

Assuming that information is available on the actual time taken to implement previous project tasks, then we can work out the distribution of actual times. Assuming this distribution does not change, we can calculate APE and RE for various estimation techniques, picking the technique that produces the lowest score.

Let’s assume that the distribution of actual times is zero-truncated Negative binomial in one project and zero-truncated Weibull in another (purely for convenience of analysis, reality is likely to be more complicated). Management has chosen either APE or RE as the scoring function, and it is now up to team members to decide the estimation technique they are going to use, with the aim of optimising their estimation performance evaluation.

A developer seeking to minimise the effort invested in estimating could specify the same value for every estimate. Knowing the scoring function (top row) and the distribution of actual implementation times (first column), the minimum-effort developer would always give an estimate that is the known mean of the actual times multiplied by the value listed:

                   APE   RE
Negative binomial  1.4   0.5
Weibull            1.2   0.6

For instance, if management specifies APE and the previous task actuals have a Weibull distribution, then always estimate the value 1.2*E_m.
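
A minimal sketch of how such a constant multiplier might be found, by brute-force search over a sample of actual times (again drawn from a made-up zero-truncated Negative binomial; the multipliers found depend on the distribution parameters):

# Brute-force search for the constant multiplier of the mean actual time
# that minimises the mean APE or RE score.
import numpy as np

rng = np.random.default_rng(1)
actuals = rng.negative_binomial(3, 0.3, 20000)   # made-up shape parameters
actuals = actuals[actuals > 0]                   # zero-truncate
mean_actual = actuals.mean()

def mean_ape(estimate): return float(np.mean(np.abs(estimate - actuals) / actuals))
def mean_re(estimate):  return float(np.mean(np.abs(estimate - actuals) / estimate))

multipliers = np.arange(0.1, 3.0, 0.01)
best_ape = min(multipliers, key=lambda m: mean_ape(m * mean_actual))
best_re  = min(multipliers, key=lambda m: mean_re(m * mean_actual))
print(f"multiplier minimising APE: {best_ape:.2f}, minimising RE: {best_re:.2f}")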

What mean multiplier should Esta Pert, an expert estimator, aim for? Esta’s estimates can be modelled by the equation Act*U(0.5, 2.0), i.e., the actual implementation time multiplied by a random value uniformly distributed between 0.5 and 2.0, i.e., Esta is an unbiased estimator. Esta’s table of multipliers is:

                   APE   RE
Negative binomial  1.0   0.7
Weibull            1.0   0.7

A company wanting to win contracts by underbidding the competition could evaluate Esta’s performance using the RE scoring function (to motivate her to estimate low), or they could use APE and multiply her answers by some fraction.

In many cases, developers are biased estimators, i.e., individuals consistently either under or over estimate. How does an implicit bias (i.e., something a person does unconsciously) change the multiplier they should consciously aim for (having analysed their own performance to learn their personal percentage bias)?

The following table shows the impact of particular under and over estimate factors on multipliers:

                 0.8 underestimate bias   1.2 overestimate bias
Score function          APE   RE            APE   RE
Negative binomial       1.3   0.9           0.8   0.6
Weibull                 1.3   0.9           0.8   0.6

Let’s say that one-third of those on a team underestimate, one-third overestimate, and the rest show no bias. What scoring function should a company use to motivate the best overall team performance?

The following table shows that neither of the scoring functions motivate team members to aim for the actual value when the distribution is Negative binomial:

                    APE   RE
Negative binomial   1.1   0.7
Weibull             1.0   0.7

One solution is to create a bespoke scoring function for this case. Both APE and RE are special cases of a more general scoring function (see top). Setting beta=-0.7 in this general form creates a scoring function that produces a multiplication factor of 1 for the Negative binomial case.

A Review: Incineration Fest 2022 – Metal is back!

Overall I really enjoyed Incineration Fest and would go again if the line up is right for me. What was really great was seeing metallers back at gigs with no restrictions and doing what we do best!

Winterfylleth

I completely fell in love with Winterfylleth when they played Bloodstock on the mainstage and even more so when they released the set as a live album. They are incredible and totally deserved to be opening proceedings at the Roundhouse for Incineration Fest. Actually, they deserved to be much higher up the bill. They’re a solid outfit, played what I wanted to hear and ended, as I always think of them ending from the live album, with Chris saying this is the last song “as time is short and our songs are long!” I need to see them do a headline set in a venue with a great PA soon.

Tsjuder

Tsjuder was the wildcard for me. I didn’t really know them and had heard only a few things on Spotify before, although what I heard was really good. I had no idea I was going to be blown away. They sounded incredible from the first note, which was even more impressive given that they are only a three piece and the PA in the Roundhouse wasn’t turning out to be great for definition.

Bloodbath

Bloodbath was really the reason I was at Incineration Fest. I’d missed them at Bloodstock ten years before as one of my sons was being born and I hadn’t had a chance to see them until now. Of course now Nick Holmes (Paradise Lost) rather than Mikael Åkerfeldt (Opeth) was on lead vocals.

I was very, very excited and from the moment I heard that trademark crunching guitar sound I was even more excited. They played for a full hour. Unlike the Black Metal bands on the bill there was more riffing and solos and a slightly different drum sound.

They’re an odd band to watch. For reasons I don’t understand, the bass player and two guitarists would often turn their backs to the audience to face the drummer. The band didn’t seem to interact much with each other on stage and even less so with Nick.

Nick’s deadpan humour was present when he did speak to the audience. He introduced the band as being from Sweden, then added from Halifax almost as an afterthought! During the set he admitted he couldn’t see and dispensed with his sun glasses as they’d apparently been a good idea backstage. After breaking the microphone he enquired if it would be added to his bill at the end of the night.

Emperor

Emperor hasn't released any new material (that I know of) since 2001 and, if I’m honest, I barely listen to them beyond the live album these days. I’ve seen them at least three times before, the first time being in 1999 in a small club in Bradford on my birthday - it doesn’t get much better than that. I’m more of a fan of Ihsahn’s solo stuff these days and I still really enjoy Samoth’s Zyklon whenever I play it. Emperor, not so much anymore.

They played for the full ninety and for the most part were solid, as you might expect. Whether or not Faust plays with them is of no consequence to me and I certainly didn’t need the covers they played with him towards the end of the set. There was lots I knew and lots I enjoyed, but I wouldn't make an effort to see Emperor again.

The Middle Way – a.k.

A few years ago we spent some time implementing a number of the sorting, searching and set manipulation algorithms from the standard C++ library in JavaScript. Since the latter doesn't support the former's abstraction of container access via iterators we were compelled to restrict ourselves to using native Array objects following the conventions of its methods, such as slice and sort.
In this post we shall take a look at an algorithm for finding the centrally ranked element, or median, of an array, which is strongly related to the ak.nthElement function, and then at a particular use for it.
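
For readers who have not met the idea before, here is a minimal quickselect-style sketch in Python of finding the nth-ranked element (it is only an illustration of the technique; it is not the ak library's implementation):

# Return the element that would be at index n if the array were sorted,
# without fully sorting it; the median is n = len(a) // 2.
import random

def nth_element(values, n):
    a = list(values)
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:                      # Hoare-style partition around the pivot
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i, j = i + 1, j - 1
        if n <= j:                         # nth element lies in the left part
            hi = j
        elif n >= i:                       # nth element lies in the right part
            lo = i
        else:                              # it is the pivot value itself
            return a[n]
    return a[n]

print(nth_element([5, 1, 4, 2, 3], 2))     # prints 3, the median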

Twitter and evidence-based software engineering

This year’s quest for software engineering data has led me to sign up to Twitter (all the software people I know, or know-of, have been contacted, and discovery through articles found on the Internet is a very slow process).

@evidenceSE is my Twitter handle. If you get into a discussion and want some evidence-based input, feel free to get me involved. Be warned that the most likely response, to many kinds of questions, is that there is no data.

My main reason for joining is to try and obtain software engineering data. Other reasons include trying to introduce an evidence-based approach to software engineering discussions and finding new (to me) problems that people want answers to (that are capable of being answered by analysing data).

The approach I’m taking is to find software engineering tweets discussing a topic for which some data is available, and to jump in with a response relating to this data. Appropriate tweets are found using the search pattern: (agile OR software OR "story points" OR "story point" OR "function points") (estimate OR estimates OR estimating OR estimation OR estimated OR #noestimates OR "evidence based" OR empirical OR evolution OR ecosystems OR cognitive). Suggestions for other keywords or patterns welcome.

My experience is that the only effective way to interact with developers is via meaningful discussion, i.e., cold-calling with a tweet is likely to be unproductive. Also, people with data often don’t think that anybody else would be interested in it; they have to be convinced that it can provide valuable insight.

You never know who has data to share. At a minimum, I aim to have a brief tweet discussion with everybody on Twitter involved in software engineering. At a minute per tweet (when I get a lot more proficient than I am now, and have workable templates in place), I could spend two hours per day to reach 100 people, which is 35,000 per year; say 20K by the end of this year. Over the last three days I have managed around 10 per day, and obviously need to improve a lot.

How many developers are on Twitter? Waving arms wildly, say 50 million developers and 1 in 1,000 have a Twitter account, giving 50K developers (of which an unknown percentage are active). A lower bound estimate is the number of followers of popular software related Twitter accounts: CompSciFact has 238K, Unix tool tips has 87K; perhaps 1 in 200 developers have a Twitter account, or some developers have multiple accounts, or there are lots of bots out there.

I need some tools to improve the search process and help track progress and responses. Twitter has an API and a developer program. No need to worry about them blocking me or taking over my business; my usage is small fry and I’m not building a business to take over. I was at Twitter’s London developer meetup in the week (the first in-person event since Covid) and the youngsters present looked a lot younger than usual. I suspect this is because the slightly older youngsters remember how Twitter cut developers off at the knees a few years ago by shutting down some useful API services.

The Twitter version-2 API looks interesting, and the Twitter developer evangelists are keen to attract developers (having ‘wiped out’ many existing API users), and I’m happy to jump in. There is a Twitter API sandbox for trying things out, and there are lots of example projects on Github. Pointers to interesting tools welcome.

The only thing you can do wrong, and the opposite of agile

The only thing you can do wrong in agile is to work the same way you did 3 months ago.

Corollary: The opposite of agile is static.

“The only thing you can do wrong” is born out of two things: first, my belief that agile is all about learning and putting learning into action.

Second, the proliferation of “agile tests” – maturity models, things like the Nokia Scrum test and the mental models in our own minds which condemn teams. Failing these tests means you label the team something like “Agile in name only”, “ScrumBut” or “ScrummerFall”. I’ve seen my own share of these teams but honestly, I don’t care. Motivation to change is a bigger issue; as long as the team is trying to improve it’s just a question of time – you might be “AINO” but if you keep trying you will become “Kick-ass agile.” (Plus, as an agile consultant these teams are potential clients, send them to me!)

If you are learning and attempting to act on that learning you will make some false moves, you will need to undo some changes but over time you will move forward.

You need to learn in three domains: the Solution domain – the tools and technology you use to craft solutions and systems; the Application or problem domain – understanding the problem/opportunity you are trying to solve/exploit, understanding what customers need and what the market will pay for (I tend to think of this as the demand side); and the Process domain – the way you work, your processes and practices, the most obvious place where “agile” fits in.

The boundaries between these domains are fluid; you need a learning culture in all three.

More recently I realised that “the only thing you can do wrong” has a corollary:

        The opposite of agile is static, not learning and not changing

20 years ago agile advocates invented Waterfall to be the enemy – sure the cascade model, à la Royce, was industry standard but nobody, NOBODY, called it Waterfall – believe me, I’m old enough to remember first hand. Agile was described by reference to what agile was not, and that not was named “waterfall.”

But that is wrong.

The secret is in the words: Agile implies movement and action.

Not moving is called Static, so the opposite of agile is static.

Once you accept agile means movement and change the next question is: how fast?

Rather than agile tests and “agile maturity models” we should be measuring the speed of change and the speed of improvement: how fast are the team learning and acting on that learning? How successful are they at that?

When I floated this suggestion on LinkedIn a few days ago a few people pushed back. One argument was that I was advocating “change for change’s sake”. I’m not. You are free to not change if you wish; all I’m saying is don’t call that agile, ‘cos it’s not, that is static.

Not changing is static, and static is not agile.

Now maybe static is the right thing for you. If your team can repeat their way of working, if they are predictable, and if the team and stakeholders are content then don’t change! Embrace static.

I’m doubtful such a static position is anything other than a short term Stable Intermediate Form. Static implies a stable environment, one in which all forces are in equilibrium: nobody is asking for faster delivery, nobody is changing their ask, nobody is complaining about technical debt, overwork, predictability and so on. If people are not complaining then celebrate, you are living the dream! Just don’t call it agile.

“Finished” software is static because nobody changes it. Software ages because it stays unchanged while the world around it changes.

A second argument was that constant change was not good for people. Now I’m not arguing for constant big change, or constant top-down change. In my mind agile change is largely incremental – sure sometimes you need a big change but most of the changes are small.

I also challenge the assumption that change is top-down and that change is done to people. In my experience when those doing the work are enrolled in the change process, when they can see the potential benefits then it is a different matter. In my experience “change resistance” is more likely “resistance to being changed.”

Finally, “last 3 months”. That time frame comes from my intuition: it is a period long enough to see if change has happened without demanding that the team are never repeating themselves. I imagine the team moving from one Stable Intermediate Form to another Stable Intermediate Form over those 3 months. Feel free to suggest another time frame but if you think 3 months is not long enough to see change then how long? 4 months? 6 months? 12 months?

User Stories by Example: certificate added to the free courses

A couple of months ago I made my online User Stories by Example tutorial series free. One client suggested that the series would benefit from a certificate at the end. Good idea, I’m always open to suggestions.

So, I’ve added a new tutorial to the series: exam and certificate – rather than just give a certificate of attendance to anyone who plays the videos I’ve set a little test. There is a bank of questions which are randomly selected and should cover the five areas of the tutorials. Score over 60% on the test and you get a certificate which lists the key topics covered.

I’ve set a small fee for this; once you have paid you have 21 days to do the exam, and you can sit the exam as many times as you like.

As always, if you have any suggestions or other feedback please let me know. And if you have any ideas for new questions send them over, I’d love to increase the question pool.

Improving my vimrc live on stream

I was becoming increasingly uncomfortable with how crufty my neovim config was getting, and especially how I didn’t understand parts of it, so I decided to wipe it clean and rebuild it from scratch.

I did it live on stream, to make it feel like a worthwhile activity:

Headline features of the new vimrc:

  • A new theme, using the base16 theme framework
  • A file browser (NERDTree)
  • A minimalist status line with vim-airline
  • Search with ripgrep
  • Rust language support with Coc

Note: after the stream I managed to resolve the remaining issues with highlight colours not showing, by re-applying them after the theme has been applied:

augroup tabs_in_make
    autocmd!
    autocmd ColorScheme * highlight MatchParen cterm=none ctermbg=none ctermfg=green
augroup END

You can find my current neovim config at gitlab.com/andybalaam/configs/-/tree/main/.config/nvim.

Evidence-based Software Engineering: now in paperback form

I made my Evidence-based Software Engineering book available as a pdf file. While making a printed version available looked possible, I was uncertain that the result would be of acceptable quality; the extensive use of color and an A4 page size restricted the number of available printers who could handle it. Email exchanges with several publishers suggested that the number of likely print edition copies sold would be small (based on experience with other books, under 100). The pdf was made available under a creative commons license.

Around half a million copies of the pdf have been downloaded (some partially).

A few weeks ago, I spotted a print version of this book on Amazon (USA). I have no idea who made this available. Is the quality any good? I was told that it was, so I bought a copy.

The printed version looks great, with vibrant colors, and is reasonably priced. It sits well in the hand, while reading. The links obviously don’t work for the paper version, but I’m well practised at using multiple fingers to record different book locations.

I have one report that the Kindle version doesn’t load on a Kindle or the web app.

If you love printed books, I heartily recommend the paperback version of Evidence-based Software Engineering; it even has a 5-star review on Amazon 😉

EnvDTE::DocumentEvents crash in VS2022 v17.1 Preview 2

Heads-up for anyone writing Visual Studio extensions that it looks like EnvDTE::DocumentEvents::DocumentOpening() has a regression or change in behaviour that can cause a crash freeing a BSTR if you use it in VS2022 v17.1 Preview 2 onwards.

The interface in question is in the legacy automation interface once used by macros/addins and is now marked as "Microsoft internal use only" so I doubt this will affect anyone writing new code. Still, it's noteworthy - especially as the crash can manifest as a hang while opening a solution, which makes it look like the IDE installation is corrupt.

Programming language similarity based on their traits

A programming language is sometimes described as being similar to another, more widely known, language.

How might language similarity be measured?

Biologists ask a very similar question, and research goes back several hundred years; phenetics (also known as taximetrics) attempts to classify organisms based on overall similarity of observable traits.

One answer to this question is based on distance matrices.

The process starts by flagging the presence/absence of each observed trait. Taking language keywords (or reserved words) as an example, we have (for a subset of C, Fortran, and OCaml):

            if   then  function   for   do   dimension  object
C            1     1       0       1     1       0        0
Fortran      1     0       1       0     1       1        0
OCaml        1     1       1       1     1       0        1

The distance between these languages is calculated by treating this keyword presence/absence information as an n-dimensional space, with each language occupying a point in this space. The following shows the Euclidean distance between pairs of languages (using the full dataset; code+data):

                C        Fortran      OCaml
C               0        7.615773     8.717798
Fortran      7.615773    0            8.831761
OCaml        8.717798    8.831761     0
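
A minimal Python sketch of the distance calculation, using the small keyword subset above rather than the full dataset (so the numbers will differ from the table):

# Pairwise Euclidean distance between languages, treating each language's
# keyword presence/absence flags as a point in n-dimensional space.
import math

traits = {                                  # if then function for do dimension object
    "C":       [1, 1, 0, 1, 1, 0, 0],
    "Fortran": [1, 0, 1, 0, 1, 1, 0],
    "OCaml":   [1, 1, 1, 1, 1, 0, 1],
}

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

for lang1 in traits:
    row = {lang2: round(euclidean(traits[lang1], traits[lang2]), 3) for lang2 in traits}
    print(lang1, row)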

Algorithms are available to map these distance pairs into tree form; for biological organisms this is known as a phylogenetic tree. The plot below shows such a tree derived from the keywords supported by 21 languages (numbers explained below, code+data):

Tree showing relative similarity of languages based on their keywords.

How confident should we be that this distance-based technique produced a robust result? For instance, would a small change to the set of keywords used by a particular language cause it to appear in a different branch of the tree?

The impact of small changes on the generated tree can be estimated using a bootstrap technique. The particular small-change algorithm used to estimate confidence levels for phylogenetic trees is not applicable for language keywords; genetic sequences contain multiple instances of four DNA bases, and can be sampled with replacement, while language keywords are a set of distinct items (i.e., cannot be sampled with replacement).

The bootstrap technique I used was: for each of the 21 languages in the data, add keywords to that language (the number added was 5% of the number of its existing keywords, randomly chosen from the set of all language keywords), calculate the distance matrix and build the corresponding tree; repeat 100 times. The 2,100 generated trees were then compared against the original tree, counting how many times each branch remained the same.
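
A rough Python sketch of that perturbation loop (the keyword sets are toy data, the average-linkage clustering is my assumption for how the tree might be built, and the branch-comparison step is omitted):

# Perturbation loop: add ~5% extra keywords to one language, recompute the
# distance matrix, and rebuild the tree; counting unchanged branches over
# all generated trees gives the confidence percentages.
import random
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

def build_tree(keyword_sets, all_keywords):
    matrix = [[1 if k in kws else 0 for k in all_keywords]
              for kws in keyword_sets.values()]
    return linkage(pdist(matrix, metric="euclidean"), method="average")

def perturb(keyword_sets, language, all_keywords):
    kws = set(keyword_sets[language])
    extra = max(1, round(0.05 * len(kws)))
    kws |= set(random.sample(sorted(all_keywords - kws), extra))
    return {**keyword_sets, language: kws}

# Toy data; the real analysis uses the keyword sets of all 21 languages.
keyword_sets = {"C": {"if", "for", "do"}, "Fortran": {"if", "do", "dimension"},
                "OCaml": {"if", "for", "do", "object"}}
all_keywords = set().union(*keyword_sets.values())
trees = [build_tree(perturb(keyword_sets, lang, all_keywords), all_keywords)
         for lang in keyword_sets for _ in range(100)]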

The numbers in the above plot show the percentage of generated trees where the same branching decision was made using the perturbed keyword data. The branching decisions all look very solid.

Can this keyword approach to language comparison be applied to all languages?

I think that most languages have some form of keywords. A few languages don’t use keywords (or reserved words), and there are some edge cases. Lisp doesn’t have any reserved words (they are functions), nor technically does PL/1 in that the names of ‘word tokens’ can be defined as variables, and CHILL implementors have to choose between using Cobol or PL/1 syntax (giving CHILL two possible distinct sets of keywords).

To what extent are a language’s keywords representative of the language, compared to other languages?

One way to try and answer this question is to apply the distance/tree approach using other language traits; do the resulting trees have the same form as the keyword tree? The plot below shows the tree derived from the characters used to represent binary operators (code+data):

Tree showing relative similarity of languages based on their binary operator character representation.

A few of the branching decisions look as if they are likely to change, if there are changes to the keywords used by some languages, e.g., OCaml and Haskell.

Binary operators don’t just have a character representation, they can also have a precedence and associativity (neither are needed in languages whose expressions are written using prefix or postfix notation).

The plot below shows the tree derived from combining binary operator and the corresponding precedence information (the distance pairs for the two characteristics, for each language, were added together, with precedence given a weight of 20%; see code for details).

Tree showing relative similarity of languages based on their binary operator character representation and corresponding precedence.

No bootstrap percentages appear because I could not come up with a simple technique for handling a combination of traits.
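
For what it is worth, here is a sketch of the weighted combination step (the two distance matrices below are placeholders, not the real per-language values, and the linkage method is an assumption):

# Combine two per-trait distance matrices by weighted addition, with the
# precedence distances given a 20% weight, then build a tree from the result.
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

operator_dist = np.array([[0.0, 7.6, 8.7],      # placeholder values
                          [7.6, 0.0, 8.8],
                          [8.7, 8.8, 0.0]])
precedence_dist = np.array([[0.0, 3.1, 4.2],    # placeholder values
                            [3.1, 0.0, 2.5],
                            [4.2, 2.5, 0.0]])

combined = operator_dist + 0.2 * precedence_dist
tree = linkage(squareform(combined), method="average")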

Are binary operators more representative of a language than its keywords? Would a combined keyword/binary operator tree be more representative, or would more traits need to be included?

Does reducing language comparison to a single number produce something useful?

Languages contain a complex collection of interrelated components, and it might be more useful to compare their similarity by discrete components, e.g., expressions, literals, types (and implicit conversions).

What is the purpose of comparing languages?

If it is for promotional purposes, then a measurement based approach is probably out of place.

If the comparison has a source code orientation, weighting items by source code occurrence might produce a more applicable tree.

Sometimes one language is used as a reference, against which others are compared, e.g., C-like. How ‘C-like’ are other languages? Taking keywords as our reference point and comparing languages based on just the keywords they have in common with C, the plot below shows the resulting tree:

Tree showing similarity of languages based on the keywords they share with C.

I had expected less branching, i.e., more languages having the same distance from C.

New languages can be supported by adding a language file containing the appropriate trait information. There is a Github repo, prog-lang-traits; send me a pull request to add your language file.

It’s also possible to add support for more language traits.

On Pitfall – student

Recall that in the Baron's latest wager, Sir R-----'s goal was to traverse a three by three checkerboard in steps determined by casts of a four sided die, each at a cost of two coins. Moving from left to right upon the first rank and advancing to the second upon its third file, thereafter from right to left and advancing upon the first file and finally from left to right again, he should have prevailed for a prize of twenty five coins had he landed upon the top right place. Frustrating his progress, however, were the rules that landing upon a black square dropped him back down to the first rank and that overshooting the last file upon the last rank required that he should move in reverse by as many places with which he had done so.

Devin Townsend at the Royal Albert Hall (again)

Leprous

There’s an obvious pull for me towards Leprous due to the association with Ihsahn and prog, but rock bands generally do little for me these days. I listened to a little of Aphelion before the gig, but it didn’t grip me.

They’re an odd live band and some of the time the cello player looked a bit out of place when he was without his cello. The sharing of the keyboards among various band members, often in the same song, was also weird. The singer was wearing a waistcoat and doing some very odd dancing and his voice can grate. For a prog band the lack of any guitar or keyboard lead breaks was also weird.

However, I quite enjoyed Leprous!


Devin Townsend

We’d only seen Devin Townsend a few months ago (in the summer at Bloodstock), but my wife loves him so we went again. We should have gone the night before as he played loads of songs we knew, in contrast to the night we went where he played nothing we knew! Most of it, I am reliably informed, was from the Ocean Machine and Infinity albums.

Devin still plays brilliantly and it was great to see him again with the session musicians he’d teamed up with for his Bloodstock performance. He creates a fantastic wall of sound and engages with the crowd like few others. I’m sure we’ll go and see him again, after all we’ve not heard Hyperdrive live yet!

Agile OKRs extra – yet another book

I blogged last week that I had begun work on a new book, How I Write Books, which is now a work in progress at LeanPub – sign up and be the first to know when the draft is published.

Well a funny thing happened while I was setting up my tool chain to write that book: I found another book! Well, perhaps half a book is a better description.

Succeeding with OKRs in Agile Extra is a companion to last year’s best seller, Succeeding with OKRs in Agile. But it isn’t a complete book in its own right, it isn’t really a sequel, it is a companion. It contains a mix of material: material which didn’t really fit in the first book, material which wasn’t needed, ideas which didn’t develop far enough and some unfinished chapters.

As such it is like my Xanpan Appendix, unused material which is still interesting and might appear elsewhere in time.

I really want to work on How I write books so I don’t have any immediate plans to progress extra. If you enjoyed Succeeding with OKRs in Agile, if you would like to know more, or if you would like to just see how a writer’s mind works check out Succeeding with OKRs in Agile Extra.

Ecology as a model for the software world

Changing two words in the Wikipedia description of Ecology gives “… the study of the relationships between software systems, including humans, and their physical environment”; where physical environment might be taken to include the hardware on which software runs and the hardware whose behavior it controls.

What do ecologists study? Wikipedia lists the following main areas; everything after the first sentence, in each bullet point, is my wording:

  • Life processes, antifragility, interactions, and adaptations.

    Software system life processes include its initial creation, devops, end-user training, and the sales and marketing process.

    While antifragility is much talked about, it is something of a niche research topic. Those involved in the implementations of safety-critical systems seem to be the only people willing to invest the money needed to attempt to build antifragile software. Is N-version programming the poster child for antifragile system software?

    Interaction with a widely used software system will have an influence on the path taken by cultures within associated microdomains. Users adapt their behavior to the affordance offered by a software system.

    A successful software system (and even unsuccessful ones) will exist in multiple forms, i.e., there will be a product line. Software variability and product lines is an active research area.

  • The movement of materials and energy through living communities.

    Is money the primary unit of energy in software ecosystems? Developer time is needed to create software, which may be paid for or donated for free. Supporting a software system, or rather supporting the needs of the users of the software is often motivated by a salary, although a few do provide limited free support.

    What is the energy that users of software provide? Money sits at the root; user attention sells product.

  • The successional development of ecosystems (“… succession is the process of change in the species structure of an ecological community over time.”)

    Before the Internet, monthly computing magazines used to run features on the changing landscape of the computer world. These days, we have blogs/podcasts telling us about the latest product release/update. The Ecosystems chapter of my software engineering book has sections on evolution and lifespan, but the material is sparse.

    Over the longer term, this issue is the subject studied by historians of computing.

    Moore’s law is probably the most famous computing example of succession.

  • Cooperation, competition, and predation within and between species.

    These issues are primarily discussed by those interested in the business side of software. Developers like to brag about how their language/editor/operating system/etc is better than the rest, but there is no substance to the discussion.

    Governments have an interest in encouraging effective competition, and have enacted various antitrust laws.

  • The abundance, biomass, and distribution of organisms in the context of the environment.

    These are the issues where marketing departments invest in trying to shift the distribution in their company’s favour, and venture capitalists spend their time trying to spot an opportunity (and there is the clickbait of language popularity articles).

    The abundance of tools/products, in an ecosystem, does not appear to deter people creating new variants (suggesting that perhaps ambition or dreams are the unit of energy for software ecosystems).

  • Patterns of biodiversity and its effect on ecosystem processes.

    Various kinds of diversity are important for biological systems, e.g., the mutual dependencies between different species in a food chain, and genetic diversity as a resource that provides a mechanism for species to adapt to changes in their environment.

    It’s currently fashionable to be in favour of diversity. Diversity is so popular in ecology that a 2003 review listed 24 metrics for calculating it. I’m sure there are more now.

    Diversity is not necessarily desired in software systems, e.g., the runtime behavior of source code should not depend on the compiler used (there are invariably edge cases where it does), and users want the commands of different editors to be consistently similar.

    Open source has helped to reduce diversity for some applications (by reducing the sales volume of a myriad of commercial offerings). However, the availability of source code significantly reduces the cost/time needed to create close variants. The 5,000+ different cryptocurrencies suggest that the associated software is diverse, but the rapid evolution of this ecosystem has driven developers to base their code on the source used to implement earlier currencies.

    Governments encourage competitive commercial ecosystems because competition discourages companies charging high prices for their products, just because they can. Being competitive requires having products that differ from other vendors in a desirable way, which generates diversity.

How I write books – A book about books

In the last 15 years I’ve written and published 3 books with publishers, published 5 books myself, plus edited one conference proceedings and pushed out three “mini books” (one with 3 editions) which I never publicised.

In addition I’ve contributed forewords and chapters to at least six books and had two books translated.

Then there are countless magazine and journal articles but they stretch back further, closer to 25 years – and this blog from 2005.

Not bad for a kid who was thrown out of school after asking a teacher how to spell “at” – at the age of eight, a diagnosed dyslexic who had to learn to read three times – I can’t read my own handwriting it’s so bad.

As a result I’ve learned a lot about writing and publishing. In the last few years I’ve spoken to many people who want to know how to write and publish their own book. A couple of years ago Steve Smith suggested I write a book about writing books. I’ve been avoiding that until this month.

Now I’ve started: How I write books, https://leanpub.com/howIwrite – sign up to be the first to know when the MVP is published. And if there is anything you would like me to write about in the book please let me know.

NoEstimates panders to mismanagement and developer insecurity

Why do so few software development teams regularly attempt to estimate the duration of the feature/task/functionality they are going to implement?

Developers hate giving estimates; estimating is very hard and estimates are often inaccurate (at a minimum making the estimator feel uncomfortable and worse when management treats an estimate as a quotation). The future is uncertain and estimating provides guidance.

Managers tell me that the fear of losing good developers dissuades them from requiring teams to make estimates. Developers have told them that they would leave a company that required them to regularly make estimates.

For most of the last 70 years, demand for software developers has outstripped supply. Consequently, management has to pay a lot more attention to the views of software developers than the views of those employed in most other roles (at least if they want to keep the good developers, i.e., those who will have no problem finding another job).

It is not difficult for developers to get a general idea of how their salary, working conditions and practices compare with those of other developers in their field/geographic region. They know that estimating is not a common practice, and unless the economy is in recession, finding a new job that does not require estimation could be straightforward.

Management’s demands for estimates have led to the creation of various methods for calculating proxy estimate values, none of which use time as the unit of measure, e.g., Function points and Story points. These methods break the requirements down into smaller units, and subcomponents from these units are used to calculate a value, e.g., the Function point calculation includes items such as number of user inputs and outputs, and number of files.

How accurate are these proxy values, compared to time estimates?

As always, software engineering data is sparse. One analysis of 149 projects found that Cost approx FunctionPoints^{0.75}, with the variance being similar to that found when time was estimated. An analysis of Function point calculation data found a high degree of consistency in the calculations made by different people (various Function point organizations have certification schemes that require some degree of proficiency to pass).
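
As a rough illustration of what fitting such a relationship involves, here is a sketch (in Python) that recovers a power-law exponent by log-log regression; the data is synthetic, generated with an exponent of 0.75, and is not the 149-project dataset analysed above.

# Fit log(cost) = b*log(fp) + log(a) to synthetic data generated with exponent 0.75.
import numpy as np

rng = np.random.default_rng(42)
fp = rng.uniform(50, 2000, size=149)                          # function point counts
cost = 3.0 * fp**0.75 * rng.lognormal(sigma=0.5, size=149)    # noisy power law

b, log_a = np.polyfit(np.log(fp), np.log(cost), deg=1)
print(f"estimated exponent: {b:.2f}")                         # close to the 0.75 used above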

Managers don’t seem to be interested in comparing estimated Story points against estimated time, preferring instead to track the rate at which Story points are implemented, e.g., velocity, or burndown. There are tiny amounts of data comparing Story points with time and Function points.

The available evidence suggests a relationship connecting Function points to actual time, and that Function points have similar error bounds to time estimates; the lack of data means that Story points are currently just a source of technobabble and number porn for management power-points (send me Story point data to help change this situation).

Testing Our Students – a.k.

Last time we saw how we can use the chi-squared distribution to test whether a sample of values is consistent with pre-supposed expectations. A couple of months ago we took a look at Student's t-distribution which we can use to test whether a set of observations of a normally distributed random variable are consistent with its having a given mean when its variance is unknown.
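
For readers who just want to see the shape of such a test, here is a minimal sketch using scipy, with made-up observations (my own example, not taken from the post):

# One-sample Student's t-test: is the sample consistent with a mean of 10, variance unknown?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
observations = rng.normal(loc=10.4, scale=2.0, size=25)    # made-up data

t_stat, p_value = stats.ttest_1samp(observations, popmean=10.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")              # a small p suggests the mean is not 10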

Learnings from Decapitated

When am I going to learn? 

The first two times I saw Decapitated, at Bloodstock and then supporting someone at the Waterfront in Norwich, they were incredible. 

In early 2020 in London they sounded awful and we left. Last night in Norwich they sounded terrible again. 

I don’t know if it was them or the sound system, but there was no definition. It was all drums, vocals and not much else, so we gave up halfway through. 

I’m hoping the new album will be amazing, that they’ll play at the UEA and that the gig will be great. However, if I really am going to learn, then I won’t be risking it.

Its the engineers, stupid – one from the heart

When I engage with companies and teams I’m always keen – nay, desperate – to get to meet the engineers and teams who are doing the work. If days, maybe even weeks, go by and I’m not doing that, I get very frustrated. More importantly, I’m not sure what to believe from those I am talking to.

There was once a bank I spent time with. As soon as I got to the office I discovered almost all the engineers were in a far away country and I wasn’t going to get to visit that country. The few engineers in the London office spent a lot of their time hand-holding those in the far away place. When you looked closely, when you spoke to the engineers far away you found things didn’t add up. One delivered a perfect 10 story points every iteration without fail. Another team increased velocity sprint after sprint. One engineer fell off his moped and broke his arm, the work was still delivered on time – it took all my wiles to discover another engineer had worked all weekend to meet the deadline.

Why am I so desperate to meet the engineers? – well there are several reasons, some more rational than others.

First off, the engineers are where the work happens. In lean parlance they are the gemba, source of truth.

Second, these are the people who will need to change or be changed. There is only so much you can change with an organigram – and to be honest, I’m doubtful reorgs really change much. Sometimes I imagine managers moving their workers around like pawns on a chess board while the reality of work is hand-to-hand combat.

Thirdly, and perhaps most importantly for me: I see my role as helping these people. I am, by profession, by temperament and by ancestry an engineer. I am motivated by the desire to help those who do the work have a more fulfilling life. I still remember the frustrations I faced as a coding software engineer.

That’s why it hurts – really hurts – when engineers tell me “agile is rubbish”, that “agile has nothing to offer”, when they tell me that I’m not helping. It’s not that I’m precious about agile, “agile” is just the toolset I’ve found helps. I also know that the toolkit allows me to go outside the toolkit.

I was hired by a Californian company to give agile training to their Cambridge team. A few minutes in, one of the engineers told me directly “Agile can’t help us here, we can’t go any lower.” The other engineers in the room were of the same opinion. It turned out the managers had been to Scrum training and come back pumped up about high performing teams and faster-better-cheaper. Sustainable pace, autonomy and quality weren’t on the table.

That hurt and it may have been the toughest training gig I’ve ever had but I think I turned it around. I demonstrated the need for quality and explained the managers were missing essential parts of the puzzle. Unfortunately I didn’t get to meet the managers – they were off playing chess.

But I do engage with managers. Often they are the route to the engineers. Unfortunately some engineers see that as a problem in itself: “our problem is tech debt, sprinting won’t help us” so I’m discounted. In my world – the world of Xanpan – sprinting is a rod you put up your back to make yourself better, if you don’t address quality (e.g. tech debt) issues then you won’t succeed at time-boxed iterations.

(BTW I talk about engineers because most of my work is with engineers, and software engineers at that. I’ve worked a little with other professions and I’m sure most of what I say carries across directly but my experience and empathy is greatest with engineers.)

To deal with managers one needs to understand their concerns, one needs to listen and speak in ways they understand. Engineers may struggle with managers and technical issues but managers also struggle with their managers, organizational debt, customers and the market.

The same is true when I wander over into the world of product ownership – Product Managers and Business Analysts. Engineers have a bad habit of seeing these roles as “Management” but if you spend time with the “demand side” people you find their concerns are almost identical to coding engineers. BAs worry that what they are being asked to do is unreasonable, that it doesn’t make sense, that something else needs to change first and that people don’t appreciate how things really work. The biggest difference between programmers and BAs is simply that, on average, BAs dress more smartly and are more likely to put on a tie.

One can’t understand a system and one can’t get to the truth if one can’t visit the place where work happens. When manufacturing things that place is the production line, in the digital world that place is the mind. Constructing software is an intellectual exercise that happens in the mind and is only manifested via a keyboard in code. To see the truth one has to speak to engineers.

I’ve seen some awful work environments: a room packed with 28 engineers, very few windows, little fresh air, a development manager on a raised platform at one end, the HR manager at the other end, her desk right by the single door in and out with the clock-in-clock-out cards on the wall.

More recently a large project at a matrix managed organization. The complexity made it difficult to know who was actually on the project and what teams existed. Management existed in its own bubble.

I feel pain simply seeing such places. What it can be like to work there I can only imagine. I assume people become numb to the pain, switch off to the failings and accept the normalisation of deviance. Or, to put it another way: a culture of failure.

Both of these two examples shared one thing in common: massive Gantt charts which claimed to plan the work. In one case I saw someone scheduled to spend a month writing a manual in two years time. While these charts claim rationality they are so disconnected from the gemba as to be fantasies. I feel cognitive dissonance knowing that the managers who put their faith in such mechanisms are both rational and totally mad.

Encountering such places is painful for me. On the one hand I want to help, I want to make the engineers lives better – that is what I do! The challenge can be great. On the other hand it can be mentally and emotionally draining. Because I am passionate about what I do I feel that. If I switched off, if I treated it as a money paying gig, then I too would become part of the same culture and lose my efficacy.

On the other hand, when things go right I love it – perhaps because I’m an engineer and I see fixing the organization as a way of fixing the code: it’s called Conway’s Law.


Subscribe to my blog newsletter and download Continuous Digital for free

The post Its the engineers, stupid – one from the heart appeared first on Allan Kelly.

Réussir avec les OKR en Agile (French OKRs)

I am delighted to say the French translation of Succeeding with OKRs in Agile – Réussir avec les OKR en Agile – is now available, thanks to the hard work of Nicolas Mereaux and Fabrice Aimetti.

The book is available right now on LeanPub as an e-book. After Easter we’ll start work on getting a print version available.

Until then a big thanks to Nicolas and Fabrice!

(Please get in touch if you are interested in translating the book to your favourite language.)

The post Réussir avec les OKR en Agile (French OKRs) appeared first on Allan Kelly.

4,000 vs 400 vs 40 hours of software development practice

What is the skill difference between professional developers and newly minted computer science graduates?

Practice, e.g., 4,000 vs. 400 hours

People get better with practice, and after two years (around 4,000 hours) a professional developer will have had at least an order of magnitude more practice than most students; not just more practice, but advice and feedback from experienced developers. Most of these 4,000 hours are probably not the deliberate practice of 10,000 hours fame.

It’s understandable that graduates with a computing degree consider themselves to be proficient software developers; this opinion is based on personal experience (i.e., working with other students like themselves), and not having spent time working with professional developers. It’s not a joke that a surprising number of academics don’t appreciate the student/professional difference; the problem is that some academics only ever get to see a limited range of software development expertise (it’s a question of incentives).

Surveys of student study time have found that for Computer science, around 50% of students spend 11 hours or more, per week, in taught study and another 11 hours or more doing independent learning; let’s take 11 hours per week as the mean, and 30 academic weeks in a year. How much of the 330 hours per year of independent learning time is spent creating software (that’s 1,000 hours over a three-year degree, assuming that any programming is required)? I have no idea, and picked 40% because the resulting 400 hours fits the sequence of 4s.
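
The arithmetic behind that guess, written out (the 40% programming fraction is, as stated, a guess):

# The numbers from the text, multiplied out; the 40% programming fraction is the guess above.
hours_per_week = 11
weeks_per_year = 30
degree_years = 3
programming_fraction = 0.4

independent_hours_per_year = hours_per_week * weeks_per_year        # 330
programming_hours = independent_hours_per_year * degree_years * programming_fraction
print(programming_hours)                                            # ~396, i.e., the 400 hours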

Based on my experience with recent graduates, 400 hours sounds high (I have no idea whether an average student spends 4 hours per week doing programming assignments). While a rare few are excellent, most are hopeless. Perhaps the few hours per week nature of their coding means that they are constantly relearning, or perhaps they are just cutting and pasting code from the Internet.

Most graduates start their careers working in industry (around 50% of comp sci/maths graduates work in an ICT profession; UK higher-education data), which means that those working in industry are ideally placed to compare the skills of recent graduates and professional developers. Professional developers have first-hand experience of their novice-level ability. This is not a criticism of computing degrees; there are only so many hours in a day and lots of non-programming material to teach.

Many software developers working in industry don’t have a computing related degree (I don’t). Lots of non-computing STEM degrees give students the option of learning to program (I had to learn FORTRAN, no option). I don’t have any data on the percentage of software developers with a computing related degree, and neither do I have any data on the average number of hours non-computing STEM students spend on programming; I’ve chosen 40 hours to flow with the sequence of 4s (some non-computing STEM students spend a lot more than 400 hours programming; I certainly did). The fact that industry hires a non-trivial number of non-computing STEM graduates as software developers suggests that, for practical purposes, there is not a lot of difference between 400 and 40 hours of practice; some companies will take somebody who shows potential, but no existing coding knowledge, and teach them to program.

Many of those who apply for a job that involves software development never get past the initial screening; something like 80% of people applying for a job that specifies the ability to code, cannot code. This figure is based on various conversations I have had with people about their company’s developer recruitment experiences; it is not backed up with recorded data.

Some of the factors leading to this surprisingly high value include: people attracted by the salary deciding to apply regardless; graduates with a computing degree that did not require any programming (there is customer demand for computing degrees, and many people find programming just too hard to handle, so universities offer computing degrees where programming is optional); and concentration of the applicant pool, because those who can code exit the pool, leaving behind those who cannot program (and who keep on applying).

Apologies to regular readers for yet another post on professional developers vs. students, but I keep getting asked about this issue.

No unified theory of agile (Agile mindset cont.)

Continuing my quest of “The Agile Mindset” I’ve been searching for a metaphor to put all the different ideas on agile into order. To cut to the chase: there isn’t one.

As much as I would love to boil “agile” down to one thing, or even a handful of key concepts, I don’t think there will ever be a single unified theory of agile. As I said before with the elephant example, everyone sees something different. Depending on where you are standing, the problems you face today, your own history and area of knowledge, and your own world view, you are going to see and emphasise different aspects of “The Agile Mindset.” And you know what? That is a good thing!

First I tried thinking of Agile as the layers of an onion. The outside skin is the word “agile” – there is a valid reason for saying the agile mindset is there in the dictionary definition of the word agile: able to move, think and understand quickly, be nimble and subtle in movements, alert and observant when looking at what’s happening and sharp when thinking. But that is only the outside layer, it is very general.

The agile manifesto would be the second layer. While the manifesto is specific to software it doesn’t require too much thought to generalise it to other domains. Actually, it is a bit too easy and there are more attempts to generalise it than there are people who have attempted to generalise it. Plus, as I’ve written before, the manifesto is over 20 years old, those who cling to it sound like Supreme Court Justices trying to read meaning into a document written in a different age.

And what are the next layers, and in what order? Does last responsible moment come above or below cost of delay? Is test driven more important than time-boxing? Are work-in-progress limits a version of time-boxing or an alternative to time-boxing?

And what is at the centre? For me it is learning but I can imagine people who will say it is People – perhaps manifested through Weinberg’s “It’s always a people problem” quote – I’ve written about that one too, the People Problem Problem. Personally I think McGregor’s Theory X and Theory Y could be a candidate, which raises the question of agile’s fellow travellers – beyond budgeting, systems thinking, Lean, and Mintzberg’s theory of emergent strategy.

I wondered if agile could be thought of as a brick wall, with each idea forming a brick, and then the whole being more than the sum of the parts. But that falls down (sorry for the pun!) on the layering problem. Which ideas are foundations and which decorative?

Similarly, I toyed with The House of Agile with different ideas represented by different rooms but that metaphor quickly runs into problems too.

In a way this makes the search for One Agile Mindset even more desirable – the search for a grand unified theory of everything, if you like. There must be something out there that combines all of this!

Yet the search for the unifying theory also highlights how damn difficult this is. Intellectually it is hard to accept that “the agile mindset” is a bunch of different ideas which different people interpret differently.

But you know what? Accepting that agile is diverse is itself agile – agile is not one idea, it is many; accepting that and valuing those different ideas means embracing diversity, and that itself is agile. Agile is what you want it to be because through those diverse ideas we find alternative ways of viewing and learning. Agile doesn’t stand still, agile is punk, at its best agile is democracy.

Unfortunately that makes it hard to explain.


Subscribe to my blog newsletter and download Continuous Digital for free

The post No unified theory of agile (Agile mindset cont.) appeared first on Allan Kelly.

Anthropological studies of software engineering

Anthropology is the study of humans, and as such it is the top level research domain for many of the human activities involved in software engineering. What has been discovered by the handful of anthropologists who have spent time researching the tiny percentage of humans involved in writing software?

A common ‘discovery’ is that developers don’t appear to be doing what academics in computing departments claim they do; hardly news to those working in industry.

The main subfields relevant to software are probably: cultural anthropology and social anthropology (in the US these are combined under the name sociocultural anthropology), plus linguistic anthropology (how language influences social life and shapes communication). There is also historical anthropology, which is technically what historians of computing do.

For convenience, I’m labelling anybody working in an area covered by anthropology as an anthropologist.

I don’t recommend reading any anthropology papers unless you plan to invest a lot of time in some subfield. While I have read lots of software engineering papers, anthropologists’ papers on this topic are often incomprehensible to me. These papers might best be described as anthropology speak interspersed with software related terms.

Anthropologists write books, and some of them are very readable to a more general audience.

The Art of Being Human: A Textbook for Cultural Anthropology by Wesch is a beginner’s introduction to its subject.

Ethnography, which explores cultural phenomena from the point of view of the subject of the study, is probably the most approachable anthropological research. Ethnographers spend many months living with a remote tribe, community, or nowadays a software development company, and then write up their findings in a thesis/report/book. Examples of approachable books include: “Engineering Culture: Control and Commitment in a High-Tech Corporation” by Kunda, who studied a large high-tech company in the mid-1980s; “No-Collar: The Humane Workplace and its Hidden Costs” by Ross, who studied an internet startup that had just IPO’ed, and “Coding Freedom: The Ethics and Aesthetics of Hacking” by Coleman, who studied hacker culture.

Linguistic anthropology is the field whose researchers are most likely to match developers’ preconceived ideas about what humanities academics talk about. If I had been educated in an environment where Greek and nineteenth century philosophers were the reference points for any discussion, then I too would use this existing skill set in my discussions of source code (philosophers of source code did not appear until the twentieth century). Who wouldn’t want to apply hermeneutics to the interpretation of source code (the field is known as Critical code studies)?

It does not help that the software knowledge of many of the academics appears to have been acquired by reading computer books from the 1940s and 1950s.

The most approachable linguistic anthropology book I have found, for developers, is: The Philosophy of Software Code and Mediation in the Digital Age by Berry (not that I have skimmed many).

Letter to Anneliese Dodds on the invasion of Ukraine by Russia

Dear Anneliese Dodds,

I learn from the BBC (https://www.bbc.co.uk/news/58888451) that "The UK is to phase out Russian oil by the end of the year" and "Russian imports account for 8% of total UK oil demand".

8% is a small amount and the end of the year is a long time in the future. We need immediate action to change Russia's course. Please use all your influence to this end.

Some suggestions, as a minimum:

  • Stop all petrochemical purchases from Russia, and require this of multinationals
  • Expulsion of all remaining Russian banks from SWIFT
  • Make it unlawful to insure a Russian enterprise
  • Seizure and forfeiture of all Russian assets within the UK and its dominions
  • Motion to remove Russia from the UN Security Council

There are many more things which could and should be done; otherwise, by January 2023 there will be no Ukraine to defend.

Yours sincerely,
Tim Pizey

Galactic North (a review)

Galactic North

Alastair Reynolds
ISBN-13: 978-0575083127

Galactic North is a group of short stories set in the Revelation Space universe, starting at its very beginning and stretching right to its end.


Great Wall of Mars


I reread the Great Wall of Mars after the Inhibitor Phase to remind me of some of Warren Clavain’s back story. It didn’t disappoint. I should have read Great Wall of Mars again before Inhibitor Phase, but hindsight is a wonderful thing. At least I’ve remembered why Nevil hated his brother and how he was betrayed by him, why Nevil defected to the Conjoiners and how Felka fits in. One small story explains so much of why things happened in several of the other stories, including Absolution Gap.


Glacial
 

Glacial adds little to the overall story, but does help to explain how the relationship between Clavain, Galiana and Felka develops and how it becomes so strong. Glacial is really an opportunity for Alastair Reynolds to explore the concept of a thinking, possibly sentient planet. He does this, as always, by hinting throughout at the bigger picture and keeping you reading.


A Spy in Europa

 
Not sure what I think of this one. It seemed a bit pointless: not very nice characters who all stabbed each other in the back. Some interesting science though, and it provided a backdrop and context for Grafenwalder's Bestiary.


Weather

 
What a fantastic standalone story this is with some great characters who demonstrate that not all Ultras are cut-throat. There’s lots more detail about conjoiners here and the secret of how C-drives are managed is revealed, but that’s not the darkest secret.


Dilation Sleep

 
I was disappointed in this story until I read the notes at the end and realised it was the first story written in the Revelation Space universe and that it introduced some key aspects, such as Chasm City. It doesn’t really add anything to the overall story, but has some interesting insights into reefersleep.
 

Grafenwalder's Bestiary

 
Some of the best stories are those which are difficult to read due to the behaviour of some of the characters. When they do things you can’t understand the motivation for and could not imagine doing yourself. In this story it’s cruelty, deception and revenge and I loved it.


Nightingale

I do wonder how Alastair Reynolds thinks up these horrors, but they are glorious. This story is particularly horrible at the end. The evil computer was far worse than anything in the Resident Evil series, with undertones of Hal 9000. There’s exploration in the story, battle, weapons and the sort of intrigue which makes it difficult to put down. 


Galactic North 

This should be expanded to a novel, or at least a novella. There’s scope for so much evolution, especially with the greenfly and how they come to take over. I couldn’t put this down, and wouldn’t have put it down at all if my Kindle hadn’t died a few pages before the end!

In some ways it’s a shame that Alastair Reynolds has put a hard limit on the timeline of Revelation Space, but I loved it! I reread Galactic North to understand the comments at the end of Inhibitor Phase and the Nest Builders. I should have read it first. And I should have a Revelation Space timeline on my wall.


Study of developers for the cost of a phase I clinical drug trial

For many years now, I have been telling people that software researchers need to be more ambitious and apply for multi-million pound/dollar grants to run experiments in software engineering. After all, NASA spends a billion or so sending a probe to take some snaps of a planet and astronomers lobby for $100million funding for a new telescope.

What kind of experimental study might be run for a few million pounds (e.g., the cost of a Phase I clinical drug trial)?

Let’s say that each experiment involves a team of professional developers implementing a software system; call this a Project. We want the Project to be long enough to be realistic, say a week.

Different people exhibit different performance characteristics, and the experimental technique used to handle this is to have multiple teams independently implement the same software system. How many teams are needed? Fifteen ought to be enough, but more is better.

Different software systems contain different components that make implementation easier/harder for those involved. To remove single system bias, a variety of software systems need to be used as Projects. Fifteen distinct Projects would be great, but perhaps we can get away with five.

How many developers are on a team? Agile task estimation data shows that most teams are small, i.e., mostly single person, with two and three people teams making up almost all the rest.

If we have five teams of one person, five of two people, and five of three people, then there are 15 teams and 30 people.

How many people will be needed over all Projects?

15 teams (30 people) each implementing one Project
 5 Projects, which will require 5*30=150 people (5*15=75 teams)

How many person days are likely to be needed?

If a 3-person team takes a week (5 days), a 2-person team will take perhaps 7-8 days. A 1-person team might take 9-10 days.

The 15 teams will consume 5*3*5+5*2*7+5*1*9=190 person days
The  5 Projects will consume              5*190=950 person days

How much is this likely to cost?

The current average daily rate for a contractor in the UK is around £500, giving an expected cost of 950*500=£475,000 to hire the experimental subjects. Venue hire is around £40K (we want members of each team to be co-located).

The above analysis involves subjects implementing one Project. If, say, each subject implements two, three or four Projects, one after the other, the cost is around £2million, i.e., the cost of a Phase I clinical drug trial.
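
A rough reconstruction of these costings, using the rates and day counts assumed above (a sketch, not a budget):

# Costings from the text: 500 GBP/day contractor rate, 40K venue hire, 190 person days for
# 15 teams each implementing one Project, five Projects in total. Assumptions, not data.
day_rate = 500
venue = 40_000

person_days_one_project = 5*3*5 + 5*2*7 + 5*1*9         # 15 teams, one Project: 190
person_days_all_projects = 5 * person_days_one_project  # 5 Projects: 950

print(person_days_all_projects * day_rate)              # 475,000 GBP to hire the subjects
print(4 * person_days_all_projects * day_rate + venue)  # ~1.9 million if each subject does four Projects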

What might we learn from having subjects implement multiple Projects?

Team performance depends on the knowledge and skill of its members, and their ability to work together. Data from these experiments would be the first of their kind, and would provide realistic guidance on performance factors such as: impact of team size; impact of practice; impact of prior experience working together; impact of existing Project experience. The multiple implementations created of the same Project provide a foundation for measuring expected reliability, and for theories of N-version programming.

A team of 1 developer will take longer to implement a Project than a team of 2, who will take longer than a team of 3.

If 20 working days is taken as the ballpark period over which a group of subjects are hired (i.e., a month), there are six team size sequences that one subject could work (A to F below); where individual elapsed time is close to 20 days (team size 1 is 10 days elapsed, team size 2 is 7.5 days, team size 3 is 5 days).

Team size    A      B      C      D      E      F
    1      twice   once   once  
    2                     once  thrice  once
    3             twice                twice   four

The cost of hiring subjects+venue+equipment+support for such a study is likely to be at least £1,900,000.

If the cost of beta testing, venue hire and research assistants (needed during experimental runs) is included, the cost is close to £2.75 million.

Might it be cheaper and simpler to hire, say, 20-30 staff from a medium size development company? I chose a medium-sized company because we would be able to exert some influence over developer selection and keeping the same developers involved. The profit from 20-30 people for a month is not enough to create much influence within a large company, and a small company would not want to dedicate a large percentage of its staff for a solid month.

Beta testing is needed to validate both the specifications for each Project and that it is possible to schedule individuals to work in a sequence of teams over a month (individual variations in performance create a scheduling nightmare).

On A Generally Fractal Family – student

Recently, my fellow students and I have been caught up in the craze that is sweeping through the users of Professor B------'s clockwork calculating engine; namely the charting of sets of two dimensional points that have fractal planar boundaries, being those that in some sense have a fractional dimension. Of particular interest have been the results of repeated applications of quadratic functions to complex numbers; specifically in measuring how quickly, if at all, they escape a region surrounding the starting point, by which charts may be constructed that many of the collegiate consider so delightful as to constitute art painted by mathematics itself!
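
For the curious, here is a minimal escape-time sketch, in Python rather than on the Professor's clockwork engine, of the kind of charting described; the grid, iteration limit and character ramp are arbitrary choices of mine:

# Escape-time charting: repeatedly apply z -> z*z + c and count how quickly points escape
# a radius-2 circle. Points that never escape are painted darkest.
def escape_count(c: complex, max_iter: int = 50, radius: float = 2.0) -> int:
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > radius:
            return i                      # escaped after i iterations
    return max_iter                       # treated as never escaping

for im in range(12, -13, -2):             # a crude ASCII chart of the complex plane
    row = ""
    for re in range(-40, 21):
        n = escape_count(complex(re / 20, im / 10))
        row += " .:-=+*#%@"[min(n, 9)]
    print(row)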

User Stories by Example tutorials now Free

My User Stories by Example tutorial series is now free.

There are five tutorials which include video lectures, worked examples on real user stories, and exercises covering user story basics, acceptance criteria, story splitting, story refactoring and more.

The series is based on my Little Book of Requirements and User Stories – the audio files for the book are there too but you will have to pay for them. If you are an Audible subscriber you can get the book there as part of your subscription. Print book, eBook and audio book are all available at Amazon.


Subscribe and download Continuous Digital for free

The post User Stories by Example tutorials now Free appeared first on Allan Kelly.

Growth in FLOPS used to train ML models

AI (a.k.a. machine learning) is a compute intensive activity, with the performance of trained models being dependent on the quantity of compute used to train the model.

Given the ongoing history of continually increasing compute power, what is the maximum compute power that might be available to train ML models in the coming years?

How might the compute resources used to train an ML model be measured?
One obvious answer is to specify the computers used and the number of days they were occupied training the model. The problem with this approach is that the differences between the computers used can be substantial. How is compute power measured in other domains?

Supercomputers are ranked using FLOPS (floating-point operations per second), or GigaFLOPS or PetaFLOPS (10^{15}). The Top500 list gives values for R_{max} (based on benchmark performance, i.e., LINPACK) and R_{peak} (what the hardware is theoretically capable of, which is sometimes more than twice R_{max}).

A ballpark approach to measuring the FLOPS consumed by an application is to estimate the FLOPS consumed by the computers involved and multiply by the number of seconds each computer was involved in training. The huge assumption made with this calculation is that the application actually consumes all the FLOPS that the hardware is capable of supplying. In some cases this appears to be the metric used to estimate the compute resources used to train an ML model. Some published papers just list a FLOPS value, while others list the number of GPUs used (e.g., 2,128).

A few papers attempt a more refined approach. For instance, the paper describing the GPT-3 models derives its FLOPS values from quantities such as the number of parameters in each model and number of training tokens used. Presumably, the research group built a calibration model that provided the information needed to estimate FLOPS in this way.
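
To make the two approaches concrete, here is a back-of-the-envelope sketch; the 6*parameters*tokens rule of thumb and the ~300 billion token count are my own assumptions (commonly used for GPT-3-style models, but not taken from the paper discussed in this post), and the hardware numbers are invented:

# Model-side estimate: a widely used rule of thumb is FLOPs ~= 6 * parameters * tokens.
parameters = 175e9                        # GPT-3 175B
tokens = 300e9                            # assumed training token count
flops = 6 * parameters * tokens
print(f"{flops / (1e15 * 86400):,.0f} PetaFLOP days")    # ~3,650, close to the 3,640 quoted below

# Hardware-side ballpark: GPUs * peak FLOPS * seconds * an assumed utilisation factor.
gpus, peak_flops, days, utilisation = 1024, 125e12, 15, 0.3   # all invented values
hardware_flops = gpus * peak_flops * utilisation * days * 86400
print(f"{hardware_flops / (1e15 * 86400):,.0f} PetaFLOP days")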

How does one get to be able to use PetaFLOPS of compute to train a model (training the GPT-3 175B model consumed 3,640 PetaFLOP days, or around a few days on a top 8 supercomputer)?

Pay what it costs. Money buys cloud compute or bespoke supercomputers (which are more cost-effective for large scale tasks, if you have around £100million to spend plus £10million or so for the annual electricity bill). While the amount paid to train a model might have lots of practical value (e.g., can I afford to train such a model), researchers might not be keen to let everybody know how much they spent. For instance, if a research team have a deal with a major cloud provider to soak up any unused capacity, those involved probably have no interest in calculating compute cost.

How has the compute power used to train ML models increased over time? A recent paper includes data on the training of 493 models, of which 129 include estimated FLOPS, and 106 contain date and model parameter data. The data comes from published papers, and there are many thousands of papers that train ML models. The authors used various notability criteria to select papers, and my take on the selection is that it represents the high-end of compute resources used over time (which is what I’m interested in). While they did a great job of extracting data, there is no real analysis (apart from fitting equations).

The plot below shows the FLOPS training budget used/claimed/estimated for ML models described in papers published on given dates; lines are fitted regression models, and the colors are explained below (code+data):

FLOPS consumed training ML models over time.

My interpretation of the data is based on the economics of accessing compute resources. I see three periods of development:

  1. do-it-yourself (18 data points): During this period most model builders only had access to a university computer, desktop machines, or a compute cluster they had self-built,
  2. cloud (74 data points): Huge on demand compute resources are now just a credit card away. Researchers no longer have to wait for congested university computers to become available, or build their own systems.

    AWS launched in 2006, and the above plot shows a distinct increase in compute resources around 2008.

  3. bespoke (14 data points): if the ML training budget is large enough, it becomes cost-effective to build a bespoke system, e.g., a supercomputer. As well as being more cost-effective, a bespoke system can also be specifically designed to handle the characteristics of the kinds of applications run.

    How might models trained using a bespoke system be distinguished from those trained using cloud compute? The plot below shows the number of parameters in each trained model, over time, and there is a distinct gap between 10^{10} and 10^{11} parameters, which I assume is the result of bespoke systems having the memory capacity to handle more parameters (code+data):

    Number of parameters in ML models over time.

The rise in FLOPS growth rate during the Cloud period comes from several sources: 1) the exponential decline in the prices charged by providers delivers researchers exponentially more compute for the same price, 2) researchers obtaining larger grants to work on what is considered to be an important topic, 3) researchers doing deals with providers to make use of excess capacity.

The rate of growth of Cloud usage is capped by the cost of building a bespoke system. The future growth of Cloud training FLOPS will be constrained by the rate at which the prices charged for a FLOP decreases (grants are unlikely to continually increase substantially).

The rate of growth of the Top500 list is probably a good indicator of the rate of growth of bespoke system performance (and this does appear to be slowing down). Perhaps specialist ML training chips will provide performance that exceeds that of the GPU chips currently being used.

The maximum compute that can be used by an application is set by the reliability of the hardware and the percentage of resources used to recover from hard errors that occur during a calculation. Supercomputer users have been facing the possibility of hitting the wall of maximum compute for over a decade. ML training is still a minnow in the supercomputer world, where calculations run for months, rather than a few days.

Migrating source code from RCS to Mercurial

Version control system migrations are a fact of life for developers in any longer lived codebase. In fact, I’ve had a hand in quite a few migrations as newer, more workable version control systems became available. Also, like a lot of developers, I’ve got fragments of source code dating back quite some years floating around on various servers and development machines of mine. Not necessarily code that is still being used, but still code that I don’t want to just delete forever.

The Agile Elephant and the agile mindset

African Elephant

Confession: I’ve been avoiding the words “agile mindset” for some time because I don’t know what it is. And, completely by coincidence, I’ve recently had a couple of encounters that have caused me to think again. So let me explain…

I repeatedly find myself wrestling with the question “What is agile?” The question came up recently in a new form when I was invited to give a talk on “The Agile Mindset.” I appealed for help on LinkedIn. I got some great answers and the diversity of answers confirmed what I thought: it is hard to describe “the agile mindset” in a short or generally agreed form.

The first problem is that to explain “the Agile mindset” one first has to agree what agile is, and is not. I have my own view but I know there is a diversity of opinion so I find it useful to describe “Agile” with the story of the blind men examining an elephant: one feels the leg and says “This is a mighty tree”, another feels the tusk and says “It is a strong sword”, another the trunk and says “It is a strong snake” and so on. Each interprets the part they encounter as the whole yet the whole, to one who has never seen an elephant, can be hard to comprehend.

Illustration from the Natural History Museum, London

The same is true for agile.

The literalist looks in the dictionary and says “Agile is about being fast, reactive and responding to the outside”, the engineer looks at agile and says “It is about doing quality work so we may deliver more”, the Scrum aficionado says “It is about high performing teams and alignment”, the Lean thinker says “It is about reducing work in progress and simplifying workflows” and the management consultant says “It is about delivering more with less.”

All are right, none is wrong. And while that is a problem in describing what agile is, it is also a strength. Agile is multi-faceted and offers “something for everyone.” While different people emphasise different things, it also means the whole is more than the sum of the parts. If you can harness high performing teams, with engineering quality, low WIP and reactive processes then you can deliver the fabled faster, better, cheaper.

But that also makes it hard.

It also goes some way to explaining why “Agile Coaches” never agree: each has their own interpretation of how to put those pieces together to make the whole – to change metaphor, everyone approaches the jigsaw differently.

And again that is right because every jigsaw, every application of agile, exists in a unique context and must be faced on its own terms – to quote Tolstoy: “All happy families are the same, all unhappy families are unhappy in their own unique way.” (And long time readers might notice I just contradicted myself.)

And one important reason why the jigsaw is always different is: in completing the last jigsaw, and since completing it, you, and everyone else, have learned; the bodies may be the same but the people – and their minds – are different.

Ultimately, I still claim “Agile” is learning, specifically organizational learning: the thesis I laid out over 10 years ago in my first book, Changing Software Development.

Hence I say: The only thing you can do wrong in agile is work the same as you did three months ago. To be agile one should always be learning and changing as a result of that learning.

I should explain that some more in another post, and I’ll have more to say about the agile mindset soon.


Subscribe to my blog newsletter and download Continuous Digital for free

The post The Agile Elephant and the agile mindset appeared first on Allan Kelly.

Comparison of Matrix events before and after “Extensible Events”

(Background: Matrix is the awesome open standard for messaging that I get to work on now that I work at Element.)

The Extensible Events (MSC1767) Matrix Spec Change proposal describes a new way of structuring events in matrix that makes it easy to send events that have multiple representations (e.g. something clever like an interactive map, and something simpler like an image of a map).

The main purpose of the change is to make it easy for clients that don’t support some amazing new feature to display something that is still useful.

Since there is an implementation of this change out in the wild (in Element), it seems reasonably likely that this change will be accepted into the Matrix spec.

I really like this change, but I find it hard to understand, so here is a simple example that I have found helpful to think it through.

An old event, and a new event

Here is an old-fashioned event, followed by a new, shiny, extensible version:

{
    "type": "m.room.message",
    "content": {
        "body": "This is the *old* way",
        "format": "org.matrix.custom.html",
        "formatted_body": "This is the <b>old</b> way",
        "msgtype": "m.text"
    },
    ... other properties not relevant to this, e.g. "sender" ...
}
{
    "type": "m.message",
    "content": {
        "m.message": [
            {"mimetype": "text/plain", "body": "This is the *new* way"},
            {"mimetype": "text/html", "body": "This is the <b>new</b> way"}
        ]
    },
    ... other properties not relevant to this, e.g. "sender" ...
}

Notice that in the new extensible events, the property within content is the same as the message type (here: m.message).

The point is that as well as the primary event type (here, m.message) we can include other representations of the same message, such as an image, location co-ordinates, or something completely different. The client will render the primary event type if it understands it (and is able to show it), but if not, it can look for other types that it does understand.

For example, in Polls when you send a new poll question, it could look like this:

{
    "type": "m.poll.start",
    "content": {
        "m.poll.start": {
            ... The actual poll question etc. ...
        },
        "m.message": [
            ... A text version of the question ...
        ]
    },
    ... other properties not relevant to this, e.g. "sender" ...
}

So clients that don’t know m.poll.start can still display the poll question (if they understand extensible events), instead of completely ignoring event types they don’t know about.

An abbreviated form of the new event

Of course, life is not quite as simple as that.

Because this is a lot of typing:

{
    "type": "m.message",
    "content": {
        "m.message": [
            {"mimetype": "text/plain", "body": "This is the *new* way"},
            {"mimetype": "text/html", "body": "This is the <b>new</b> way"}
        ]
    },
    ... other properties not relevant to this, e.g. "sender" ...
}

We have an abbreviated form:

{
    "type": "m.message",
    "content": {
        "m.text": "This is the *new* way",
        "m.html": "This is the <b>new</b> way"
    },
    ... other properties not relevant to this, e.g. "sender" ...
}

These two are exactly equivalent.

m.text is an abbreviation for an m.message containing an entry with "mimetype": "text/plain" and the relevant body. Similarly, m.html is an abbreviation for an m.message containing an entry with "mimetype": "text/html" and the relevant body. If you declare both, they effectively get squashed together into one m.message with both entries.

Those 2 are the only abbreviations listed, so they are special cases.
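
A small sketch of how a client might normalise the two shapes, expanding the abbreviated keys into the full m.message list; this is my own illustration, not code from the MSC or Element:

# Expand the abbreviated m.text / m.html keys into the full m.message list form, so a
# client can treat both shapes the same way.
def expand_message(content: dict) -> list:
    if "m.message" in content:
        return content["m.message"]
    parts = []
    if "m.text" in content:
        parts.append({"mimetype": "text/plain", "body": content["m.text"]})
    if "m.html" in content:
        parts.append({"mimetype": "text/html", "body": content["m.html"]})
    return parts

print(expand_message({"m.text": "This is the *new* way",
                      "m.html": "This is the <b>new</b> way"}))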

Backwards compatibility

Of course, life is way more complicated than that, so what we’re likely to see in the wild, if/when this gets widely adopted, is some kind of mashed-together event like this:

{
    "type": "m.room.message",
    "content": {
        "msgtype": "m.text",
        "body": "Hello World",
        "format": "org.matrix.custom.html",
        "formatted_body": "<b>Hello</b> World",
        "m.text": "Hello World",
        "m.html": "<b>Hello</b> World"
    }
}

Note that the type here is m.room.message, where extensible events says it should be m.message. The idea is that an extensible-events-aware client will see "msgtype": "m.text" and know to look for m.message as the primary type. (This is further complicated here by the fact that there isn’t actually an m.message property – this is because m.text and m.html are abbreviated forms of it.)

Also, clients that want to display old events will need to preserve their code that parses the old event types in perpetuity.
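
Putting the fallback idea into code, here is a sketch of how a client might pick something it can render; the function and the set of understood types are my own invention, not Element's implementation:

# Pick the best representation this client understands; fall back to plain text, then to
# the old-style body, otherwise give up.
def best_representation(content: dict, understood: set):
    for event_type in ("m.poll.start", "m.message"):      # most specific first
        if event_type in content and event_type in understood:
            return event_type, content[event_type]
    if "m.text" in content:                                # abbreviated extensible form
        return "m.message", [{"mimetype": "text/plain", "body": content["m.text"]}]
    if "body" in content:                                  # old-style m.room.message content
        return "m.room.message", content["body"]
    return None, None

mashed = {"msgtype": "m.text", "body": "Hello World", "m.text": "Hello World"}
print(best_representation(mashed, understood={"m.message"}))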

Cost-effectiveness decision for fixing a known coding mistake

If a mistake is spotted in the source code of a shipping software system, is it more cost-effective to fix the mistake, or to wait for a customer to report a fault whose root cause turns out to be that particular coding mistake?

The naive answer is don’t wait for a customer fault report, based on the following simplistic argument: C_{fix} < C_{find}+C_{fix}.

where: C_{fix} is the cost of fixing the mistake in the code (including testing etc), and C_{find} is the cost of finding the mistake in the code based on a customer fault report (i.e., the sum on the right is the total cost of fixing a fault reported by a customer).

If the mistake is spotted in the code for ‘free’, then C_{find}==0, e.g., a developer reading the code for another reason, or flagged by a static analysis tool.

This answer is naive because it fails to take into account the possibility that the code containing the mistake is deleted/modified before any customers experience a fault caused by the mistake; let M_{gone} be the likelihood that the coding mistake ceases to exist in the next unit of time.

The more often the software is used, the more likely a fault experience based on the coding mistake occurs; let F_{experience} be the likelihood that a fault is reported in the next time unit.

A more realistic analysis takes into account both the likelihood of the coding mistake disappearing and a corresponding fault being reported, modifying the relationship to: C_{fix} < (C_{find}+C_{fix})*{F_{experience}/M_{gone}}

Software systems are eventually retired from service; the likelihood that the software is maintained during the next unit of time, S_{maintained}, is slightly less than one.

Giving the relationship: C_{fix} < (C_{find}+C_{fix})*{F_{experience}/M_{gone}}*S_{maintained}

which simplifies to: 1 < (C_{find}/C_{fix}+1)*{F_{experience}/M_{gone}}*S_{maintained}

What is the likely range of values for the ratio: C_{find}/C_{fix}?

I have no find/fix cost data, although detailed total time is available, i.e., find+fix time (with time probably being a good proxy for cost). My personal experience of find often taking a lot longer than fix probably suffers from survival of memorable cases; I can think of cases where the opposite was true.

The two values in the ratio F_{experience}/M_{gone} are likely to change as a system evolves, e.g., high code turnover during early releases that slows as the system matures. The value of F_{experience} should decrease over time, but increase with a large influx of new users.

A study by Di Penta, Cerulo and Aversano investigated the lifetime of coding mistakes (detected by several tools), tracking them over three years from creation to possible removal (either fixed because of a fault report, or simply a change to the code).

Of the 2,388 coding mistakes detected in code developed over 3-years, 41 were removed as reported faults and 416 disappeared through changes to the code: F_{experience}/M_{gone} = 41/416 = 0.1

The plot below shows the survival curve for memory related coding mistakes detected in Samba, based on reported faults (red) and all other changes to the code (blue/green, code+data):

Survival curves of coding mistakes in Samba.

Coding mistakes are obviously being removed much more rapidly due to changes to the source, compared to customer fault reports.

For it to be cost-effective to fix coding mistakes in Samba, flagged by the tools used in this study (S_{maintained} is essentially one), requires: 10 < C_{find}/C_{fix}+1.

Meeting this requirement does not look that implausible to me, but obviously data is needed.
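
Plugging the numbers above into the final inequality makes the break-even point concrete (a sketch, using the study values):

# The final inequality, with the Samba numbers substituted in.
f_over_m = 41 / 416        # F_experience / M_gone from the study above (~0.1)
s_maintained = 1.0         # Samba is still being maintained

def fix_now_pays_off(find_over_fix: float) -> bool:
    return 1 < (find_over_fix + 1) * f_over_m * s_maintained

print(fix_now_pays_off(5))     # False
print(fix_now_pays_off(12))    # True: finding needs to cost roughly 9-10 times as much as fixing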

Time For A Chi Test – a.k.

A few months ago we explored the chi-squared distribution which describes the properties of sums of squares of standard normally distributed random variables, being those that have means of zero and standard deviations of one.
Whilst I'm very much of the opinion that statistical distributions are worth describing in their own right, the chi-squared distribution plays a pivotal role in testing whether or not the categories into which a set of observations of some variable quantity fall are consistent with assumptions about the expected numbers in each category, which we shall take a look at in this post.
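
As a minimal illustration of the kind of test being described, here is a chi-squared goodness-of-fit check using scipy, with made-up category counts (my example, not taken from the post):

# Chi-squared goodness-of-fit: are the observed category counts consistent with the expected ones?
from scipy import stats

observed = [18, 22, 31, 29]    # made-up counts in four categories
expected = [25, 25, 25, 25]    # counts expected under the assumed model

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")    # a small p suggests the counts don't fit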

Software engineering research is a field of dots

Software engineering research is a field of dots; people are fully focused on publishing papers about their chosen tiny little subject.

Where are the books joining the dots into even a vague outline?

Several software researchers have told me that writing books is not a worthwhile investment of their time, i.e., the number of citations they are likely to attract makes writing papers the only cost-effective medium (books containing an edited collection of papers continue to be published).

Butterfly collecting has become the method of study for many researchers. The butterflies in question often being Github repos that are collected together, based on some ‘interestingness’ metric, and then compared and contrasted in a conference paper.

The dots being collected are influenced by the problems that granting agencies consider to be important topics to fund (picking a research problem that will attract funding is a major consideration for any researcher). Fake research is one consequence of incentivizing people to use particular techniques in their research.

Whatever you think the aims of research in software engineering might be, funding the random collecting of dots does not seem like an effective strategy.

Perhaps it is just a matter of waiting for the field to grow up. Evidence-based software engineering research is still a teenager, and the novelty of butterfly collecting has yet to wear off.

My study of particular kinds of dots did not reveal many higher level patterns, although a number of folk theories were shown to be unfounded.

A review: Inhibitor Phase by Alastair Reynolds

Inhibitor Phase
by Alastair Reynolds
 
ISBN-13: 978-0316462761

* * * Warning Spoilers * * *


To say I was excited at the prospect of another core Revelation Space novel, more than a decade since Absolution Gap, wouldn’t come close. In preparation I reread Absolution Gap and loved it on the second reading.

I wasn’t inspired by the description of the Miguel character hiding from the Wolves on an unknown planet, but it turns out this was just a minor distraction at the beginning, and the Inhibitor Phase plays a major part in advancing the story. The scope and breadth, as you would expect from Alastair Reynolds, are vast and intricate.

I was a little disappointed that the characters were ping-ponging between some of the same old worlds, Ararat and Yellowstone, and by the evolution of some of the survivors from Redemption Ark into Merpeople, but this didn’t detract in any way. It either wasn’t clear or I missed what happened to Ana Khouri - maybe she’s still on Hela. It was sad, but probably necessary, to see the end of the Nostalgia for Infinity. I also missed how, following her death on Mars, Glass and Warren had re-encountered each other and swum with the Pattern Jugglers prior to Sun Hollow.

Overall I loved this story. I literally could not put it down! Alastair Reynolds is the master of descriptive exploration and constantly hints at more facets to the story that I just have to know about, and have to keep reading for! The Inhibitor Phase does of course leave questions unanswered and sets up the next story, which I cannot wait for either!
 

How can I pin dependent packages when using use-package?

I’ve been trying to up my use-package game recently and converted my hand rolled package check and installer to use-package. I usually prefer to use packages from melpa-stable, so I pin the default package source used by use-package to melpa-stable and override it where necessary. That’s working well in general and looks something like this:

(setq use-package-always-pin "melpa-stable")

(use-package js2-mode
  :ensure t
  :defer t
  :custom
  (js-indent-level 2)
  (js2-include-node-externs t))

(use-package kotlin-mode
  :ensure t
  :pin melpa)

So in other words, if I’m on a machine that doesn’t have js2-mode and kotlin-mode installed, use-package will install js2-mode from melpa-stable and kotlin-mode from melpa.

Estimation experiments: specification wording is mostly irrelevant

Existing software effort estimation datasets provide information about estimates made within particular development environments and with particular aims. Experiments provide a mechanism for obtaining information about estimates made under conditions of the experimenters’ choice, at least in theory.

Writing the code is sometimes the least time-consuming part of implementing a requirement. At hackathons, my default estimate for almost any non-trivial requirement is a couple of hours, because my implementation strategy is to find the relevant library or package and write some glue code around it. In a heavily bureaucratic organization, the coding time might be a rounding error in the time taken up by meetings, documentation and testing; so a couple of months would be considered normal.

If we concentrate on the time taken to implement the requirements in code, then estimation time and implementation time will depend on prior experience. I know that I can implement a lexer for a programming language in half-a-day, because I have done it so many times before; other people take a lot longer because they have not had the amount of practice I have had on this one task. I’m sure there are lots of tasks that would take me many days, but there is somebody who can implement them in half-a-day (because they have had lots of practice).

Given the possibility of a large variation in actual implementation times, large variations in estimates should not be surprising. Does the possibility of large variability in subject responses mean that estimation experiments have little value?

I think that estimation experiments can provide interesting information, as long as we drop the pretence that the answers given by subjects have any causal connection to the wording that appears in the task specifications they are asked to estimate.

If we assume that none of the subjects is sufficiently expert in any of the experimental tasks specified to realistically give a plausible answer, then answers must be driven by non-specification issues, e.g., the answer the client wants to hear, a value that is defensible, a round number.

A study by Lucas Gren and Richard Berntsson Svensson asked subjects to estimate the total implementation time of a list of tasks. I usually ignore software engineering experiments that use student subjects (this study eventually included professional developers), but treating the experiment as one involving social processes, rather than technical software know-how, makes subject software experience a lot less relevant.

Assume, dear reader, that you took part in this experiment, saw a list of requirements that sounded plausible, and were then asked to estimate implementation time in weeks. What estimate would you give? I would have thrown my hands up in frustration and might have answered 0.1 weeks (i.e., a few hours). I expected the most common answer to be 4 weeks (the number of weeks in a month), but it turned out to be 5 (a very ‘attractive’ round number), for student subjects (code+data).

The professional subjects appeared to be from large organizations, who I assume are used to implementations including plenty of bureaucratic stuff, as well as coding. The task specification did not include enough detailed information to create an accurate estimate, so subjects either assumed their own work environment or played along with the fresh-faced, keen experimenter (sorry Lucas). The professionals showed greater agreement, in that the range of values given was not as wide as that of the students, but it had a more uniform distribution (with maximums, rather than peaks, at 4 and 7); see below. I suspect that answers at the high end were from managers and designers with minimal coding experience.

Why did the experimenters choose weeks as the unit of estimation? Perhaps they thought this expressed a reasonable implementation time (it probably is if it’s not possible to use somebody else’s library/package). I think that they could have chosen day units and gotten essentially the same results (at least for student subjects). If they had chosen hours as the estimation unit, the spread of answers would have been wider, and I’m not sure whether to bet on 7 (hours in a working day) or 10 being the most common choice.

Fitting a regression model to the student data shows estimates increasing by 0.4 weeks per year of degree progression. I was initially baffled by this, and then I realized that more experienced students expect to be given tougher problems to solve, i.e., this increase is based on self-image (code+data).
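
The study analysis links to its code+data. As a rough Python sketch of that kind of fit (the file name and column names below are hypothetical, purely for illustration):

import pandas as pd
import statsmodels.formula.api as smf

subjects = pd.read_csv("student_estimates.csv")   # hypothetical file of per-subject responses
fit = smf.ols("estimate_weeks ~ years_of_degree", data=subjects).fit()
print(fit.params)   # a slope of ~0.4 would match the reported increase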

The stated hypothesis investigated by the study involved none of the above. Rather, the intent was to measure the impact of obsolete requirements on estimates. Subjects were randomly divided into three groups, with each seeing and estimating one specification. One specification contained four tasks (A), one contained five tasks (B), and one contained the same tasks as (A) plus an additional task followed by the sentence: “Please note that R5 should NOT be implemented” (C).

A regression model shows that, for students and professionals, the estimate for (A) is about 1-2 weeks lower than the estimate for (B), while (A) estimates are 3-5 weeks lower than (C) estimates.

What are subjects to make of an experimental situation where the specification includes a task that they are explicitly told to ignore?

How would you react? My first thought was that the ignore R5 sentence was itself ignored, either accidentally or on purpose. But my main thought is that Relevance theory is a complicated subject, and we are a very long way away from applying it to estimation experiments containing supposedly redundant information.

The plot below shows the number of subjects making a given estimate, in days; exp0to2 were student subjects (dashed line joins estimates that include a half-hour value, solid line whole hour), exp3 MSc students, and exp4 professional developers (code+data):

Number of subjects making a given estimate.

I hope that the authors of this study run more experiments, ideally working on the assumption that there is no connection between specification and estimate (apart from trivial examples).

Pitfall – baron m.

Greetings Sir R-----! Come warm yourself by the hearth and take a dram of scotch!

Would you care for a wager to fire up your blood?

Stout fellow!

I propose a game that puts me in mind of an ill-fated caving expedition that I undertook some several years ago.

Another quick Isso setup tweak

While I was implementing a few more changes on my web server - mostly adding the sorely needed blacklistd configuration for sshd - I noticed that NGINX’s log was showing occasional errors when trying to contact the Isso process. They all had one thing in common, namely that they were all trying to contact Isso via IPv6, as the server has both stacks enabled. It turns out that Isso only listens on an IPv4 socket, and I could not find an obvious way to get it to listen on both.

Visual Lint 8.0.8.351 has been released

Visual Lint 8.0.8.351 has now been released.

This is a maintenance update for Visual Lint 8.0, and includes the following changes:

  • Fixed a bug in the handling of preprocessor symbol properties in Visual Studio projects.

  • VisualLintGui will now open files dropped on its main window.

  • Updated the PC-lint Plus compiler indirect files co-rb-vs2019.lnt and co-rb-vs2022.lnt to filter out errors in <xutility> when analysing some Visual Studio 2019 and 2022 projects.

  • Updated the PC-lint Plus compiler indirect file co-rb-vs2022.lnt to support Visual Studio 2022 v17.0.5.

  • Updated the PC-lint Plus compiler indirect file co-rb-vs2019.lnt to support Visual Studio 2019 v16.11.9.

  • Minor updates to the online help.

Download Visual Lint 8.0.8.351

semgrep: the future of static analysis tools

When searching for a pattern that might be present in source code contained in multiple files, what is the best tool to use?

The obvious answer is grep, and grep is great for character-based pattern searches. But patterns that are token based, or include information on language semantics, fall outside grep’s model of pattern recognition (which does not stop people trying to cobble something together, perhaps with the help of complicated sed scripts).

Those searching source code written in C have the luxury of being able to use Coccinelle, an industrial strength C language aware pattern matching tool. It is widely used by the Linux kernel maintainers and people researching complicated source code patterns.

Over the 15+ years that Coccinelle has been available, there has been a lot of talk about supporting other languages, but nothing ever materialized.

About six months ago, I noticed semgrep and thought it interesting enough to add to my list of tool bookmarks. Then, a few days ago, I read a brief blog post that was interesting enough for me to check out other posts at that site, and this one by Yoann Padioleau really caught my attention. Yoann worked on Coccinelle (we had an interesting email exchange some 13 years ago, when I was analyzing if-statement usage), had subsequently worked on various static analysis tools, and is now working on semgrep. Most static analysis tools are created by somebody spending a year or so working on the implementation, making all the usual mistakes, before abandoning it to go off and do other things. High quality tools come from people with experience, who have invested lots of time learning their trade.

The documentation contains lots of examples, and working on the assumption that things would be a lot like using Coccinelle, I jumped straight in.

The pattern I chose to search for, using semgrep, involved counting the number of clauses contained in Python if-statement conditionals, e.g., the condition in: if a==1 and b==2: contains two clauses (i.e., a==1, b==2). My interest in this usage comes from ideas about if-statement nesting depth and clause complexity. The intended use case of semgrep is security researchers checking for vulnerabilities in code, but I’m sure those developing it are happy for source code researchers to use it.

As always, I first tried building the source from the GitHub repo (note: the Makefile expects a git clone install, not an unzipped directory), but got fed up with having to incrementally discover and install lots of dependencies (like Coccinelle, the code is written in OCaml {93k+ lines} and Python {13k+ lines}). I joined the unwashed masses and used pip install.

The pattern rules have a yaml structure, specifying the rule name, language(s), message to output when a match is found, and the pattern to search for.

After sorting out various finger problems (writing C rather than Python) and misunderstanding the semgrep output (some of which feels like internal developer output, rather than output aimed at tool users), I had a set of working patterns.

The following two patterns match if-statements containing a single clause (if.subexpr-1), and two clauses (if.subexpr-2). The option commutative_boolop is set to true to allow the matching process to treat Python’s or/and as commutative, which they are not, but it reduces the number of rules that need to be written to handle all the cases when ordering of these operators is not relevant (rules+test).

rules:
- id: if.subexpr-1
  languages: [python]
  message: if-cond1
  patterns:
   - pattern: |
      if $COND1:  # we found an if statement
         $BODY
   - pattern-not: |
      if $COND2 or $COND3: # must not contain more than one condition
         $BODY
   - pattern-not: |
      if $COND2 and $COND3:
         $BODY
  severity: INFO

- id: if.subexpr-2
  languages: [python]
  options:
   commutative_boolop: true # Reduce combinatorial explosion of rules
  message: if-cond2
  pattern-either:
   - patterns:
      - pattern: |
         if $COND1 or $COND2: # if statement containing two conditions
            $BODY
      - pattern-not: |
         if $COND3 or $COND4 or $COND5: # must not contain more than two conditions
            $BODY
      - pattern-not: |
         if $COND3 or $COND4 and $COND5:
            $BODY
   - patterns:
      - pattern: |
         if $COND1 and $COND2:
            $BODY
      - pattern-not: |
         if $COND3 and $COND4 and $COND5:
            $BODY
      - pattern-not: |
         if $COND3 and $COND4 or $COND5:
            $BODY
  severity: INFO
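
As an illustration (not part of the original rules+test), the following small Python file contains one if-statement that if.subexpr-1 should flag and one that if.subexpr-2 should flag:

x = 1
y = 2

if x == 1:               # one clause: reported with the message if-cond1
    print("one clause")

if x == 1 and y == 2:    # two clauses: reported with the message if-cond2
    print("two clauses")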

The rules would be simpler if it were possible for a pattern to not be applied to code that earlier matched another pattern (in my example, one containing more clauses). This functionality is supported by Coccinelle, and I’m sure it will eventually appear in semgrep.

This tool has lots of rough edges, and is still rapidly evolving; I’m using version 0.82, released four days ago. What’s exciting is the support for multiple languages (ten are listed, with experimental support for twelve more, and three in beta). Roughly what happens is that source code is mapped to an abstract syntax tree that is common to all supported languages, which is then pattern matched. Supporting a new language involves writing code to perform the mapping to this common AST.

It’s not too difficult to map different languages to a common AST that contains just tokens, e.g., identifiers and their spelling, literals and their value, and keywords. Many languages use the same operator precedence and associativity as C, plus their own extras, and they tend to share the same kinds of statements; however, declarations can be very diverse, which makes life difficult for supporting a generic AST.

An awful lot of useful things can be done with a tool that is aware of expression/statement syntax and matches at the token level. More refined semantic information (e.g., a variable’s type) can be added in later versions. The extent to which an investment is made to support the various subtleties of a particular language will depend on its economic importance to those involved in supporting semgrep (Return to Corp is a VC backed company).

Outside of a few languages that have established tools doing deep semantic analysis (i.e., C and C++), semgrep has the potential to become the go-to static analysis tool for source code. It will benefit from the network effects of contributions from lots of people each working in one or more languages, taking their semgrep skills and rules from one project to another (with source code language ceasing to be a major issue). Developers using niche languages with poor or no static analysis tool support will add semgrep support for their language because it will be the lowest cost path to accessing an industrial strength tool.

How are the VC backers going to make money from funding the semgrep team? The traditional financial exit for static analysis companies is selling to a much larger company. Why would a large company buy them, when they could just fork the code (other company sales have involved closed-source tools)? Perhaps those involved think they can make money by selling services (assuming semgrep becomes the go-to tool). I have a terrible track record for making business predictions, so I will stick to the technical stuff.

The difficulties of cascading OKRs

I almost despair when I hear people advocate cascading OKRs: the idea that someone, some team, some central planning department, can set OKRs which then flow down the organization, with each “lower” group implementing some small part of some “higher” ask. What could be more waterfall-like?

I admit, when I started working with OKRs I kind-of expected to be shown the OKRs of the “above” before my team wrote theirs. But the more I thought about it, the more I realised that if you did do it that way then it is decidedly unAgile. How can a team be really autonomous, self-organising and self-managing if they have goals handed down to them?

There was a point when I was wracked with self-doubt: am I interpreting OKRs differently to the rest of the world? How do I reconcile agile and cascading OKRs? What am I missing? – but, when you look around, I am not the only one. In fact, if you read, watch and listen to OKR commentators, the majority agree with me: the teams delivering OKRs need the latitude to set their own OKRs.

Reconciling OKRs with agile is far from the biggest problem. In fact there are, at least, two bigger problems; one concerns team motivation. Can a team ever be motivated to do something they have no say in? Perhaps some can; I can’t, and I know others who can’t. At the very least team members need to be asked.

Motivation becomes especially problematic if you want OKRs to be stretching. If you set someone a stretching goal and ask them to hit it without involving them then don’t be surprised if they shrug their shoulders.

Still, we haven’t got to the biggest problem.

The biggest problem with OKRs is not the metaphysical issues of motivation and whether one is truly agile or not. The biggest difficulty is simply: cascading OKRs are not practical.

First think about the timetable.

If every team is waiting for the team above them to issue OKRs before they set their own then you have a delay built into the system. And the more levels of hierarchy you have the greater the delay is going to be.

For example, suppose you have an executive team, a middle management team and several delivery teams. Then each cycle the exec team need to set some OKRs; once they have set theirs, the middle management can set theirs, and then the delivery teams can set theirs. At each cascade point there needs to be communication, and each point creates the possibility of misunderstanding and mistakes.

Setting OKRs isn’t instantaneous; I think you need about a week to have a think, reflect overnight, and iterate once or twice, but, if you are well practised and don’t hit any delays, you might do it in two days. Either way it is going to take at least a week, and possibly three, to get all three layers set. And if anyone runs late then it has a knock-on effect.

I’ve heard it said that the Key Results of higher levels become the objectives of the next layer down. The key results of this layer then become the objectives of the one below them. But that assumes that the OKRs themselves are a series of “items to do” and that each objective is made up of several pieces which are themselves things to do.

Sure, it sometimes happens that way. I may even have been guilty of interpreting them that way sometimes. But these days I see Key Results not as small pieces of work which, lego style, build into a bigger objective but as Acceptance Criteria: the parameters which the outcome needs to satisfy.

Now to some degree acceptance criteria can be translated into work items to do, and vice versa, but not always. Consider this:

Objective: Improve overnight batch processing to save 10% of work processing costs
Key result #1: Shorten batch processing time by 1 hour so staff do not need to wait for run to complete in the morning
Key result #2: Reduce false positive alerts by 100 per day so that staff waste less time

Now these key results could be packaged as individual pieces of work to do, but perhaps they are the same piece of work. Perhaps a database upgrade could address both issues in one go. Which path you take is a design decision.

Seeing key results as acceptance criteria changes them from work to do into bounding conditions.

In Succeeding with OKRs in Agile I advise against having domino key results: don’t set key results so that failing to hit one makes others impossible to hit. So, for example, if the DB upgrade had been added to that previous example as key result #1 then the team would have been committed to doing it. And if the upgrade had failed then the other key results would have been lost. Leaving it out gives the team the decision on how to proceed: the people doing the work decide the best way of meeting the objective.

That advice is given within teams, but it also applies between teams. If the middle management team require three lesser teams to deliver work to build their own objective then, if any one team fails, the middle management team will not only miss one key result but will therefore miss their objective.

Done like this, the OKRs become fragile and a dependency nightmare. That will have two effects. First, more time will be needed when setting OKRs to identify and mitigate the dependencies, and then more time will be needed to manage those dependencies. Progress will only occur at the speed of the slowest.

Second, these problems will encourage people to play it safe and not set stretching and ambitious OKRs. Predictability and safety will be prioritised.

Now if we take the alternative approach and each team sets its OKRs independently then the time lag is removed, teams set OKRs in parallel and if someone is late it doesn’t matter. Dependencies may still exist but they have not been baked into the OKRs so teams can put effort into removing dependencies (reducing coupling and increasing cohesion) rather than putting that energy into managing the dependencies.

So, while we might argue about whether OKRs should, or should not, cascade down; and while we might argue about the psychological effects of being given an OKR by another, simply remember: cascading OKRs mean setting OKRs is going to be more complicated and take longer.

Photo by Alexander Hipp on Unsplash




Finding patterns in construction project drawing creation dates

I took part in Projecting Success’s 13th hackathon last Thursday and Friday, at CodeNode (host to many weekend hackathons and meetups); around 200 people turned up for the first day. Team Designing-Success included Imogen, Ryan, Dillan, Mo, Zeshan (all building construction domain experts) and yours truly (a data analysis monkey who knows nothing about construction).

One of the challenges came with lots of real multi-million pound building construction project data (two csv files containing 60K+ rows and one containing 15K+ rows), provided by SISK. The data contained information on project construction drawings and RFIs (request for information) from 97 projects.

The construction industry is years ahead of the software industry in terms of collecting data, in that lots of companies actually collect data (for some, accumulate might be a better description) rather than not collecting/accumulating data. While they have data, they don’t seem to be making good use of it (so I am told).

Nearly all the discussions I have had with domain experts about the patterns found in their data have been iterative, brief email exchanges, sometimes running over many months. In this hack, everybody involved is sitting around the same table for two days, i.e., the conversation is happening in real-time and there is a cut-off time for delivery of results.

I got the impression that my fellow team-mates were new to this kind of data analysis, which is my usual experience when discussing patterns recently found in data. My standard approach is to start highlighting visual patterns present in the data (e.g., plot foo against bar), and hope that somebody says “That’s interesting” or suggests potentially more interesting items to plot.

After several dead-end iterations (i.e., plots that failed to invoke a “that’s interesting” response), drawings created per day against project duration (as a percentage of known duration) turned out to be of great interest to the domain experts.

Building construction uses a waterfall process; all the drawings (i.e., a kind of detailed requirements) are supposed to be created at the beginning of the project.

Hmm, many individual project drawing plots were showing quite a few drawings being created close to the end of the project. How could this be? It turns out that there are lots of different reasons for creating a drawing (74 reasons in the data), and that it is to be expected that some kinds of drawings are likely to be created late in the day, e.g., specific landscaping details. The 74 reasons were mapped to three drawing categories (As built, Construction, and Design Development), then project drawings were recounted and plotted in three colors (see below).

The domain experts (i.e., everybody except me) enjoyed themselves interpreting these plots. I nodded sagely, and occasionally blew my cover by asking about an acronym that everybody in construction obviously knew.

The project meta-data includes a measure of project performance (a value between one and five, derived from profitability and other confidential values) and type of business contract (a value between one and four). The data from the 97 projects was combined by performance and contract to give 20 aggregated plots. The evolution of the number of drawings created per day might vary by contract, and the hypothesis was that projects at different performance levels would exhibit undesirable patterns in the evolution of the number of drawings created.

The plots below contain patterns in the quantity of drawings created by percentage of project completion, that are: (left) considered a good project for contract type 1 (level 5 are best performing projects), and (right) considered a bad project for contract type 1 (level 1 is the worst performing project). Contact the domain experts for details (code+data):

Number of drawings created at percentage project completion times.

The path to the above plot is a common one: discover an interesting pattern in data, notice that something does not look right, use domain knowledge to refine the data analysis (e.g., kinds of drawing or contract), rinse and repeat.

My particular interest is using data to understand software engineering processes. How do these patterns in construction drawings compare with patterns in the software project equivalents, e.g., detailed requirements?

I am not aware of any detailed public data on requirements produced using a waterfall process. So the answer is, I don’t know; but the rationales I heard for the various kinds of drawings sound as-if they would have equivalents in the software requirements world.

What about the other data provided by the challenge sponsor?

I plotted various quantities for the RFI data, but there wasn’t any “that’s interesting” response from the domain experts. Perhaps the genius behind the plot ideas will be recognized later, or perhaps one of the domain experts will suddenly realize what patterns should be present in RFI data on high performance projects (nobody is allowed to consider the possibility that the data has no practical use). It can take time for the consequences of data analysis to sink in, or for new ideas to surface, which is why I am happy for analysis conversations to stretch out over time. Our presentation deck included some RFI plots because there was RFI data in the challenge.

What is the software equivalent of construction RFIs? Perhaps issues in a tracking system, or Jira tickets? I did not think to talk more about RFIs with the domain experts.

How did team Designing-Success do?

In most hackathons, the teams that stay the course present at the end of the hack. For these ProjectHacks, the submission deadline is the following day; the judging is all done later, electronically, based on the submitted slide deck and video presentation. The end of this hack was something of an anti-climax.

Did team Designing-Success discover anything of practical use?

I think that finding patterns in the drawing data converted the domain experts from a theoretical to a practical understanding that it was possible to extract interesting patterns from construction data. They each said that they planned to attend the next hack (in about four months), and I suggested that they try to bring some of their own data.

Can these drawing creation patterns be used to help monitor project performance, as it progresses? The domain experts thought so. I suspect that the users of these patterns will be those not closely associated with a project (those close to a project are usually well aware of the fact that things are not going well).

A Clash of Kings a Review


A Clash of Kings: Book 2 (A Song of Ice and Fire)

George R.R. Martin

ISBN-13 ‏ : ‎ 978-0007447831

I loved the Game of Thrones TV series, even the way the final series ends, although I’m not sure I would have chosen the eventual king. I was looking forward to reading the books and understanding the stories in more depth and, to an extent, that was the case. More so with the first book than the second.

A Clash of Kings just has too much irrelevant detail and quickly becomes laborious to read. A part which stands out is after one of the battles where there are many pages given over to a list of knights who were awarded honours. The vast majority were in no way relevant to the story and just prolonged getting to the end. Fortunately the last 3% (I was reading on kindle) was given over to an appendix so I was able to skim that.

There were a number of key events from the TV series, not least of which was Bronn lighting the wildfire with an arrow, which I was looking out for and which were disappointingly missing. Of course the book is the original and these events were invented for the TV series, but still.

I’m told the books get better from the third one onwards, so once I’ve got through Inhibitor Phase, Dune and one or two others I’ll be back. I’ve started now, so I need to finish.

A Jolly Student’s Tea Party – a.k.

Last time we took a look at the chi-squared distribution which describes the behaviour of sums of squares of standard normally distributed random variables, having means of zero and standard deviations of one.
Tangentially related is Student's t-distribution which governs the deviation of means of sets of independent observations of a normally distributed random variable from its known true mean, which we shall examine in this post.

Class war in the modern workplace

Writing about “The knowledge worker” in The Age of Discontinuity (1968) Peter Drucker offers this insight: the knowledge worker sees themselves as a skilled professional, or “white collar” to use an old term. To this end programmers, marketeers and conference producers see themselves as akin to doctors and lawyers.

But, Drucker says, these professionals are still employees and are seen by their employers as “workers” more akin to the “blue collar” factory employees of years gone by. I’ve long found it ironic that many contractors like to see themselves as entrepreneurs and small businesses while their clients may well see them more as casual day-labourers who can be hired and disposed of with little thought.

Quote from Peter Drucker’s book The Age of Discontinuity.

Look at it like this: once upon a time the majority of employees would be working on the factory floor, and their hands would get dirty. The work of the manager was to optimise both the work they were doing and the way the work was done; employees were probably not encouraged to think too much.

Today is the age of mass knowledge work – at least in developed western countries. The majority of employees type at keyboards and spend all day talking to others about ideas. Their hands stay clean.

Look at the qualifications people hold: in the past blue-collar workers might have a skill, but many were “unskilled” or “semi-skilled”. A worker’s “trade” might be learned via an apprenticeship which involved observing and doing the work. Today we live in the age of mass knowledge work; the bulk of the workforce are not factory workers or dock labourers but educated analysts, programmers, accountants and such.

Modern blue-collar workers are overwhelmingly degree educated and do expect to have a say in their work – both what is done and how it is done.

When I visit clients I sometimes see – or rather hear – this explicitly when managerial types talk about “the factory floor” or “engine room” when they mean the offices where IT staff work. You see it too when teams are treated as “feature factories” and measured on how many “user stories” or “story points” they complete and how fast the burn-down chart burns down.

There is a mismatch in the way workers view themselves and the way the managers view the workers. This mismatch in views becomes a misunderstanding and can escalate into conflict.

Workers advocate replacing management with “self managing teams” and managers look to replace programmers with automatic programming, “programming through pictures” or lately “low code” solutions. Rather than resolving the conflict it becomes an existential fight. Sometimes it can even look like a modern class struggle between workers and employers.

There are no silver bullets for this conflict. Both sides need to respect one another, and we need to find a new understanding of roles.

One of my favourite takes on this dilemma comes from Tim O’Reilly. In an essay entitled “Managing the Bots That Are Managing the Business” (2016) he suggests that the true workers of today are the machines: it is the factory robots that assemble cars, take our online orders and increasingly do all the “heavy lifting”. The managers of these machines are the people who instruct the machines what to do and how to do it. Most obviously programmers, but also the many workers who use digital technology to instruct a machine, e.g. a marketeer who schedules tweets, or a customer service agent who scripts the chat-bot.

Either way you look at it, the fact is that modern blue collar workers need more of the skills and knowledge traditionally reserved for managers: they need to understand the profit and loss, plus they need to understand business goals and strategy and appreciate the consequences of their actions. And it means the white-collar managers of the managers need to respect the knowledge workers as peers rather than hired help; they need to explain business goals and strategy, and be open about the business.

It sometimes feels as if the “class war” of the early twentieth century is still with us. Only now it is not blue collar workers fighting white collar but white collar workers fighting between themselves.




Moore’s law was a socially constructed project

Moore’s law was a socially constructed project that depended on the coordinated actions of many independent companies and groups of individuals to last for as long as it did.

All products evolve, but what was it about Moore’s law that enabled microelectronics to evolve so much faster and for longer than most other products?

Moore’s observation, made in 1965 based on four data points, was that the number of components contained in a fabricated silicon device doubles every year. The paper didn’t make this claim in words, but a line fitted to four yearly data points (starting in 1962) suggested this behavior continuing into the mid-1970s. The introduction of IBM’s Personal Computer, in 1981 containing Intel’s 8088 processor, led to interested parties coming together to create a hugely profitable ecosystem that depended on the continuance of Moore’s law.

The plot below shows Moore’s four points (red) and fitted regression model (green line). In practice, since 1970, fitting a regression model (purple line) to the number of transistors in various microprocessors (blue/green, data from Wikipedia), finds that the number of transistors doubled every two years (code+data):

Transistors contained in a device over time, plus Moore's original four data-points.
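
The fit links to its code+data. As a rough sketch of the doubling-time calculation in Python (the CSV file and column names below are hypothetical, standing in for the Wikipedia transistor counts):

import numpy as np
import pandas as pd

cpus = pd.read_csv("transistor_counts.csv")    # hypothetical file: columns year, transistors
slope, intercept = np.polyfit(cpus.year, np.log2(cpus.transistors), 1)
print(f"doubling time: {1/slope:.1f} years")   # around 2 for post-1970 devices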

In the early days, designing a device was mostly a manual operation; that is, the circuit design and logic design down to the transistor level were hand-drawn. This meant that creating a device containing twice as many transistors required twice as many engineers. At some point the doubling process either becomes uneconomic or it takes forever to get anything done because of the coordination effort.

The problem of needing an exponentially-growing number of engineers was solved by creating electronic design automation tools (EDA), starting in the 1980s, with successive generations of tools handling ever higher levels of abstraction, and human designers focusing on the upper levels.

The use of EDA provides a benefit to manufacturers (who can design differentiated products) and to customers (e.g., products containing more functionality).

If EDA had not solved the problem of exponential growth in engineers, Moore’s law would have maxed-out in the early 1980s, with around 150K transistors per device. However, this would not have stopped the ongoing shrinking of transistors; two economic factors independently incentivize the creation of ever smaller transistors.

When wafer fabrication technology improvements make it possible to double the number of transistors on a silicon wafer, then around twice as many devices can be produced (assuming unchanged number of transistors per device, and other technical details). The wafer fabrication cost is greater (second row in table below), but a lot less than twice as much, so the manufacturing cost per device is much lower (third row in table).

The doubling of transistors primarily provides a manufacturer benefit.

The following table gives estimates for various chip foundry economic factors, in dollars (taken from the report: AI Chips: What They Are and Why They Matter). Node, expressed in nanometers, used to directly correspond to the length of a particular feature created during the fabrication process; these days it does not correspond to the size of any specific feature and is essentially just a name applied to a particular generation of chips.

Node (nm)                       90      65     40     28      20    16/12     10       7       5
Foundry sale price per wafer  1,650   1,937  2,274  2,891   3,677   3,984   5,992   9,346  16,988
Foundry sale price per chip   2,433   1,428    713    453     399     331     274     233     238
Mass production year          2004    2006   2009   2011    2014    2015    2017    2018   2020
Quarter                        Q4      Q4     Q1     Q4      Q3      Q3      Q2      Q3     Q1
Capital investment per wafer  4,649   5,456  6,404  8,144  10,356  11,220  13,169  14,267  16,746
processed per year
Capital consumed per wafer      411     483    567    721     917     993   1,494   2,330   4,235
processed in 2020
Other costs and markup        1,293   1,454  1,707  2,171   2,760   2,990   4,498   7,016  12,753
per wafer

The second economic factor incentivizing the creation of smaller transistors is Dennard scaling, a rarely heard technical term named after the first author of a 1974 paper showing that transistor power consumption scaled with area (for very small transistors). Halving the area occupied by a transistor, halves the power consumed, at the same frequency.

The maximum clock-frequency of a microprocessor is limited by the amount of heat it can dissipate; the heat produced is proportional to the power consumed, which is approximately proportional to the clock-frequency. Instead of a device with smaller transistors consuming less power, it could consume the same power at double the frequency.

Dennard scaling primarily provides a customer benefit.

Figuring out how to further shrink the size of transistors requires an investment in research, followed by designing/(building or purchasing) new equipment. Why would a company, who had invested in researching and building their current manufacturing capability, be willing to invest in making it obsolete?

The fear of losing market share is a commercial imperative experienced by all leading companies. In the microprocessor market, the first company to halve the size of a transistor would be able to produce twice as many microprocessors (at a lower cost) running twice as fast as the existing products. They could (and did) charge more for the latest, faster product, even though it cost them less than the previous version to manufacture.

Building cheaper, faster products is a means to an end; that end is receiving a decent return on the investment made. How large is the market for new microprocessors and how large an investment is required to build the next generation of products?

Rock’s law says that the cost of a chip fabrication plant doubles every four years (the per wafer price in the table above is increasing at a slower rate). Gambling hundreds of millions of dollars, later billions of dollars, on a next generation fabrication plant has always been a high risk/high reward investment.

The sales of microprocessors are dependent on the sale of computers that contain them, and people buy computers to enable them to use software. Microprocessor manufacturers thus have to both convince computer manufacturers to use their chip (without breaking antitrust laws) and convince software companies to create products that run on a particular processor.

The introduction of the IBM PC kick-started the personal computer market, with Wintel (the partnership between Microsoft and Intel) dominating software developer and end-user mindshare of the PC compatible market (in no small part due to the billions these two companies spent on advertising).

An effective technique for increasing the volume of microprocessors sold is to shorten the usable lifetime of the computer potential customers currently own. Customers buy computers to run software, and when new versions of software can only effectively be used in a computer containing more memory or on a new microprocessor which supports functionality not supported by earlier processors, then a new computer is needed. By obsoleting older products soon after newer products become available, companies are able to evolve an existing customer base to one where the new product is looked upon as the norm. Customers are force marched into the future.

The plot below shows sales volume, in gigabytes, of various sized DRAM chips over time. The simple story of exponential growth in sales volume (plus signs) hides the more complicated story of the rise and fall of succeeding generations of memory chips (code+data):

Sales volume, in gigabytes, of various sized DRAM chips over time.

The Red Queens had a simple task: keep buying the latest products. The activities of the companies supplying the specialist equipment needed to build a chip fabrication plant have to be coordinated, a role filled by the International Technology Roadmap for Semiconductors (ITRS). The annual ITRS reports contain detailed specifications of the expected performance of the subsystems involved in the fabrication process.

Moore’s law is now dead, in that transistor doubling now takes longer than two years. Would transistor doubling time have taken longer than two years, or slowed down earlier, if:

  • the ecosystem had not been dominated by two symbiotic companies (or did network effects make it inevitable that there would be two symbiotic companies),
  • the Internet had happened at a different time,
  • software applications had quickly reached a good enough state,
  • cloud computing had gone mainstream much earlier?

Focus is not divisible so limit your OKRs

From time to time I hear about teams who have 8, 9, 10 or more OKRs in a quarter. That is just plain wrong. In Succeeding with OKRs in Agile I suggest 3 objectives per quarter, each with 3 key results. When I hear the cries of pain and people twist my arm, I compromise on 4 objectives and about 4 key results.

Now those numbers are MAXIMUMs, I’d really like fewer, and I’ve heard of teams which have just 1 – yes ONE – objective per quarter. I’m itching to try that with a team.

Sometimes people respond and say: “Arhh, but we have a big team, I agree with 3 being the right number for a team of six but we have a team of 16 so surely we could have more objectives?”

But actually, when you have a bigger team you have a bigger problem and hence even more reason to limit the number of OKRs.

Part of the power of OKRs is that they create and maintain focus. Having agreed and stated outcomes to work towards gives individuals something to focus on; it gives team members – and particularly product owners – a reason to say No when more work appears. It keeps the team honest when looking at what needs doing and deciding how to spend their time.

New options to learn about OKRs and Agile

Focus is not divisible – divide your focus and you no longer have focus. When you have a bigger team you have more need for focus, not less. One could even argue that as the team grows the number of OKRs should reduce, not increase.

Bigger teams, because there are more people, struggle more with focus than small teams. On a small team the lack of capacity forces trade-offs and brings people face-to-face with limited capacity. On a big team it’s easy to think one or two people can go and do something different, or even for individuals to hide.

By the way, this applies equally if you extend the OKR cycle: setting OKRs every six months rather than every three should be a reason to reduce the number of OKRs rather than increase them.

Once upon a time I worked with a team that had real focus problems: team members found little overlap in their work. Consequently there were seven or eight OKRs each month. That was itself information: when you looked at the OKRs they were disjoint; the team was not focusing because it had three – or four – very different work streams, and the people on the team had different skills.

The solution was to split the team into three mini-teams, each with their own OKRs. One could argue that the full team got more OKRs, but what happened was that each mini-team could now work towards its own goal with focus, with less distraction and greater purpose.

This keeps things simple – the Rule of Three! – and keeps things focused.




Photo by David Travis on Unsplash


OKRs workshop, tutorials and free stuff

Two opportunities to learn more about OKRs. Both based on Succeeding with OKRs in Agile.

Implementing OKRs in Agile

24 February: 1-day online workshop, hosted by iLean in Belgium and open to all.

Combining OKR and Agile

Online tutorial series – this is a mix of free and paid-for material.

You can buy the tutorials individually or as a bundle. Subscribing to the bundle is much cheaper and gives access to new tutorials as I add them. My plan is to add one new tutorial each month.

Use the code blogreader to get 20% off the paid elements.


Including natural language text topics in a regression model

The implementation records for a project sometimes include a brief description of each task implemented. There will be some degree of similarity between the implementation of some tasks. Is it possible to calculate the degree of similarity between tasks from the text in the task descriptions?

Over the years, various approaches to measuring document similarity have been proposed (more than you probably want to know about natural language processing).

One of the oldest, simplest and most widely used techniques is term frequency–inverse document frequency (tf-idf), which is based on counting word frequencies, i.e., word context is ignored. This technique can work well when there are a sufficient number of words to ensure a good enough overlap between similar documents.
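
As a minimal illustration (the analysis later in this post uses R packages; the document strings here are made up), pairwise tf-idf document similarity can be calculated with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["add user login page", "fix login page layout", "update billing report"]
tfidf = TfidfVectorizer().fit_transform(docs)   # word-frequency based document vectors
print(cosine_similarity(tfidf))                 # pairwise document similarity matrix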

When the description consists of a sentence or two (i.e., a summary), the problem becomes one of sentence similarity, not document similarity (so tf-idf is unlikely to be of any use).

Word context, in a sentence, underpins the word embedding approach, which represents a word by an n-dimensional vector calculated from the local sentence context in which the word occurs (derived from a large amount of text). Words that are closer, in this vector space, are expected to have similar meanings. One technique for calculating the similarity between sentences is to compare the averages of the word embedding of the words they contain. However, care is needed; words appearing in the same context can create sentences having different meanings, as in the following (calculated sentence similarity in the comments):

import spacy
nlp=spacy.load("en_core_web_md") # _md model needed for word vectors
nlp("the screen is black").similarity(nlp("the screen is white"))
# 0.9768339369182919  # closer to 1 the more similar the sentences
nlp("implementing widgets would be little effort").similarity(nlp("implementing widgets would be a huge effort"))
# 0.9636533803238744
nlp("the screen is black").similarity(nlp("implementing widgets would be a huge effort"))
# 0.6596892830922606

The first pair of sentences are similar in that they are about the characteristics of an object (i.e., its colour), while the second pair are similar in that they are about the quantity of something (i.e., implementation effort), and the third pair are not that similar.

The words in a document, or summary, are about some collection of topics. A set of related documents are likely to contain a discussion of a set of related topics in varying degrees. Latent Dirichlet allocation (LDA) is a widely used technique for calculating a set of (unseen) topics from a set of documents and their contained words.
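
As a minimal sketch of the idea (again using scikit-learn with made-up summaries; the analysis below uses R packages):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["fix login page bug", "add login audit report", "update report layout"]
counts = CountVectorizer().fit_transform(docs)        # word counts per document
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))                          # per-document topic fractions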

A recent paper attempted to estimate task effort based on the similarity of the task descriptions (using tf-idf). My last semi-serious attempt to extract useful information from text, some years ago, was a miserable failure (it’s a very hard problem). Perhaps better techniques and tools are now available for me to leverage (my interest is in understanding what is going on, not making predictions).

My initial idea was to extract topics from task data, and then try to add these to regression models of task effort estimation, to see what impact they had. Searching to find out what researchers have recently been doing in this area, I was pleased to see that others were ahead of me, and had implemented R packages to do the heavy lifting, in particular:

  • The stm package supports the creation of Structural Topic Models; these add support for covariates to influence the process of fitting LDA models, i.e., a correlation between the topics and other variables in the data. Uses of STM appear to be oriented towards teasing out differences in topics associated with different values of some variable (e.g., political party), and the package authors have written papers analysing political data.
  • The psychtm package supports what the authors call supervised latent Dirichlet allocation with covariates (SLDAX). This handles all the details needed to include the extracted LDA topics in a regression model; exactly what I was after. The user interface and documentation for this package is not as polished as the stm package, but the code held together as I fumbled my way through.

To experiment using these two packages I used the SiP dataset, which includes summary text for each task, and I have previously analysed the estimation task data.

The stm package:

The textProcessor function handles all the details of converting a vector of strings (e.g., summary text) to internal form (i.e., handling conversion to lower case, removing stop words, stemming, etc).

One of the input variables to the LDA process is the number of topics to use. Picking this value is something of a black art, and various functions are available for calculating and displaying concepts such as topic semantic coherence and exclusivity, the most commonly used words associated with a topic, and the documents in which these topics occur. Deciding the extent to which 10 or 15 topics produced the best results (values that sounded like a good idea to me) required domain knowledge that I did not have. The plot below shows the extent to which the words in topic 5 were associated with the Category column having the value “Development” or “Management” (code+data):

Distribution of words contained in topics associated with Development and Management.

The psychtm package:

The prep_docs function is not as polished as the equivalent stm function, but the package’s first release was just last year.

After the data has been prepared, the call to fit a regression model that includes the LDA extracted topics is straightforward:

sip_topic_mod=gibbs_sldax(log(HoursActual) ~ log(HoursEstimate), data = cl_info,
                         docs = docs_vocab$documents, model = "sldax",
                         K = 10) # number of topics

where: log(HoursActual) ~ log(HoursEstimate) is the simplest model fitted in the original analysis.

The fitted model had the form: HoursActual approx HoursEstimate^{0.81} e^{0.13 topic_1} e^{0.18 topic_2}..., with the calculated coefficient for some topics not being significant. The value 0.81 is close to that fitted in the original model. The value of topic_i is the fraction of the topic_i calculated to be present in the Summary text of the corresponding task.
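
As a small worked example (the numbers below, apart from the quoted coefficients, are made up), plugging an estimate and two topic fractions into this form:

import math

hours_estimate = 10.0
topic_fraction = {1: 0.3, 2: 0.1}   # illustrative fractions of topics 1 and 2 in the Summary
hours_predicted = (hours_estimate ** 0.81
                   * math.exp(0.13 * topic_fraction[1])
                   * math.exp(0.18 * topic_fraction[2]))
print(hours_predicted)   # roughly 6.8 hours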

I’m pleased to see that a regression model can be improved by adding topics derived from the Summary text.

The SiP data includes other information such as work Category (e.g., development, management), ProjectCode and DeveloperId. It is to be expected that these factors will have some impact on the words appearing in a task Summary, and hence the topics (the stm analysis showed this effect for Category).

When the model formula is changed to: log(HoursActual) ~ log(HoursEstimate)+ProjectCode, the quality of fit for most topics became very poor. Is this because ProjectCode and topics conveyed very similar information, or did I need to be more sophisticated when extracting topic models? This needs further investigation.

Can topic models be used to build prediction models?

Summary text can only be used to make predictions if it is available before the event being predicted, e.g., available before a task is completed and the actual effort is known. My interest in model building is to understand the processes involved, so I am not worried about when the text was created.

My own habit is to update, or even create Summary text once a task is complete. I asked Stephen Cullen, my co-author on the original analysis and author of many of the Summary texts, about the process of creating the SiP Summary sentences. His reply was that the Summary field was an active document that was updated over time. I suspect the same is true for many task descriptions.

Not all estimation data includes as much information as the SiP dataset. If Summary text is one of the few pieces of information available, it may be possible to use it as a proxy for missing columns.

Perhaps it is possible to extract information from the SiP Summary text that is not also contained in the other recorded information. Having been successful this far, I will continue to investigate.

On A Day At The Races – student

Most recently the Baron challenged Sir R----- to a race of knights around the perimeter of a chessboard, with the Baron starting upon the lower right hand square and Sir R----- upon the lower left. The chase proceeded anticlockwise with the Baron moving four squares at each turn and Sir R----- by the roll of a die. Costing Sir R----- one cent to play, his goal was to catch or overtake the Baron before he reached the first rank for which he would receive a prize of forty one cents for each square that the Baron still had to traverse before reaching it.

Tracking software evolution via its Changelog

Software that is used evolves. How fast does software evolve, e.g., how much new functionality is added and how much existing functionality is updated?

A new software release is often accompanied by a changelog which lists new, changed and deleted functionality. When software is developed using a continuous release process, the changelog can be very fine-grained.

The changelog for the Beeminder app contains 3,829 entries, almost one per day since February 2011 (around 180 entries are not present in the log I downloaded, whose last entry is numbered 4012).

Is it possible to use the information contained in the Beeminder changelog to estimate the rate of growth of functionality of Beeminder over time?

My thinking is driven by patterns in a plot of the Renzo Pomodoro dataset. Renzo assigned a tag-name (sometimes two) to each task, which classified the work involved, e.g., @planning. The following plot shows the date of use of each tag-name, over time (ordered vertically by first use). The first and third black lines are fitted regression models of the form 1-e^{-K*days}, where: K is a constant and days is the number of days since the start of the interval fitted; the second (middle) black line is a fitted straight line.

at-words usage, by date.
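
The original fits link to their code+data. As an aside, a minimal Python sketch of fitting the 1-e^{-K*days} form, using made-up values purely for illustration, would be:

import numpy as np
from scipy.optimize import curve_fit

days = np.array([0, 30, 90, 180, 365], dtype=float)        # illustrative values
frac_tags_seen = np.array([0.0, 0.35, 0.66, 0.85, 0.97])   # illustrative values
K, = curve_fit(lambda d, K: 1 - np.exp(-K * d), days, frac_tags_seen, p0=[0.01])[0]
print(K)   # the fitted growth-rate constant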

How might a changelog line describing a day’s change be distilled to a much shorter description (effectively a tag-name), with very similar changes mapping to the same description?

Named-entity recognition seemed like a good place to start my search, and my natural language text processing tool of choice is still spaCy (which continues to get better and better).

spaCy is Python based and the processing pipeline could have all been written in Python. However, I’m much more fluent in awk for data processing, and R for plotting, so Python was just used for the language processing.

The following shows some Beeminder changelog lines after stripping out urls and formatting characters:

Cheapo bug fix for erroneous quoting of number of safety buffer days for weight loss graphs.
Bugfix: Response emails were accidentally off the past couple days; fixed now. Thanks to user bmndr.com/laur  for alerting us!  
More useful subject lines in the response emails, like "wrong lane!" or whatnot.
Clearer/conciser stats at bottom of graph pages. (Will take effect when you enter your next datapoint.) Progress, rate, lane, delta.  
Better handling of significant digits when displaying numbers. Cf stackoverflow.com/q/5208663

The code to extract and print the named-entities in each changelog line could not be simpler.

import spacy
import sys

nlp = spacy.load("en_core_web_sm") # load trained English pipelines

count=0 
        
for line in sys.stdin:
   count += 1 
   print(f'> {count}: {line}')
#
   doc=nlp(line) # do the heavy lifting
#          
   for ent in doc.ents:  # iterate over detected named-entities
      print(ent.lemma_, ent.label_)

To maximize the similarity between named-entities appearing on different lines the lemmas are printed, rather than original text (i.e., words appear in their base form).

The label_ specifies the kind of named-entity, e.g., person, organization, location, etc.

This code produced 2,225 unique named-entities (5,302 in total) from the Beeminder changelog (around 0.6 per day), and failed to return a named-entity for 33% of lines. I was somewhat optimistically hoping for a few hundred unique named-entities.

There are several problems with this simple implementation:

  • each line is considered in isolation,
  • the change log sometimes contains different names for the same entity, e.g., a person’s full name, Christian name, or twitter name,
  • what appear to be uninteresting named-entities, e.g., numbers and dates,
  • the language model does not know much about software, having been trained on a corpus of general English.

Handling multiple names for the same entity would be a lot of work (i.e., I did nothing); ‘uninteresting’ named-entities can be handled by post-processing the output, as sketched below.
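
A minimal sketch of that post-processing step, assuming the entity lines have been separated from the “> count:” header lines emitted by the script above:

import sys

# spaCy entity labels treated as uninteresting for this analysis
UNINTERESTING = {"DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "QUANTITY", "MONEY"}

for line in sys.stdin:
    lemma, _, label = line.rstrip("\n").rpartition(" ")   # lemma may contain spaces
    if lemma and label not in UNINTERESTING:
        print(lemma, label)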

A language processing pipeline that is not software-concept aware is of limited value. spaCy supports adding new training models, all I need is a named-entity model trained on manually annotated software engineering text.

The only decent NER training data I could find (trained on StackOverflow) was for BERT (another language processing tool), and the data format is very different. Existing add-on spaCy models included fashion, food and drugs, but no software engineering.

Time to roll up my sleeves and create a software engineering model. Luckily, I found a webpage that provided a good user interface to tagging sentences and generated the json file used for training. I was patient enough to tag 200 lines with what I considered to be software specific named-entities. … and now I have broken the NER model I built…

The following plot shows the growth in the total number of named-entities appearing in the changelog, and the number of unique named-entities (with the 1,996 numbers and dates removed; code+data):

Growth of total and unique named-entities in the Beeminder changelog.

The regression fits (red lines) are quadratics, slightly curving up (total) and down (unique); the linear growth components are: 0.6 per release for total, and 0.46 for unique.

Including software named-entities is likely to increase the total by at least 15%, but would have little impact on the number of unique entries.

This extraction pipeline processes one release line at a time. Building a set of Beeminder tag-names requires analysing the changelog as a whole, which would take a lot longer than the day spent on this analysis.

The Beeminder developers have consistently added new named-entities to the changelog over more than eleven years, but does this mean that more features have been consistently added to the software (or are they just inventing different names for similar functionality)?

It is not possible to answer this question without access to the code, or experience of using the product over these eleven years.

However, staying in business for eleven years is a good indicator that the developers are doing something right.