About Legion Runtime Class

These notes are closely based on the set of Legion Runtime Class videos produced by the Legion developers. They are my own notes and code walks, and any errors or things that are just plain wrong represent my own mistakes.

Today's notes are based on the following video:

Runtime Overview

Legion is separated into two main components: the high-level runtime (hlr) and the low-level runtime (llr). While the llr (Realm) is concerned with primitives that support execution of operations on a distributed machine, the hlr focuses on the secret Legion sauce.

From a code POV, lowlevel.h is the boundary between the HLR and LLR. The naming convention of files in the Legion repository generally follows the rule that files associated with LLR stuff contain the word lowlevel in the filename, with the exception being accessor and activemsg. When reading code, C++ namespaces are used to separate the hlr (LegionRuntime::HighLevel) from the llr (LegionRuntime::LowLevel).

There are currently two implementation of the llr. The first is the high-performance implementation (lowlevel.cc) that uses GASNet and supports GPUs and other HPC relevant things. The second is an implementation designed for use on a single node (shared_lowlevel.cc). These two implementations may be unified at some point.

In contrast to a typical MPI application, Legion prefers to have a single instance of the run-time running per node (as opposed to per process) which manages all of the node resources. Are there variations and exceptions?

Low-level Runtime Startup

A Legion application is typically started using a launcher (e.g. mpirun or GASNet launchers) that runs a copy of the application on every node in a system. When a Legion process is launched it will first configure the runtime (e.g. registering tasks) and then it will start the runtime:

int main() {
  return HighLevelRuntime::start(argc, argv);

The HighLevelRuntime::set_top_level_task_id is located in legion.cc and is only a thin wrapper around HighLevel::Runtime::set_top_level_task_id:

void HighLevelRuntime::set_top_level_task_id(Processor::TaskFuncID top_id)


void Runtime::set_top_level_task_id(Processor::TaskFuncID top_id)
  legion_main_id = top_id;

This level of indirection is primarily a result of several iterations of the APIs, and may be simplified in the future. A full discussion about the details of things like task registration and the associated data structures will be part of a future class on the llr? Note also that while task registration here is static, dynamic task registration will be supported in the future.

Once we've configured the runtime we start it. There are some static asserts at the beginning, and then the first thing we do is created a Machine instance.

int Runtime::start(int argc, char **argv, bool background)
  // Some static asserts that need to hold true for the runtime to work
  Machine *m = new Machine(&argc, &argv, 
     Runtime::get_task_table(true/*add runtime tasks*/), 
     Runtime::get_reduction_table(), false/*cps style*/);

A Machine instance contains lots of information about the (distributed) machine we are running on, not just the local node. Recall that the startup process being described is a result of the Legion process running on all nodes in the system, and as such, each Legion process will eventually have a copy of the Machine instance. That is, each Legion process knows about the structure of the entire system. We'll see how that information is obtained shortly.

The next thing that occurs during Runtime::start is that numerous high-level runtime options are parsed from the command line:

for (int i = 1; i < argc; i++)

Once this is complete we kick the machine to start running:

m->run(0, Machine::ONE_TASK_ONLY, 0, 0, background);
if (background)
  return 0;
  return -1;

When background is true we return to the caller, otherwise the Machine::run will take care of process termination. Returning to the caller is useful if there are other things that need to be done after starting Legion, such as managing threads or doing some MPI stuff.

LLR Bootstrap / Machine Startup

During runtime startup we created a Machine instance which stores all the details of the distributed machine. Two things are going on with Machine. During initialization we do all the discovery stuff. Then the runtime calls Machine::run which starts the ball rolling.

Discovery of machine details happen in the Machine constructor:

Machine::Machine(int *argc, char ***argv,
    const Processor::TaskIDTable &task_table,
    const ReductionOpTable &redop_table,
    bool cps_style /* = false */,
    Processor::TaskFuncID init_id /* = 0 */)
      : background_pthread(NULL)
  the_machine = this;

The the_machine can be considered a singleton. Next we stash away some of the parameters:

for(ReductionOpTable::const_iterator it = redop_table.begin();
  it != redop_table.end();
    reduce_op_table[it->first] = it->second;

In this case we are recording a mapping between reduction id and reduction implementation. We do the same thing for the set of tasks. Next up we parse all the low-level configuration flags from the command line:

for(int i = 1; i < *argc; i++) {
  INT_ARG("-ll:gsize", gasnet_mem_size_in_mb);
  INT_ARG("-ll:csize", cpu_mem_size_in_mb);
  INT_ARG("-ll:rsize", reg_mem_size_in_mb);

The next big thing that happens is that GASNet is initialized. In the high-performance llr implementation GASNet services as the fabric for inter-process messaging and memory exchange.

CHECK_GASNET( gasnet_init(argc, argv) );

Then we setup all of the GASNet active message handlers for the various types of messages that will be sent around between processes. There are a bunch of different message types:

gasnet_handlerentry_t handlers[128];
int hcount = 0;
hcount += NodeAnnounceMessage::add_handler_entries(&handlers[hcount], "Node Announce AM");
hcount += SpawnTaskMessage::add_handler_entries(&handlers[hcount], "Spawn Task AM");
hcount += LockRequestMessage::add_handler_entries(&handlers[hcount], "Lock Request AM");

There will be more information on the low-level runtime such as the active messages in a later discussion. Next we setup the actual low-level runtime. In this way, the Machine object effectively bootstraps the low-level runtime:

Runtime *r = Runtime::runtime = new Runtime;
r->nodes = new Node[gasnet_nodes()];

Next we create this thing called NodeAnnounceData announce_data; that will get filled up with the details of our local hardware like the different memories, processors, and GPUs. After we've filled in this data, we register the information with ourselves. Then we simultaneously broadcast our information to every other node as well as incorporate the announcement data from other nodes into our own view of the machine.

First, handle our own specs:

  AutoHSLLock al(announcement_mutex);
  parse_node_announce_data(adata, apos*sizeof(adata[0]), announce_data, false);

Then broadcast:

for(int i = 0; i < gasnet_nodes(); i++)
  if (i != gasnet_mynode())
    NodeAnnounceMessage::request(i, announce_data, adata,
      apos*sizeof(adata[0]), PAYLOAD_COPY);

And finally we wait for announcements from all other nodes in the system before proceeding. Note that during the handling of an announcement from a peer node, the details of that node are incorporating into our local view, so there is no more handling of that data explicitly during machine initialization.

while(announcements_received < (gasnet_nodes() - 1))

Those are the major points discussed regarding Machine setup. More details will be discussed in a later class. Next we'll go over Machine::run which is the second phase of low-level runtime startup.

void Machine::run(Processor::TaskFuncID task_id /*= 0*/,
    RunStyle style /*= ONE_TASK_ONLY*/,
    const void *args /*= 0*/, size_t arglen /*= 0*/,
    bool background /*= false*/)

Note the background parameter. If this is true then the first thing this method does is kick off a thread which will re-run this method immediately with background = false and then return. Next is the construction of worker threads for the various processors in the system. At this point things are rolling. So we don't do anything except wait for the "job" to complete:

while(running_proc_count.get() > 0) {

This loops around until there is nothing left to do. After that is just some clean-up. So, the natural question here is where did control go? How does the main task eventual start running? This is the domain of the high-level runtime bootstrapping process.

HLR Bootstrap

Recall that during Runtime::start the task table is passed to the constructor of the Machine instance that is created:

  Machine *m = new Machine(&argc, &argv, 
     Runtime::get_task_table(true/*add runtime tasks*/), 
     Runtime::get_reduction_table(), false/*cps style*/);

Notice that Runtime::get_task_table is passed true. If we look at this method:

Processor::TaskIDTable& Runtime::get_task_table(bool add_runtime_tasks /*= true*/)
  static Processor::TaskIDTable table;
  if (add_runtime_tasks)
  return table;

We see that when the task table is retrieved when creating the Machine instance that Runtime::register_runtime_tasks is called as a side effect. Digging a little further we see that some specific tasks are added to the task table:

void Runtime::register_runtime_tasks(Processor::TaskIDTable &table)
  table[INIT_FUNC_ID]          = Runtime::initialize_runtime;
  table[SHUTDOWN_FUNC_ID]      = Runtime::shutdown_runtime;
  table[HLR_TASK_ID]           = Runtime::high_level_runtime_task;

In particular, the special task with ID INIT_FUNC_ID is what is used to initialize the runtime. When a processor starts up it looks for such a task. In the following, INIT_FUNC_ID is equivalent to Processor::TASK_ID_PROCESSOR_INIT:

Processor::TaskIDTable::iterator it = task_id_table.find(Processor::TASK_ID_PROCESSOR_INIT);
if (it != task_id_table.end())
  (it->second)(0, 0, proc->me);

So in summary a processor gets spun up and first thing it does is looks for an initialization task to run. In this case it is Runtime::initialize_runtime. This routine eventually calls local_rt->launch_top_level_task before returning.

As part of the setup done in Runtime::initialize_runtime there is a Runtime instance created, referred to as local_rt, and then this is associated with each processor. It isn't clear what this is for, although it seems like it is the HLR, where as the Runtime instance created during Machine setup is the LLR. I think this will be discussed further in a future class.

We've finally reached Runtime::launch_top_level_task(Processor proc). This method does some basic context setup, and finally adds the task (remember legion_main_id ?) to the ready queue by calling add_to_ready_queue(proc, top_task, false/*prev failure*/); which then treats the top level task like any other task. Note that this method is only called once in the system. The uniqueness of the caller is determined by the check:

void Runtime::launch_top_level_task(Processor proc)
  if ((address_space == 0) && (proc == *(local_procs.begin()))) {

Phew. And that's it: the application is running.