About Legion Runtime Class

These notes are closely based on the set of Legion Runtime Class videos produced by the Legion developers. They are my own notes and code walks, and any errors or things that are just plain wrong represent my own mistakes.

Today's notes are based on the following video:

Overview

Today we'll look at how the Legion runtime manages the execution of operations. Recall that tasks in Legion can spin up new tasks, as well as launch operations. There are many different types of operations that can be executed, including things like inline mappings and data copies between regions. The runtime on each node in a Legion application contains an operation pipeline that is used to execute tasks and other operations.

Getting to the Pipeline

Most task launches start from something like Runtime::execute_task. Here we construct a task object (which is actually an operation of type TaskOp). We initialize the TaskOp and proceed to execute_task_launch.

Future Runtime::execute_task(Context ctx, const TaskLauncher &launcher)
{
  IndividualTask *task = get_available_individual_task();
  Future result = task->initialize_task(ctx, launcher, false/*check privileges*/);
  execute_task_launch(ctx, task);
  return result;
}
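
For orientation, this is the runtime side of the public task launch API. A hedged, application-side sketch (MY_TASK_ID and the integer argument are hypothetical) might look something like:

int x = 42;
TaskLauncher launcher(MY_TASK_ID, TaskArgument(&x, sizeof(x)));
Future f = runtime->execute_task(ctx, launcher);  // returns immediately
int result = f.get_result<int>();                 // blocks only here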

The first stage of the execution pipeline is dependence analysis. Here we add the task to the dependence analysis queue (note that I have removed a lot of code from these examples to highlight the code paths we are interested in today):

void Runtime::execute_task_launch(Context ctx, TaskOp *task)
{
  Processor proc = ctx->get_executing_processor();
  add_to_dependence_queue(proc, task);
}

Next we prepare a special runtime task that will perform the dependence analysis for us. The ID of the operation is given by HLR_TRIGGER_DEPENDENCE_ID, and a utility processor will handle the work for us. Note the dependence_preconditions data structure: dependence analysis is a serial operation, and this structure records program order, which is then used to force the tasks performing dependence analysis to execute in that order:

void ProcessorManager::add_to_dependence_queue(Operation *op)
{
  DeferredTriggerArgs args;
  args.hlr_id = HLR_TRIGGER_DEPENDENCE_ID;
  args.manager = this;
  args.op = op;
  ContextID ctx_id = op->get_parent()->get_context_id();
  AutoLock d_lock(dependence_lock);
  Event next = utility_proc.spawn(HLR_TASK_ID, &args, sizeof(args),
            dependence_preconditions[ctx_id]);
  dependence_preconditions[ctx_id] = next;
}
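
The chaining idiom here is worth calling out: each spawn takes the previous spawn's completion event as a precondition, and its own completion event becomes the precondition for the next spawn. Below is a standalone sketch of the same pattern using std::async and std::shared_future in place of utility processors and low-level events (this is illustrative C++, not Legion code):

#include <future>
#include <iostream>

int main()
{
  // An already-completed "event" to seed the chain.
  std::shared_future<void> precondition =
      std::async(std::launch::async, []{}).share();
  for (int i = 0; i < 4; i++)
  {
    // Each unit of work waits on the previous one's completion, so
    // the work runs serially in program order even though it is
    // launched asynchronously -- just like the dependence analysis
    // tasks chained through dependence_preconditions.
    precondition = std::async(std::launch::async, [i, precondition] {
      precondition.wait();
      std::cout << "dependence analysis for op " << i << std::endl;
    }).share();
  }
  precondition.wait();  // drain the chain before exiting
  return 0;
}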

How did this high-level runtime task come into existence? It is registered in Runtime::register_runtime_tasks, along with some other meta operations:

void Runtime::register_runtime_tasks(Processor::TaskIDTable &table)
{
  ...
  table[INIT_FUNC_ID]     = Runtime::initialize_runtime;
  table[SHUTDOWN_FUNC_ID] = Runtime::shutdown_runtime;
  table[HLR_TASK_ID]      = Runtime::high_level_runtime_task;
}

The Runtime::high_level_runtime_task function is what runs each time we launch a task with the HLR_TASK_ID task ID. Note that back in ProcessorManager::add_to_dependence_queue this is exactly the ID of the task we spawned, so eventually Runtime::high_level_runtime_task will be executed with the arguments provided when it was launched.

Note that Runtime::high_level_runtime_task is effectively a big switch statement that is used to multiplex a bunch of different types of work onto a single task. Not shown here for brevity is how tid is set. For the code path we are tracing today, this will be equal to HLR_TRIGGER_DEPENDENCE_ID set back in ProcessorManager::add_to_dependence_queue:

void Runtime::high_level_runtime_task(const void *args, size_t arglen, Processor p)
{
  ...
  switch (tid)
  {
    case HLR_SCHEDULER_ID: ...
    case HLR_MESSAGE_ID: ...
    case HLR_POST_END_ID: ...
    case HLR_TRIGGER_DEPENDENCE_ID:
    {
      const ProcessorManager::DeferredTriggerArgs *deferred_trigger_args =
         (const ProcessorManager::DeferredTriggerArgs*)args;
      deferred_trigger_args->op->trigger_dependence_analysis();
      break;
    }
    ...
}

Note that the HLR_TRIGGER_DEPENDENCE_ID case invokes the trigger_dependence_analysis method on the operation that was passed in. This marks the very beginning of the operation's execution in the pipeline.

Operation Type

Each operation in Legion defines how it progresses through the pipeline. We'll look at this in detail now. The type Operation defined in legion_ops.h is the root of a type hierarchy of all operations. All operations are defined in legion_ops.h and implemented in legion_ops.cc. The exception is the task, which is a special type of operation defined in legion_tasks.h. The portion of the Operation type that distinguishes one operation from another is the set of methods governing the operation's behavior in the execution pipeline. There is a trigger_xxx method for each stage of the pipeline, invoked when that stage is ready to run, and a corresponding complete_xxx method that is called when a stage has finished executing.

class Operation {
  ...
  public:
    virtual void trigger_dependence_analysis(void);
    virtual void trigger_mapping(void);
    virtual bool trigger_execution(void);
  ...
  public:
    void complete_mapping(void);
    void complete_execution(void);
  ...
};

When a new operation extends the Operation class, the default implementations of the virtual methods shown above give you an operation that can immediately run through the pipeline (as a no-op). This makes it relatively easy to get started developing new operations by incrementally customizing their execution stages.
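
To make that concrete, here's a hypothetical sketch of a minimal new operation (FenceLikeOp is made up for illustration, and this assumes only the Operation interface shown above):

// A minimal operation: only dependence analysis is customized;
// the inherited trigger_mapping/trigger_execution defaults let it
// flow through the rest of the pipeline as a no-op.
class FenceLikeOp : public Operation {
public:
  virtual void trigger_dependence_analysis(void)
  {
    begin_dependence_analysis();
    // Custom dependence recording would go here.
    end_dependence_analysis();
  }
};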

For example, the implementation of trigger_dependence_analysis for the base Operation class is trivial because it doesn't have anything to do. No dependence analysis is necessary because the base operation type has no region requirements, so it has no data dependencies on operations that came before it.

void Operation::trigger_dependence_analysis(void)
{
  begin_dependence_analysis();
  end_dependence_analysis();
}

In contrast, MapOp implements inline mapping, and performs dependence analysis on its region requirement:

void MapOp::trigger_dependence_analysis(void)
{
  begin_dependence_analysis();
  runtime->forest->perform_dependence_analysis(parent_ctx->get_context(),
         this, 0/*idx*/, requirement, privilege_path);
  end_dependence_analysis();
}
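
For reference, MapOp is the operation created behind the application-level inline mapping API. A hedged sketch of the application side (lr is a hypothetical logical region) looks something like:

InlineLauncher launcher(RegionRequirement(lr, READ_WRITE, EXCLUSIVE, lr));
PhysicalRegion pr = runtime->map_region(ctx, launcher);  // creates a MapOp
pr.wait_until_valid();  // blocks until the MapOp has mapped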

Another example is trigger_mapping. After the mapping stage has completed, the operation is ready to execute. For the base operation there is no mapping necessary, so it is immediately queued up for execution. We'll look at this case below in more detail. Right now we just want to get an overview of what kinds of things operation pipeline stages might be doing.

void Operation::trigger_mapping(void)
{
  runtime->add_to_local_queue(parent_ctx->get_executing_processor(),
                                  this, false/*prev fail*/);
}

Now we'll look at trigger_execution. Note that it returns a boolean value indicating success or failure. Failure may occur in cases where there are not enough resources for the operation to map. In that case the runtime will try again later.

bool Operation::trigger_execution(void)
{
  // Mark that we finished mapping
  complete_mapping();
  // If we have nothing to do also mark that we have completed execution
  complete_execution();
  // Return true indicating we successfully triggered
  return true;
}

When we return from these trigger_xxx methods the runtime doesn't assume that the stage has completed. The operation must explicitly call the complete_xxx methods (e.g. complete_mapping above) to indicate completion. This is useful because operations can defer completion and customize behavior. For instance, before marking the execution stage as complete, an operation may want to launch more low-level tasks, and only at some later point call complete_execution.
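
A hypothetical sketch of that deferred-completion pattern (DeferredOp, launch_low_level_work, and defer_complete_execution_until are all made up for illustration):

bool DeferredOp::trigger_execution(void)
{
  // The mapping stage really is done at this point.
  complete_mapping();
  // But execution isn't: kick off asynchronous low-level work and
  // arrange for complete_execution() to be called only once the
  // 'done' event triggers, rather than calling it here.
  Event done = launch_low_level_work();  // hypothetical helper
  defer_complete_execution_until(done);  // hypothetical helper
  return true;
}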

This is all very similar to the way out-of-order execution works in a processor: operations progress through different stages of the pipeline. Certain things like dependence analysis always come first, but other stages may occur out of order. For example, at a particular stage we may have already done the work for stages that logically occur later, and we can mark those stages as complete.

Let's look at Operation::complete_execution to make some of this more concrete. There is a lot of logic that is evaluated to decide what needs to happen next in the pipeline. For instance, we evaluate operation state to decide if it is time to trigger resolution or to trigger completion. I've removed some things from complete_execution to make the point clear (see the complete code for all the details).

void Operation::complete_execution(void)
{
  bool need_resolution = false;
  bool need_complete = false;
  ...
  // If we haven't been resolved and we've already mapped, check to see
  // if all of our speculation deps have been satisfied
  if (mapped && !resolved && (outstanding_speculation_deps == 0) &&
      !trigger_resolution_invoked)
  {
    trigger_resolution_invoked = true;
    need_resolution = true;
  }
  if (mapped && resolved && !trigger_complete_invoked)
  {
    trigger_complete_invoked = true;
    need_complete = true;
  }
  ...
  if (need_resolution)
    trigger_resolution();
  if (need_complete)
    trigger_complete();
}

The important thing to note here is that the complete_xxx methods are non-virtual, meaning their behavior is baked into the pipeline. It is unlikely that you will ever need to change these methods; if you find yourself doing so, there is probably another way to accomplish what you want.

For each complete_xxx operation there is similar logic that decides what should happen next. So, how do we actually get to the next stage in the pipeline? Let's look at Operation::trigger_mapping as an example. We call Runtime::add_to_local_queue:

void Operation::trigger_mapping(void)
{
  // Then put this thing on the ready queue
  runtime->add_to_local_queue(parent_ctx->get_executing_processor(),
             this, false /*prev fail*/);
}

After a quick stop in the runtime, we make it to the processor manager:

void ProcessorManager::add_to_local_ready_queue(Operation *op, bool prev_failure)
{
  TriggerOpArgs args;
  args.hlr_id = HLR_TRIGGER_OP_ID;
  args.manager = this;
  args.op = op;
  if (!prev_failure)
  {
    AutoLock l_lock(local_queue_lock);
    Event next = utility_proc.spawn(HLR_TASK_ID, &args, sizeof(args),
                              local_scheduler_preconditions[next_local_index]);
    ...
  }
  ...
}

Here we spawn a high-level runtime task just as we did above for starting dependence analysis. In this case the HLR_TRIGGER_OP_ID value selects the case in that giant switch statement, which actually calls op->trigger_execution. If execution fails for some reason, the operation goes back onto the ready queue to be retried later. Now that's pretty cool.

case HLR_TRIGGER_OP_ID:
  {
    // Key off of args here instead of data
    const ProcessorManager::TriggerOpArgs *trigger_args =
        (const ProcessorManager::TriggerOpArgs*)args;
    Operation *op = trigger_args->op;
    bool mapped = op->trigger_execution();
    if (!mapped)
    {
      ProcessorManager *manager = trigger_args->manager;
      manager->add_to_local_ready_queue(op, true/*failure*/);
    }
    break;
  }

So that's an overview of operations at a super high level. Next we'll briefly look at how the top-level task starts up. It's very similar.

Top-level Task

Execution of a Legion application begins with a single top-level task executing on one processor. Recall the method used to bootstrap the system with a single task, which I've annotated here with a high-level description of its operation:

void Runtime::launch_top_level_task(Processor proc)
{
   // Create TaskLauncher
   // Copy in task arguments
   // Set task arguments from mapper
   ...
   add_to_ready_queue(proc, top_task, false);
}

The important component for us here is the method add_to_ready_queue (sound familiar?), which is the entry point to the Legion operation pipeline (sorta... it's the entry point for the top-level task; see above for how other operations enter).

Here we have arrived at the starting point. The prev_fail parameter can be ignored for now; in the context of the top-level task we can assume there are no previous failures. The first thing we do is hand the task to the processor manager's ready queue.

void Runtime::add_to_ready_queue(Processor p, TaskOp *op, bool prev_fail)
{
  proc_managers[p]->add_to_ready_queue(op, prev_fail);
}

There is a processor manager for every low-level non-utility processor. Once a task is ready to run it gets handed to a processor manager, which places it on a ready queue (note that there is one ready queue per mapper, indexed by map_id):

void ProcessorManager::add_to_ready_queue(TaskOp *task, bool prev_failure)
{
  ...
  if (prev_failure)
    ready_queues[task->map_id].push_front(task);
  else
    ready_queues[task->map_id].push_back(task);
  ...
}

Note that the top-level task enjoys special treatment in the pipeline. In particular, it has no dependencies, and as such can skip the dependence analysis that all other tasks must suffer through.