About Legion Runtime Class

These notes are closely based on the set of Legion Runtime Class videos produced by the Legion developers. They are my own notes and code walks, and any errors or things that are just plain wrong represent my own mistakes.

Today's notes are based on the following video:

Overview

This class will discuss how application-level tasks execute, wrapping up much of the control structures for the high-level runtime.

Task Types

All tasks are instances of Operation and move through the standard pipeline stages. Specifically, tasks are of type TaskOp, which is a direct subclass of SpeculativeOp. Task operations are defined in legion_task.h. A short sketch of how these types arise from the application-level API follows the list below.

  • The TaskOp operation is an abstract class that extends SpeculativeOp. All tasks are of type TaskOp.
  • The SingleTask type is a TaskOp and represents a single application-level task running at some point in time. The Context we have seen is a typedef of SingleTask*.
  • The MultiTask is a TaskOp and is a special task that represents a collection of SingleTasks.
  • An IndividualTask is a SingleTask and is created when a single task is launched.
  • The PointTask is a SingleTask and is designed to operate within the context of an index space launch where each point in the space represents a task.
  • A WrapperTask is a SingleTask. A wrapper task is used to represent a task that does not actually progress through the pipeline itself. Examples include:
    • A RemoteTask is a proxy that tracks a remote task (e.g. a parent task).
    • An InlineTask executes directly within a parent context and shares its resources.
  • An IndexTask is a MultiTask and is used during an index space launch.
  • A SliceTask is a MultiTask that represents a domain slice specified by a mapper via the slice_domain interface.
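
To connect these types back to the application-level API: launching a single task creates an IndividualTask behind the scenes, while an index space launch creates an IndexTask that the mapper may slice into SliceTasks, each of which expands into PointTasks. Here is a minimal sketch from the application's point of view, written against the tutorial-era API; the task ID and launch domain are made up for illustration, and the launcher details may differ between Legion versions:

#include "legion.h"
using namespace LegionRuntime::HighLevel;
using namespace LegionRuntime::Arrays;

// HELLO_TASK_ID is hypothetical; assume it was registered during startup.
enum { HELLO_TASK_ID = 1 };

void parent_task(const Task *task,
                 const std::vector<PhysicalRegion> &regions,
                 Context ctx, HighLevelRuntime *runtime)
{
  // A single task launch: the runtime creates an IndividualTask behind the scenes.
  TaskLauncher single_launcher(HELLO_TASK_ID, TaskArgument(NULL, 0));
  runtime->execute_task(ctx, single_launcher);

  // An index space launch over 16 points: the runtime creates an IndexTask,
  // which the mapper may slice into SliceTasks, each expanding into PointTasks.
  Rect<1> launch_bounds(Point<1>(0), Point<1>(15));
  Domain launch_domain = Domain::from_rect<1>(launch_bounds);
  ArgumentMap arg_map;
  IndexLauncher index_launcher(HELLO_TASK_ID, launch_domain,
                               TaskArgument(NULL, 0), arg_map);
  runtime->execute_index_space(ctx, index_launcher);
}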

Task Execution

Most tasks implement the standard set of callbacks for the operational pipeline stages (e.g. dependence analysis), just like other operations that we have seen such as CopyOp or MapOp. Things become more interesting for task operations when we look at the trigger_execution phase that is called once a task is ready to map.
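
As a roadmap for the rest of the walkthrough, here is a rough sketch of the task methods this class exercises. The signatures are taken from (or implied by) the snippets below; the real declarations carry many more members and virtual overrides:

// Methods followed in this walkthrough (rough sketch, not the full declarations):
bool trigger_execution(void);   // pipeline stage invoked once the task is ready to map
bool premap_task(void);         // premapping; on failure the runtime retries the task later
bool distribute_task(void);     // true if the task stays local, false if it was sent away
bool perform_mapping(bool mapper_invoked);          // pick a variant and map all regions
bool map_all_regions(Processor target, Event user_event,
                     bool mapper_invoked);          // map each region requirement
void launch_task(void);         // compute preconditions, build the context, spawn the task
bool pack_task(Serializer &rez, Processor target);      // marshal a task for another node
bool unpack_task(Deserializer &derez, Processor current);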

The first thing that happens in trigger_execution for a SingleTask is a check of whether the task is running remotely or locally on the original node. We'll look at the local case first and come back to the remote case.

bool SingleTask::trigger_execution(void)
{
  bool success = true;
  if (is_remote())
  {
    // this case covered below.

Here we are in the local case, meaning the task is running on the original processor from which it was launched. First we check to see if the task has been premapped. If premapping fails, then at the bottom of trigger_execution we return false to the caller to indicate failure, and the runtime will retry the task later. We'll also skip the must_epoch case today:

  else // !is_remote()
  {
    // Not remote
    if (is_premapped() || premap_task())
    {
      // See if we have a must epoch in which case
      // we can simply record ourselves and we are done
      if (must_epoch != NULL)
        must_epoch->register_single_task(this, must_epoch_index);
      else

So in the common case premapping is complete, and we look to see if the mapper has requested that the task be mapped locally. If this is the case and the target processor is remote then we perform the mapping locally and call distribute_task to send the task to a remote node. Note that it is valid for a task to map on a node other than where it is executed, hence the logic:

      {
        // See if this task is going to be sent
        // remotely in which case we need to do the
        // mapping now, otherwise we can defer it
        // until the task ends up on the target processor
        if (is_locally_mapped() && !runtime->is_local(target_proc))
        {
          if (perform_mapping())
          {
              distribute_task();
          }
          else // failed to map
            success = false; 
        }

In the next case the mapper hasn't requested a local mapping, so we need to either send the task away or map and run it locally. The distribute_task call will return true if the task is targeted at the current processor, and false if it was sent to a different processor. When the task stays local, the mapping is performed and the task is launched.

        else
        {
          if (distribute_task())
          {
            // Still local so try mapping and launching
            if (perform_mapping())
            {
              // Still local and mapped so
              // we can now launch it
              launch_task();
            }
            else // failed to map
              success = false;
          }
        }
      }
    }
    else // failed to premap
      success = false;
  }

  return success;
} 
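
Putting the interleaved pieces back together, the local path boils down to the following shape (a condensed paraphrase of the snippets above, with the must_epoch branch omitted):

// Condensed shape of the !is_remote() path in SingleTask::trigger_execution:
//
//   premapping fails                         -> return false; the runtime retries later
//   locally mapped && target_proc not local  -> perform_mapping() here, then
//                                               distribute_task() to the target node
//   otherwise                                -> distribute_task(); if the task is still
//                                               local, perform_mapping() and launch_task()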

Whenever a task is first launched, the runtime executes this local path in trigger_execution. The remote case only arises when the machinery in distribute_task sends the task away, at which point it is injected into the operational pipeline of the runtime on another node.

Here is the distribute_task call that sends the task to a different target processor:

bool IndividualTask::distribute_task(void)
{
  if (target_proc != current_proc) {
    runtime->send_task(target_proc, this);
    return false;
  }
  return true;
}

The first thing that happens in send_task is a check of whether the target processor is on the local node. If it is, things are simple: the task is added to the ready queue, where eventually it will be chosen, trigger_execution will run, and we'll be right back where we started.

void Runtime::send_task(Processor target, TaskOp *task)
{
  // Check to see if the target processor is still local 
  std::map<Processor,ProcessorManager*>::const_iterator finder = 
      proc_managers.find(target);
  if (finder != proc_managers.end()) {
    // Update the current processor
    task->current_proc = target;
    finder->second->add_to_ready_queue(task,false/*previous failure*/);
  }

If the target processor is on a remote node then we package up the task and send it via the MessageManager for the target processor. The manager provides a send_task call that handles task messages:

  else {
    MessageManager *manager = find_messenger(target);
    Serializer rez;
    bool deactivate_task;
    {
      RezCheck z(rez);
      rez.serialize(target);
      rez.serialize(task->get_task_kind());
      deactivate_task = task->pack_task(rez, target);
    }
    // Put it on the queue and send it
    manager->send_task(rez, true/*flush*/);
    if (deactivate_task)
      task->deactivate();
  }
}

We've seen bits of this before. The message manager packages up the task message and sends it out over the network.

void MessageManager::send_task(Serializer &rez, bool flush)
{
  package_message(rez, TASK_MESSAGE, flush);
}

What we haven't seen yet is how these messages are handled on receipt. All of the message types are dispatched through the handle_messages call. For each message being handled, the message type and a set of arguments are decoded (omitted here). Then a switch statement selects the action based on the message type. We've clipped most of the method, showing only the TASK_MESSAGE case, which calls Runtime::handle_task.

void MessageManager::handle_messages(unsigned num_messages,
   const char *args, size_t arglen)
{
  for (unsigned idx = 0; idx < num_messages; idx++) {
    MessageKind kind = *((const MessageKind*)args);
    args += sizeof(kind);
    // ... deconstruct message ...
    Deserializer derez(args,message_size);
    switch (kind) {
    case TASK_MESSAGE:
    {
      runtime->handle_task(derez);
      break;
    }
...

Let's jump now to handle_task:

void Runtime::handle_task(Deserializer &derez)
{
  TaskOp::process_unpack_task(this, derez);
}

So the TaskOp knows how to deserialize itself. We've been following the single task case today; for that type of task, as soon as it is deserialized into a local task operation structure it is put onto the ready queue. Remember we are on the remote side now, so it is placed into the pipeline of the remote node's runtime, where it will be picked up and eventually reach trigger_execution again, this time taking the remote path.

void TaskOp::process_unpack_task(Runtime *rt, Deserializer &derez)
{
  // Figure out what kind of task this is and where it came from
  DerezCheck z(derez);
  Processor current;
  derez.deserialize(current);
  TaskKind kind;
  derez.deserialize(kind);
  switch (kind)
  {
    case INDIVIDUAL_TASK_KIND:
    {
      IndividualTask *task = rt->get_available_individual_task();
      if (task->unpack_task(derez, current))
        rt->add_to_ready_queue(current, task, false/*prev fail*/);
      break;
    }
...

So this task will eventually pop back out in trigger_execution, and is_remote() will evaluate to true. The first thing we do in this case is give the runtime and mapper the opportunity to redistribute the task. This is worth noting because tasks in Legion can move around arbitrarily, including back to the original node; if the task is redistributed back to its original node, the machinery ensures that it eventually executes the local case again.

If the task will execute on the remote node then we optionally map the task (if it wasn't already mapped) and then launch it.

bool SingleTask::trigger_execution(void)
{
  bool success = true;
  if (is_remote())
  {
    if (distribute_task())
    {
      // Still local
      if (is_locally_mapped())
      {
        // Remote and locally mapped means
        // we were already mapped so we can
        // just launch the task
        launch_task();
      }
      else
      {
        // Remote but still need to map
        if (perform_mapping())
        {
          launch_task();
        }
        else // failed to map
          success = false;
      }
    }
    // otherwise it was sent away
  }
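  else
  {
    // ... the !is_remote() case walked through at the top of these notes ...
  }

  return success;
}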

perform_mapping

The perform_mapping call seen above in trigger_execution completes the mapping stage of the pipeline for the individual task operation. The first thing it does is ask the mapper which variant of the task should be run (e.g. a CPU or GPU implementation). After that, map_all_regions is called, which is responsible for mapping data into physical instances for the task. Next, if we mapped successfully, we mark some state indicating that the task can no longer be stolen, and then send a message back to the owning node (if we are remote) announcing where we ended up mapping. We won't talk today about virtual mappings or invalidations. Finally, the complete_mapping call is made, which, as we have seen before, is a method on the Operation base class that marks the completion of the mapping stage of the pipeline.

bool IndividualTask::perform_mapping(bool mapper_invoked)
{
  // Before we try mapping the task, ask the mapper to pick a task variant
  runtime->invoke_mapper_select_variant(current_proc, this);

  // Now try to do the mapping, we can just use our completion
  // event since we know this task object will be active
  // throughout the duration of the computation
  bool map_success = map_all_regions(target_proc, get_task_completion(), mapper_invoked);

  // If we mapped, then we are no longer stealable
  if (map_success)
    spawn_task = false;

  // If we succeeded in mapping and everything was mapped
  // then we get to mark that we are done mapping
  if (map_success && (num_virtual_mappings == 0)) {
    if (is_remote()) {
      // Send back the message saying that we finished mapping
      Serializer rez;
      pack_remote_mapped(rez);
      runtime->send_individual_remote_mapped(orig_proc, rez);
    } else
      issue_invalidations(runtime->address_space, false/*remote*/);

    // Mark that we have completed mapping
    complete_mapping();
  }
  return map_success;
}

map_all_regions

The most important bits in perform_mapping are found in map_all_regions. This is a generic call implemented on the SingleTask object and is used by all tasks to map the region requirements that were requested for the task. I've trimmed out a lot of the code to show just the parts relevant to this discussion. The first thing that happens is that we ask the mapper to pick a ranking of memories for the physical mappings.

bool SingleTask::map_all_regions(Processor target, Event user_event, bool mapper_invoked)
{
  // ...
  if (!mapper_invoked)
    notify = runtime->invoke_mapper_map_task(current_proc, this);

Next we start looping over all of the region requirements for the task. There are several cases not shown here, such as checking whether the region is already mapped; if it isn't, the region forest is asked to create a mapping, which returns a reference. If there is a failure we break out of the loop and clean up any mapped regions we created, because we can't launch the task until all regions are mapped successfully.

  for (unsigned idx = 0; idx < regions.size(); idx++)
  {
    // ...

    // Otherwise we're going to do an actual mapping
    mapping_refs[idx] = runtime->forest->map_physical_region(
        enclosing_physical_contexts[idx],
        mapping_paths[idx],
        regions[idx],
        idx,
        this,
        current_proc,
        target);
    if (mapping_refs[idx].has_ref() || IS_NO_ACCESS(regions[idx]))
    {
      virtual_mapped[idx] = false;
    }
    else
    {
      // Otherwise the mapping failed so break out
      map_success = false;
      regions[idx].mapping_failed = true;
      break;
    }
  }

When mapping succeeds, we register the task as a user of the region, which ensures that all of the physical instances are updated with valid data.

      physical_instances[idx] = 
        runtime->forest->register_physical_region(
            enclosing_physical_contexts[idx],
            mapping_refs[idx],
            regions[idx],
            idx,
            this,
            current_proc,
            user_event);

There is a lot more to this method, such as additional mapper notifications of failure and success scenarios. If everything worked out we return success to the caller. We'll return to this function in later classes.

launch_task

The final method from trigger_execution that we'll look at is launch_task. This is called to launch any type of application task. This is a huge function, and we won't go into too much detail in this class. There are 3 major steps in the function:

  1. Compute the set of low-level event pre-conditions that must be met before the task can be launched. This includes things such as:

    • Regions have valid data
    • Locks and reservations
    • Futures have valid data
    • Phase barriers

    All of these completion events are collected in the wait_on_events structure:

    void SingleTask::launch_task(void)
    {
      // ...
      // STEP 1: Compute the precondition for the task launch
      std::set<Event> wait_on_events;
    

    We'll skip over all of the different types of pre-conditions, but here is an example of where the Future completion events are added to this set:

      // Now get all the other preconditions for the launch
      for (unsigned idx = 0; idx < futures.size(); idx++)
      {
        Future::Impl *impl = futures[idx].impl; 
        wait_on_events.insert(impl->get_ready_event());
      }
    

    Once all of the pre-conditions are computed they are merged into a single event:

      // Merge together all the events for the start condition 
      Event start_condition = Event::merge_events(wait_on_events);
    
  2. The second step is to construct the context for the task, which includes the structures used to track all of the sub-operations that the task might launch.

  3. The final step is to launch the task with the pre-conditions that were computed:

      Event task_launch_event = launch_processor.spawn(low_id, &proxy_this,
                   sizeof(proxy_this), start_condition, task_priority);