Overview

Today we'll take a look at index space tasks.

About Legion Runtime Class

These notes are closely based on the set of Legion Runtime Class videos produced by the Legion developers. They are my own notes and code walks, and any errors or things that are just plain wrong represent my own mistakes.

Today's notes are based on the following video:

Index Task Launch

FutureMap Runtime::execute_index_space(Context ctx, 
    const IndexLauncher &launcher)

Future Runtime::execute_index_space(Context ctx, 
    const IndexLauncher &launcher, ReductionOpID redop)

The first version returns a FutureMap which stores the result from each task defined by the index space launch, that is, one result per point in the launch domain. The second form applies a reduction operator to all of the results and returns a single reduced Future.
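For a concrete reference point, here is a minimal application-side sketch of both forms, loosely following the style of the Legion tutorials and written as if it sits inside the application's top-level task. The task ID HELLO_TASK_ID and the reduction operator ID SUM_REDOP_ID are hypothetical placeholders that would need to be registered elsewhere.

  // Minimal sketch: launch HELLO_TASK_ID over an eight-point 1-D domain.
  Rect<1> rect(Point<1>(0), Point<1>(7));
  Domain launch_domain = Domain::from_rect<1>(rect);
  ArgumentMap arg_map;
  IndexLauncher launcher(HELLO_TASK_ID, launch_domain,
                         TaskArgument(NULL, 0), arg_map);

  // First form: one result per point in the launch domain.
  FutureMap fm = runtime->execute_index_space(ctx, launcher);
  fm.wait_all_results();

  // Second form: fold all point results into a single future with a
  // previously registered reduction operator (SUM_REDOP_ID is assumed).
  Future total = runtime->execute_index_space(ctx, launcher, SUM_REDOP_ID);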

When an index task is launched, the following is the common path (logic related to the task's speculation has been removed; recall that tasks are speculative operations):

FutureMap Runtime::execute_index_space(Context ctx,
    const IndexLauncher &launcher)
{
...
  IndexTask *task = get_available_index_task();
  FutureMap result = task->initialize_task(ctx, launcher,
      false/*check privileges*/);

  execute_task_launch(ctx, task);

  return result;
}

When an index space task is launched it starts off like any other task via execute_task_launch, and then differs by how it is treated as it progresses through the operational pipeline. The first stop is trigger_dependence_analysis.

Recall that dependence analysis is a serial operation that can be expensive. In order to avoid performing dependence analysis on all tasks represented by the index task launch domain, the runtime takes advantage of the fact that region requirements on the index space task are upper bound requirements on all sub-tasks that will be launched.

void IndexTask::trigger_dependence_analysis(void)
{
  begin_dependence_analysis();
  ...
  register_predicate_dependence();
  RegionTreeContext ctx = parent_ctx->get_context();
  for (unsigned idx = 0; idx < regions.size(); idx++)
  {
    runtime->forest->perform_dependence_analysis(ctx, this, idx, 
        regions[idx], privilege_paths[idx]);
  }
  end_dependence_analysis();
}

Above we see that dependence analysis occurs on each of the region requirements, rather than on all of the tasks that will be launched as a result of executing the index task. More generally, the runtime tries its best to avoid enumerating all of the tasks in the launch domain until the last possible moment.
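To make the upper-bound point concrete, here is roughly what the application side looks like: a single region requirement on the launcher names a logical partition together with a projection functor, and each point task later receives only its projected subregion. The task ID, partition, region, and field names here are hypothetical; dependence analysis only ever sees the one conservative requirement.

  // One requirement conservatively covers every point task; projection
  // functor 0 later narrows it to the subregion colored by each point.
  IndexLauncher launcher(DAXPY_TASK_ID, launch_domain,
                         TaskArgument(NULL, 0), arg_map);
  launcher.add_region_requirement(
      RegionRequirement(output_lp, 0/*projection functor*/,
                        WRITE_DISCARD, EXCLUSIVE, output_lr));
  launcher.add_field(0/*requirement index*/, FID_Z);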

Avoiding the enumeration of all points also extends to the mapper interface. That is, instead of mapping each point task separately, the mapper can map the entire index space at once. To see how this works let's look at MultiTask::trigger_execution. Note that MultiTask is the parent class of both IndexTask and SliceTask. We've been looking at the index task so far, and we'll see what a slice task is momentarily, but note that both share the same implementation of trigger_execution.
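For orientation, the relevant class relationships look roughly like this. This is a simplified sketch based on the discussion in these notes, not the actual declarations.

// Simplified sketch of the task class hierarchy as used in these notes.
struct Operation     {};                   // base of the operational pipeline
struct SpeculativeOp : Operation     {};   // tasks are speculative operations
struct TaskOp        : SpeculativeOp {};   // common task state and behavior
struct MultiTask     : TaskOp        {};   // shared index-launch logic
struct IndexTask     : MultiTask     {};   // pinned to the node that created it
struct SliceTask     : MultiTask     {};   // movable; covers a subset of the domain
struct SingleTask    : TaskOp        {};   // tasks that execute a single point
struct PointTask     : SingleTask    {};   // one enumerated point of a slice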

Like last time, we'll skip the remote case for now and look at the initial case where the task is being launched and trigger_execution occurs on the local node.

bool MultiTask::trigger_execution(void)
{
  bool success = true;
  if (is_remote())
  {
  ...

Here we are at the non-remote case. First we check if the task has been pre-mapped; if it has, we are good (fall through), otherwise pre-mapping needs to execute. A pre-mapping failure returns false to the caller and the task will be restarted later. Once the task has been pre-mapped, we check if it is locally mapped. For an individual task it can make sense to map locally, but for an index space task, mapping locally requires enumerating all of the point tasks locally and running them. In most cases we try to avoid this, and the code for that case below has been trimmed out.

  else
  {
    // Not remote, make sure it is premapped
    if (is_premapped() || premap_task())
    {
      if (is_locally_mapped())
      {
        ...
      }
      else
      {

Here we are. We are not mapped locally, so the first thing we attempt to do is distribute the task (that is, the mapper might choose to send the task elsewhere). Recall that distribute_task will return true if the task should run locally.

        if (distribute_task())
        {

The runtime maintains the invariant that an index task never leaves the node that it was created on. The version of MultiTask that can move around in the system is the SliceTask. A slice task is effectively identical to an index task, but may represent a subset of the full domain represented by an index task.

So, in IndexTask::distribute_task, if we are going to send the index task away we first clone it as a SliceTask. Since the entire index task is moving, the domain of the slice task will be the entire domain of the index task. Once the slice task arrives on the remote node it will be unpacked and enter the operational pipeline there as a slice task, effectively an identical clone of the original index task.

bool IndexTask::distribute_task(void)
{
  ...
  if (!is_sliced() && (target_proc != current_proc))
  {
    // Make a slice copy and send it away
    SliceTask *clone = clone_as_slice_task(index_domain, target_proc,
      true/*needs slice*/, spawn_task, 1LL);

    // Before we do this we have to chain the all children mapped event
    all_children_mapped.trigger(clone->get_children_mapped());

    runtime->send_task(target_proc, clone);
    return false; // We have now been sent away
  } else {
    return true; // Still local so we can be sliced
  }
}

Returning to MultiTask::trigger_execution, we have now run distribute_task, which returns true when the index task should be run locally. The next step is to see if the index space has been sliced. If it has not been sliced yet, we call slice_index_space:

          // Still local try slicing, mapping, and launching
          if (is_sliced())
          {
            success = map_and_launch();
          }
          else
            success = slice_index_space();
        }
      }
    }
    else // failed to premap
      success = false; 
  }

  return success;
} 

The method MultiTask::slice_index_space communicates with the mapper to construct the set of slices that decompose the index space into smaller SliceTask objects. First we set sliced to true to indicate that we've already run this method, and spawn_task to false to indicate that this task cannot be stolen (although the new slice tasks that will be created can be stolen). Finally we invoke the mapper to slice the domain of the index space task. The mapper will fill in the splits structure with a set of DomainSplit objects:

bool MultiTask::slice_index_space(void)
{
  sliced = true;
  spawn_task = false; // cannot steal something that has been sliced
  std::vector<Mapper::DomainSplit> splits;
  runtime->invoke_mapper_slice_domain(current_proc, this, splits);

Next we create a new SliceTask object for each of the domain splits requested by the mapper. We loop over these splits and clone the index space task as a slice task. This clone operation is effectively a 1-1 copy, except for the domain component: the domain of each slice task corresponds to its domain split.

All of the new slice tasks are collected in the slices data structure, and then the trigger_slices method is called.

  std::set<Event> all_slices_mapped;
  for (unsigned idx = 0; idx < splits.size(); idx++)
  {
    SliceTask *slice = this->clone_as_slice_task(
            splits[idx].domain,
            splits[idx].proc,
            splits[idx].recurse,
            splits[idx].stealable,
            splits.size());
    all_slices_mapped.insert(slice->get_children_mapped());
    slices.push_back(slice);
  }

  // Trigger our all children mapped as being done when all
  // the slices have all their children mapped
  all_children_mapped.trigger(Event::merge_events(all_slices_mapped));

  bool success = trigger_slices();

  ...

  return success;
}
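As a reference for the other side of invoke_mapper_slice_domain, a mapper's slicing hook might look something like the sketch below, which cuts a 1-D launch domain into one chunk per target processor. This assumes the old mapper interface's slice_domain method and a hypothetical cpu_procs member holding the target processors; the DomainSplit fields mirror the splits[idx] usage above, but the exact signatures may differ.

// Hedged sketch of a mapper slicing hook (old mapper interface assumed).
// Each DomainSplit names a sub-domain, a target processor, whether the
// runtime should recursively slice it again, and whether it is stealable.
void ExampleMapper::slice_domain(const Task *task, const Domain &domain,
                                 std::vector<Mapper::DomainSplit> &slices)
{
  Rect<1> rect = domain.get_rect<1>();
  const size_t num_procs = cpu_procs.size();   // assumed member: target CPUs
  const size_t chunk = (rect.volume() + num_procs - 1) / num_procs;
  size_t index = 0;
  for (int lo = rect.lo[0]; lo <= rect.hi[0]; lo += chunk, index++)
  {
    int hi = lo + (int)chunk - 1;
    if (hi > rect.hi[0])
      hi = rect.hi[0];
    Rect<1> piece(Point<1>(lo), Point<1>(hi));
    slices.push_back(Mapper::DomainSplit(Domain::from_rect<1>(piece),
          cpu_procs[index % num_procs],
          false/*recurse*/, true/*stealable*/));
  }
}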

The trigger_slices method will start the launching of all of the newly created slice tasks. The DeferredSlicer object contains a method that takes the slices as a parameter and then immediately dumps them onto utility processors to extract as much parallelism as possible from this task launch.

bool MultiTask::trigger_slices(void)
{
  DeferredSlicer slicer(this);
  return slicer.trigger_slices(slices);
}

Notice below that the slices object that we filled in is looped over to launch each slice asynchronously:

bool DeferredSlicer::trigger_slices(std::list<SliceTask*> &slices)
{
  std::set<Event> wait_events;
  {
    Processor util_proc = owner->runtime->find_utility_group();
    std::list<SliceTask*>::const_iterator it = slices.begin();
    DeferredSliceArgs args;
    args.hlr_id = HLR_DEFERRED_SLICE_ID;
    args.slicer = this;
    while (true) 
    {
      args.slice = *it;
      it++;
      bool done = (it == slices.end()); 
      Event wait = util_proc.spawn(HLR_TASK_ID, &args, sizeof(args)); 
      if (wait.exists())
        wait_events.insert(wait);
      if (done)
        break;
    }
  }
  ...

Eventually each slice task makes its way through the operational pipeline and arrives back at MultiTask::trigger_execution. Along the way distribute_task will be called and the task may be moved, and it may also be recursively decomposed into further slices, but eventually we reach the point where the task should run locally and has been sliced. In this case map_and_launch is called:

        if (distribute_task())
        {
          // Still local try slicing, mapping, and launching
          if (is_sliced())
          {
            success = map_and_launch();
          }
          else
            success = slice_index_space();
        }
      }

There are two variants of map_and_launch, one for index space tasks and one for slice tasks. The index space task version is called when slice tasks fail to map, in which case the failed slices are relaunched, so it is relatively simple. The map_and_launch version for slice tasks is where the good stuff happens.

The first thing that happens is that we enumerate the points in the index space domain that this slice task manages. For each of the points a PointTask is created and mapped. Since slice tasks can fail and be restarted, we check to see if we can avoid re-enumerating the points:

bool SliceTask::map_and_launch(void)
{
  bool map_success = true;

  // Mark that this task is no longer stealable.  Once we start
  // executing things onto a specific processor slices cannot move.
  spawn_task = false;

  // First enumerate all of our points if we haven't already done so
  if (points.empty())
    enumerate_points();

First the mapper is queried to see which (if any) variant of this task should be used when launching the points. Then we iterate over the domain, clone the slice task into a new point task specialized to each particular point, and add the new point task to the set of point tasks tracked by this slice task.

Finally we record some metadata about how many of these point tasks there are and what stage they are in. This is used by the slice task to manage its own progress through the pipeline by monitoring the state of its children.

void SliceTask::enumerate_points(void)
{
  runtime->invoke_mapper_select_variant(current_proc, this);

  // Enumerate all the points
  std::set<Event> all_points_mapped;
  for (Domain::DomainPointIterator itr(index_domain); itr; itr++)
  {
    PointTask *next_point = clone_as_point_task(itr.p);
    all_points_mapped.insert(next_point->get_children_mapped());
    points.push_back(next_point);
  }

  // Trigger our all children mapped based on all the points being mapped
  all_children_mapped.trigger(Event::merge_events(all_points_mapped));

  mapping_index = 0;
  // Mark how many points we have
  num_unmapped_points = points.size();
  num_uncomplete_points = points.size();
  num_uncommitted_points = points.size();
} 
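For completeness, this is roughly what one of these point tasks looks like from the application's point of view once it eventually runs: the task body recovers its position in the launch domain from task->index_point. The task name and the old-style HighLevelRuntime signature are assumptions here.

// Hypothetical point task body: each enumerated point reads its own
// coordinates out of task->index_point.
void hello_index_task(const Task *task,
                      const std::vector<PhysicalRegion> &regions,
                      Context ctx, HighLevelRuntime *runtime)
{
  const int point = task->index_point.point_data[0];
  printf("hello from point %d of the index launch\n", point);
}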

After the points are enumerated in map_and_launch, the point tasks are mapped. Notice that the first thing that happens is that a copy of the points structure we filled in when creating the point tasks is made on the stack. This addresses a race condition: when the last point task finishes it will clean up the parent slice task, and that could happen before we even return from this function. Thus we keep a copy of the state so that if this function races with the SliceTask cleanup we are still looking at valid data. I'm not a fan of this type of race condition. Perhaps this could be fixed with some dependency event chains?

Anyway... the mapping_index state variable tracks the last point task that was launched so that we avoid duplicate work if some point tasks fail and we have to restart them (we'll come back through this function as part of that restart procedure).

bool SliceTask::map_and_launch(void)
{
  ...

  std::vector<PointTask*> local_points(points.size()-mapping_index);
  for (unsigned idx = mapping_index; idx < points.size(); idx++)
    local_points[idx-mapping_index] = points[idx];

  for (std::vector<PointTask*>::const_iterator it = local_points.begin();
      it != local_points.end(); it++)
  {
    PointTask *next_point = *it;
    bool point_success = next_point->perform_mapping();
    if (!point_success)
    {
      map_success = false;    
      break;
    } else

    ...

Mapping a point task is handled by PointTask::perform_mapping. Notice that it is relatively simple, and we've even seen parts of it in past classes. In particular, the heart of it is map_all_regions, which we explored previously.

bool PointTask::perform_mapping(bool mapper_invoked)
{
  bool map_success = map_all_regions(target_proc, 
            point_termination, mapper_invoked);

  if (map_success && (num_virtual_mappings == 0))
  {
    // Tell our owner that we mapped
    slice_owner->record_child_mapped();

    // Mark that we ourselves have mapped
    complete_mapping();
  }

  return map_success;
}

The remote case for trigger_execution is quite simple once we've seen the local case:

bool MultiTask::trigger_execution(void)
{
  bool success = true;
  if (is_remote())
  {
    // distribute, slice, then map/launch
    if (distribute_task())
    {
      // Still local
      if (is_sliced())
      {
        if (is_locally_mapped())
        {
          launch_task();
        }
        else
        {
          // Try mapping and launching
          success = map_and_launch();
        }
      }
      else
        success = slice_index_space();
    }
  }

  ...

Now let's take a look at the locally mapped case. The mapper in this case has asked the runtime to map the index task on the local node before trying to distribute it (i.e. is_locally_mapped is true). Effectively, the task is sliced and enumerated prior to being distributed, versus being sliced, distributed, and then enumerated (the previous cases we've looked at):

        // Not remote, make sure it is premapped
        if (is_premapped() || premap_task())
        {
          if (is_locally_mapped())
          {
            if (is_sliced())
            {
              if (must_epoch != NULL)
                register_must_epoch();
              else
              {
                // See if we're going to send it
                // remotely.  If so we need to do
                // the mapping now.  Otherwise we
                // can defer the mapping until we get
                // on the target processor.
                if (!runtime->is_local(target_proc))
                {
                  if (perform_mapping())
                  {
                    distribute_task();
                  }
                  else // failed to map
                    success = false;
                }
   ...
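As a closing aside, the locally-mapped branch above is something the mapper opts into. In the old mapper interface this happens in select_task_options, roughly as sketched below; the field names are assumptions based on that interface, and local_proc is a hypothetical member of the mapper.

// Hedged sketch (old mapper interface assumed): asking the runtime to map a
// task on its origin node before distributing it, which drives the
// is_locally_mapped() branch shown above.
void ExampleMapper::select_task_options(Task *task)
{
  task->inline_task  = false;
  task->spawn_task   = false;       // not eligible for stealing
  task->map_locally  = true;        // map here, then (possibly) send remotely
  task->profile_task = false;
  task->target_proc  = local_proc;  // hypothetical member: a local processor
}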