Here is a very simple single-task Legion application that creates an inline mapping. We'll look at what happens when that mapping becomes too large for the available memory, and trace through the code behind the scenario.

The application itself creates a 1D array with a single 64-bit field. The top-level task performs an inline mapping and immediately destroys the mapping and exits.

#include "legion.h"

using namespace LegionRuntime::HighLevel;
using namespace LegionRuntime::Accessor;

enum TaskIDs {
  TOP_LEVEL_TASK_ID,
};

enum FieldIDs {
  FID_X
};

void top_level_task(const Task *task,
    const std::vector<PhysicalRegion>& regions,
    Context ctx, HighLevelRuntime *runtime)
{
  int num_elements = 1<<15;

  // Single "array" with num_elements points
  Rect<1> rect(Point<1>(0), Point<1>(num_elements-1));
  IndexSpace is = runtime->create_index_space(ctx, Domain::from_rect<1>(rect));

  // One field per array element of sizeof(uint64_t) bytes
  FieldSpace fs = runtime->create_field_space(ctx);
  FieldAllocator fa = runtime->create_field_allocator(ctx, fs);
  fa.allocate_field(sizeof(uint64_t), FID_X);

  // Create and map a logical region
  LogicalRegion lr = runtime->create_logical_region(ctx, is, fs);

  RegionRequirement req(lr, READ_WRITE, EXCLUSIVE, lr);
  req.add_field(FID_X);

  InlineLauncher launcher(req);
  PhysicalRegion region = runtime->map_region(ctx, launcher);

  region.wait_until_valid();

  runtime->unmap_region(ctx, region);

  runtime->destroy_logical_region(ctx, lr);
  runtime->destroy_field_space(ctx, fs);
  runtime->destroy_index_space(ctx, is);
}

int main(int argc, char **argv)
{
  HighLevelRuntime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
  HighLevelRuntime::register_legion_task<top_level_task>(TOP_LEVEL_TASK_ID,
      Processor::LOC_PROC, true, false);

  return HighLevelRuntime::start(argc, argv);
}

Notice the line at the beginning of the top-level task that defines the number of elements in the region: int num_elements = 1<<15;. The minimum amount of data that Legion should allocate for such a region is sizeof(uint64_t) * (1<<15) == 262144 bytes. When we run a Legion application we can control the amount of memory available at different levels, and we will use different configurations to create low-memory situations.

First we run in a configuration where there is sufficient memory available in the CPU memory, which we restrict to 1 MB with the -ll:csize option. The -ll:gsize option controls the size of the space available in the GASNet memory.

./example -ll:gsize 32 -ll:csize 1

Since 256 KB will fit into 1 MB, we shouldn't have any problems. We've turned on debugging and show only the relevant parts of the output. What we see here is that a chunk of memory is allocated in mem=60000000 and it matches the size we expect:

[0 - 7f7c88e8e700] {2}{malloc}: alloc partial block: mem=60000000 size=262144 ofs=786432
[0 - 7f7c88e8e700] {2}{inst}: local instance e0000000 created in memory 60000000 at offset 786432 (redop=0 list_size=-1 parent_inst=0 block_size=32768)
[0 - 7f7c88e8e700] {2}{meta}: instance created: region=a0000001 memory=60000000 id=e0000000 bytes=262144
...

Later in the trace the task completes and the memory for the region is reclaimed as well:

...
[0 - 7f7c88e8e700] {2}{meta}: instance destroyed: space=a0000001 id=e0000000
[0 - 7f7c88e8e700] {2}{inst}: destroying local instance: mem=60000000 inst=e0000000
[0 - 7f7c88e8e700] {2}{malloc}: free block: mem=60000000 size=262144 ofs=786432

What is this partial block business? In the Legion memory allocator, a free_blocks data structure is searched for a chunk of memory large enough for the requested region. If a block of exactly the right size is found, that block is preferred. Otherwise, a block large enough for the request is used and some leftover space remains, hence the partial allocation message. Nothing really to worry about.
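A minimal sketch of that idea (my own simplification, not the actual Legion allocator): keep free blocks keyed by offset, prefer an exact-size match, and otherwise carve the request out of the end of a larger block, leaving the remainder free — the "partial block" case. Note that carving from the end is consistent with the trace above: ofs=786432 is exactly 1 MB minus 256 KB.

```cpp
#include <cstddef>
#include <map>

// Simplified free-list allocator sketch. free_blocks maps
// offset -> size for each free chunk. Returns the offset of the
// allocated chunk, or -1 on failure (the analogue of NO_INST).
long alloc_block(std::map<long, size_t> &free_blocks, size_t size) {
  // First pass: prefer an exact-size match, consuming the block.
  for (auto it = free_blocks.begin(); it != free_blocks.end(); ++it) {
    if (it->second == size) {
      long ofs = it->first;
      free_blocks.erase(it);
      return ofs;
    }
  }
  // Second pass: take the tail of any block that is large enough,
  // shrinking it in place -- the "alloc partial block" case.
  for (auto it = free_blocks.begin(); it != free_blocks.end(); ++it) {
    if (it->second > size) {
      it->second -= size;
      return it->first + (long)it->second;  // allocate from the end
    }
  }
  return -1;  // alloc FAILED: no block large enough
}
```

With a single 1 MB free block at offset 0, a 256 KB request returns offset 786432, matching the log message, while a 2 MB request fails.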

The next case we'll consider is a region that is too large for the CPU memory, which we have restricted to 1 MB, but not larger than the 32 MB GASNet memory. For this we use 1<<18 elements, which corresponds to exactly 2 MB of memory. Again we show the relevant parts of the debugging messages, and this time we do see a failure:

[0 - 7fa3a5316700] {2}{malloc}: alloc FAILED: mem=60000000 size=2097152
[0 - 7fa3a5316700] {2}{meta}: instance created: region=a0000001 memory=60000000 id=0 bytes=2097152
...

But look: right after the failure we see a second allocation attempt for the exact same size, but in a different memory (mem=607f0000), and this time the allocation succeeds.

...
[0 - 7fa3a5316700] {2}{malloc}: alloc partial block: mem=607f0000 size=2097152 ofs=31457280
[0 - 7fa3a5316700] {2}{inst}: local instance e07f0000 created in memory 607f0000 at offset 31457280 (redop=0 list_size=-1 parent_inst=0 block_size=262144)
[0 - 7fa3a5316700] {2}{meta}: instance created: region=a0000001 memory=607f0000 id=e07f0000 bytes=2097152

Later, when the instance is destroyed, the free references the same memory that the successful allocation came from:

...
[0 - 7fa3a5316700] {2}{meta}: instance destroyed: space=a0000001 id=e07f0000
[0 - 7fa3a5316700] {2}{inst}: destroying local instance: mem=607f0000 inst=e07f0000
[0 - 7fa3a5316700] {2}{malloc}: free block: mem=607f0000 size=2097152 ofs=31457280

Now, what happens if we try to map a region that is larger than the 32 MB GASNet memory? We'll increase the region to 1<<23 elements, which ends up being a 64 MB region.

[0 - 7f3d82e7d700] {2}{malloc}: alloc FAILED: mem=60000000 size=67108864
[0 - 7f3d82e7d700] {2}{meta}: instance created: region=a0000001 memory=60000000 id=0 bytes=67108864
...

Here we fail again in the first memory. What is different is that the next attempt, in the second memory, now fails as well:

...
[0 - 7f3d82e7d700] {2}{malloc}: alloc FAILED: mem=607f0000 size=67108864
[0 - 7f3d82e7d700] {2}{meta}: instance created: region=a0000001 memory=607f0000 id=0 bytes=67108864
...

Don't fear, Legion will retry:

...
[0 - 7f3d82e7d700] {4}{default_mapper}: Notify failed mapping for operation ID 3 in default mapper for processor 80000001! Retrying...
[0 - 7f3d82e7d700] {2}{malloc}: alloc FAILED: mem=60000000 size=67108864
[0 - 7f3d82e7d700] {2}{meta}: instance created: region=a0000001 memory=60000000 id=0 bytes=67108864
[0 - 7f3d82e7d700] {2}{malloc}: alloc FAILED: mem=607f0000 size=67108864
[0 - 7f3d82e7d700] {2}{meta}: instance created: region=a0000001 memory=607f0000 id=0 bytes=67108864
[0 - 7f3d82e7d700] {4}{default_mapper}: Notify failed mapping for operation ID 3 in default mapper for processor 80000001! Retrying...

And will keep trying... But eventually it will give up, with a sort of helpful message:

[0 - 7f3d82e7d700] {5}{default_mapper}: Reached maximum number of failed mappings for operation ID 3 in default mapper for processor 80000001!  Try implementing a custom mapper or changing the size of the memories in the low-level runtime. Failing out ...
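The overall behavior — try each memory in order, notify the mapper on failure, retry, and finally give up after a bounded number of attempts — can be sketched like this (hypothetical names and a stand-in allocation callback, not the default mapper's actual code):

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Sketch of the retry loop. try_alloc stands in for per-memory
// instance creation; max_failed_mappings mirrors the default
// mapper's cutoff. Returns true if any attempt ever succeeds.
bool map_with_retries(const std::vector<std::string> &memories,
                      const std::function<bool(const std::string &)> &try_alloc,
                      int max_failed_mappings) {
  for (int attempt = 0; attempt < max_failed_mappings; ++attempt) {
    for (const std::string &mem : memories) {
      if (try_alloc(mem))
        return true;  // instance created, mapping succeeds
      printf("alloc FAILED: mem=%s\n", mem.c_str());
    }
    printf("Notify failed mapping! Retrying...\n");
  }
  printf("Reached maximum number of failed mappings! Failing out ...\n");
  return false;
}
```

With two memories that always fail and a cutoff of three, this makes six allocation attempts and then gives up, mirroring the repeated pairs of alloc FAILED lines in the trace above.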

Code Walk

How does this work? I mean, how do we get from map_region to the out-of-memory situation? I'm no Legion expert, but I've been studying it quite a bit. I'll sketch the general path here, but there are a lot of details I'm skipping over.

The first step in our journey is MapOp::trigger_execution, which is the execution stage of the pipeline for mapping operations (the magic sauce for inline mappings). How did we get here? That's another story, but just realize that an inline mapping is realized through the execution of a mapping operation. Since a mapping operation effectively decides what memory some data should be put in, it first consults the relevant installed mapper through the runtime:

notify = runtime->invoke_mapper_map_inline(local_proc, this);

The map_inline callback of the mapper is invoked as a result (in this case we consider the default mapper). The map_inline call is responsible for a number of things, but here we consider its role in expressing a preference for which memories the mapping operation should use. In map_inline we call find_memory_stack on the associated machine. This little thing looks up the memories that are visible to a particular processor, and may sort them according to a metric such as latency.

void MachineQueryInterface::find_memory_stack(Processor proc,
    std::vector<Memory> &stack, bool latency)
{
  std::map<Processor,std::vector<Memory> >::iterator finder = proc_mem_stacks.find(proc);
  if (finder != proc_mem_stacks.end()) {
    stack = finder->second;
    if (!latency)
      MachineQueryInterface::sort_memories(machine, proc, stack, latency);
    return;
  }
  MachineQueryInterface::find_memory_stack(machine, proc, stack, latency); 
  proc_mem_stacks[proc] = stack;
  if (!latency)
    MachineQueryInterface::sort_memories(machine, proc, proc_mem_stacks[proc], latency);
}

So, in the case that the stack for the processor isn't found, we just grab the visible memories:

void MachineQueryInterface::find_memory_stack(Machine *machine, 
    Processor proc, std::vector<Memory> &stack, bool latency)
{
  const std::set<Memory> &visible = machine->get_visible_memories(proc);
  stack.insert(stack.end(),visible.begin(),visible.end());
  MachineQueryInterface::sort_memories(machine, proc, stack, latency);
}
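The caching pattern in find_memory_stack — look up a per-processor result and compute-and-store it on a miss — is plain memoization. A generic standalone sketch of the same idea, with a stand-in for the visible-memory query:

```cpp
#include <map>
#include <vector>

// Generic memoized lookup in the style of find_memory_stack: the
// cache maps a key (a processor id here) to its computed stack;
// on a miss we compute once and remember the result.
std::map<int, std::vector<int>> stack_cache;
int compute_count = 0;  // how many times we actually computed

std::vector<int> find_stack(int proc) {
  auto finder = stack_cache.find(proc);
  if (finder != stack_cache.end())
    return finder->second;  // cache hit: reuse the prior result
  ++compute_count;
  // Stand-in for the expensive visible-memory query and sort.
  std::vector<int> stack = {proc, proc + 1};
  stack_cache[proc] = stack;  // remember for next time
  return stack;
}
```

This is why repeated inline mappings on the same processor don't pay the machine-query cost each time: only the first lookup per processor computes the stack.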

Now back in MapOp::trigger_execution we have a set of memories in some order of preference. Next we do two things. First we call runtime->forest->map_physical_region which gives us a MappingRef. The documentation says this about MappingRef:

This class keeps a valid reference to a physical instance that has been allocated and is ready to have dependence analysis performed. Once all the allocations have been performed, then an operation can pass all of the mapping references to the RegionTreeForest to actually perform the operations necessary to make the region valid and return an InstanceRef.

So, based on what I've seen, map_physical_region grabs all the necessary resources (like memory) for the region. Later, register_physical_region is called, and that actually does things like move data around and make it usable. Anyway... how do we get to those failed memory allocations? We hit the out-of-memory error before making it to register_physical_region, so we restart the journey in map_physical_region.

This thing called a region tree is important in Legion. It's a tree. There is a logical tree and a physical tree. I think the physical tree tracks where data is, but I'm still not completely familiar with it. In any case, we can traverse it. To do that we use a MappingTraverser, a subclass of PathTraverser, which uses the visitor pattern to implement callbacks for visiting a region and visiting a partition. Here we visit a region:

bool MappingTraverser::visit_region(RegionNode *node)
{
  if (!has_child)
  {
    // Now we're ready to map this instance
    // Separate paths for reductions and non-reductions
    if (!IS_REDUCE(info->req))
    {
      // See if we can get or make a physical instance
      // that we can use
      return map_physical_region(node);
    }
    else
    {
      // See if we can make or use an existing reduction instance
      return map_reduction_region(node);
    }
  }
  else
  {
    // Still not there yet, traverse the node
    traverse_node(node);
    return true;
  } 
}
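Stripped of Legion's details, the traversal shape is the classic visitor pattern: each node type dispatches back to the matching visit_region or visit_partition callback. A minimal standalone sketch (my own toy types, not Legion's):

```cpp
#include <string>
#include <vector>

// Minimal visitor-pattern sketch of a region/partition tree.
struct Visitor;

struct Node {
  virtual ~Node() {}
  virtual bool accept(Visitor &v) = 0;
  std::vector<Node *> children;
};

struct RegionNode : Node { bool accept(Visitor &v) override; };
struct PartitionNode : Node { bool accept(Visitor &v) override; };

struct Visitor {
  std::vector<std::string> trail;  // record of what we visited
  bool visit_region(RegionNode *) { trail.push_back("region"); return true; }
  bool visit_partition(PartitionNode *) { trail.push_back("partition"); return true; }
  // Walk a node and then its subtree, the way a PathTraverser
  // keeps descending until there is no child left.
  bool traverse(Node *n) {
    if (!n->accept(*this)) return false;
    for (Node *c : n->children)
      if (!traverse(c)) return false;
    return true;
  }
};

bool RegionNode::accept(Visitor &v) { return v.visit_region(this); }
bool PartitionNode::accept(Visitor &v) { return v.visit_partition(this); }
```

In the real MappingTraverser, reaching the bottom of the path (no has_child) is what triggers the actual mapping work inside the visit callback.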

We are interested in the case where we are mapping a region without reduction privileges, so we call map_physical_region on the node. What is the first thing we do in map_physical_region? We get a reference to the memory preferences that the mapper specified:

std::vector<Memory> &chosen_order = info->req.target_ranking;

That's encouraging. A bunch of stuff goes on in here, but it's primarily one giant for loop that examines each memory specified by the mapper and performs filtering, among other things:

// Go through each of the memories provided by the mapper
for (std::vector<Memory>::const_iterator mit = chosen_order.begin();
    mit != chosen_order.end(); mit++)
{

This is a large and complicated loop, but at the bottom of it we have:

chosen_inst = node->create_instance(*mit, new_fields, 
    blocking_factor, info->mappable->get_depth());
if (chosen_inst != NULL)
{
  // We successfully made an instance
  needed_fields = user_mask;
  break;
}
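The structure of that loop — walk the mapper's ranked memories and stop at the first one where instance creation succeeds — can be sketched as follows (capacities stand in for the real create_instance call; the names are mine):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Sketch of the memory-selection loop in map_physical_region.
// Memories arrive pre-ranked by the mapper; a request fits if it
// is no larger than the memory's capacity. Returns the name of
// the first memory that can hold `size`, or "" on total failure
// (the analogue of a NULL chosen_inst).
std::string choose_memory(
    const std::vector<std::pair<std::string, size_t>> &ranked,
    size_t size) {
  for (const auto &mem : ranked) {
    if (size <= mem.second)   // create_instance succeeded
      return mem.first;       // break out with the chosen instance
  }
  return "";  // every memory failed; the mapper will be notified
}
```

With a 1 MB "cpu" memory ranked ahead of a 32 MB "gasnet" memory, this reproduces the three scenarios from the traces: 256 KB lands in the CPU memory, 2 MB falls through to GASNet, and 64 MB fails everywhere.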

So we try to create some sort of instance in a memory, and if successful we break out of the for loop. Now we go down a fairly deep hierarchy. Above, node is of type RegionNode, and the relevant method is RegionNode::create_instance, which turns around and calls create_instance on a FieldSpaceNode.

MaterializedView* RegionNode::create_instance(Memory target_mem,
    const std::set<FieldID> &fields,
    size_t blocking_factor,
    unsigned depth)
{
  InstanceManager *manager = column_source->create_instance(target_mem,
      row_source->domain,
      fields,
      blocking_factor, 
      depth, this);
...

And here we arrive at FieldSpaceNode::create_instance:

InstanceManager* FieldSpaceNode::create_instance(Memory location,
    Domain domain,
    const std::set<FieldID> &create_fields,
    size_t blocking_factor,
    unsigned depth,
    RegionNode *node)
{
...
  if (!inst.exists())
    inst = domain.create_instance(location, field_sizes, blocking_factor);
...

Again, this routine does a bunch of stuff, but eventually ends up (in some cases) calling Domain::create_instance. This puts us down in the low-level parts of Legion, where we actually deal with some memory objects:

RegionInstance Domain::create_instance(Memory memory,
    const std::vector<size_t> &field_sizes, size_t block_size,
    ReductionOpID redop_id) const
{

Based on the shape of the region and the sizes of its fields, various parameters are computed, and we ask the memory implementation to create a new region instance for us:

RegionInstance i = m_impl->create_instance(get_index_space(),
    linearization_bits, inst_bytes, block_size, elem_size,
    field_sizes, redop_id, -1 /*list size*/, RegionInstance::NO_INST);

This is where we end our code walk for now. The target memory (e.g. GASNet or the local CPU heap) tries to allocate enough memory for the instance, and returns RegionInstance::NO_INST if it cannot.

Now we can unwind the stack all the way back to map_physical_region, where we tried to create the instance. If it failed, we try the next memory in that big for loop:

chosen_inst = node->create_instance(*mit, new_fields, 
    blocking_factor, info->mappable->get_depth());
if (chosen_inst != NULL)
{
  // We successfully made an instance
  needed_fields = user_mask;
  break;
}

From here, map_region will finish mapping the region for the task if the memory allocation succeeded. But that is for another post.