Introduction
In this article we will discuss how to execute a large number of independent programs in parallel, using a new computer chip capable of executing thousands of independent calculations at once. This may be particularly useful in areas of stochastic modelling and Monte Carlo simulation, where repeated and independent simulations are required to learn about uncertainty in model predictions. The chip in question is called the Intelligence Processing Unit (IPU) and is developed by Graphcore, a UK-based company headquartered in Bristol. The IPU is now (as of February 2023) on its second generation, the Colossus MK2 IPU processor, also known as the GC200.
Although the IPU has many possible applications where its scalability can reduce computational expense, we will use the example of a population dynamics simulation model: a discrete time, discrete state space stochastic process - a pure death process. This is a simple model, but it highlights some key aspects of running programs on the IPU and belongs to an area of modelling that can benefit greatly from the IPU’s design. The model assumes that within a given time period there is a probability, $p$, that one individual in the population dies, otherwise the population remains unchanged. No individuals can be born or migrate into the population.
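In other words, writing $N_t$ for the population size at time $t$ and taking a time step of length $\Delta t$, the one-step transition probabilities are
$$P(N_{t+\Delta t} = n - 1 \mid N_t = n) = p, \qquad P(N_{t+\Delta t} = n \mid N_t = n) = 1 - p, \qquad n \geq 1,$$
and the population stays at zero once it reaches zero.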
Before moving on, it is important to note that this article assumes some familiarity with C++, as this is the only language in which bespoke IPU applications can be written. However, it is by no means necessary to be an expert in C++, and the most complex topic used here is pointer arithmetic. The focus of the article is on executing programs on the IPU, so a detailed understanding of the death process is not necessary, although it may help in following the code presented; a brief description of the model is given.
A Brief Introduction to the IPU
The GC200 is made up of 1,472 independent IPU-Tiles, each containing one IPU-Core capable of executing six independent tasks at once, and the chip as a whole holds 900MB of in-processor memory. That is, one GC200 chip can run 8,832 tasks independently without loss of performance. A single chip can be very powerful, but the chips were developed with a focus on scalability and offer near seamless performance when used together. This is thanks to the ultra-low latency IPU-Fabric connecting all processors, which is responsible for the transfer of data between IPU-Cores. The processors are available to rent or buy within PODs, starting with four GC200 chips and increasing to up to 256. The PODs are also designed for scalability, and the same ultra-low latency IPU-Fabric which connects chips within a POD is also used to connect PODs to each other. A POD consists of the GC200 processors together with a CPU, referred to as the host. The host defines a sequence of instructions to be executed on the GC200 processors; these instructions are written using the Poplar C++ library, which was developed by Graphcore explicitly for IPU applications.
When executing a task on an IPU it is best to think of it as a directed graph, in which variables are connected to vertices via edges: the vertices are compute tasks to be executed in parallel and the variables are the data used within those tasks. Each vertex is executed on a single thread on an IPU-Core. To gain the best performance from an IPU we want as many compute tasks as possible executing in parallel at once. A set of compute tasks is referred to as a compute set, both throughout this article and within the Poplar programming model. When a compute set is executed the program cannot move on to the next step until all tasks in the compute set are complete. As such it is best to have tasks of a similar duration, otherwise IPU-Tiles stand idle waiting for a single task to finish.
The IPUs already provide a lot of functionality within artificial intelligence and machine learning, and can be used by popular ML packages such as TensorFlow and PyTorch. These, however, will not be discussed here; instead we will focus on writing and executing bespoke C++ programs for the IPU. Running applications on the IPU requires two parts, both written in C++. The first part is the codelet, which is the compute task associated with a vertex. We will show how to execute one codelet repeatedly, but it is not a difficult extension to use different codelets for different vertices. The second part is the main.cpp script, which is read and executed by the host; this is where the graph mapping variables to vertices is defined. Graph creation and execution is done using the Poplar library and is handled entirely by the host, while the codelet is written in a slightly restricted version of C++.
The Codelet
C++ Restrictions on the Codelet
An IPU-Core can compile and execute code written in C++, but it does not provide the full functionality of the language. The major difference is that the IPU-Core has no heap, which means that dynamic memory allocation is not possible. Although this may seem restrictive to experienced C++ programmers, if you are new to C++ it makes the learning curve a little less steep.
There are some other important limitations to consider when writing C++ code for an IPU-Core. The libraries available are a small subset of those available on a CPU. However, Graphcore have created several libraries which can be used as well; these provide functionality such as random number generation and linear algebra, among much more, and full documentation for using them can be found in the Poplar documentation. It is often the case that problems can be overcome by ‘creative’ programming. For example, there are a limited number of distributions which can be sampled from on an IPU-Core (normal and uniform). The host, however, has full access to all C++ libraries, so samples can be drawn from any desired distribution using a C++ library and then passed to the IPU-Core.
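As a concrete, purely illustrative sketch of this workaround, the snippet below draws one gamma-distributed value per thread on the host using the standard &lt;random&gt; library; the chosen distribution, seed and function name are our own and not part of the Poplar API. The resulting values could then be passed to the vertices using the methods described later in the Passing Parameters section.

#include <random>
#include <vector>

// Host-side sketch: draw one Gamma(2, 1.5) variate per thread.
// These values can later be added to the graph (e.g. with addConstant)
// and connected to each vertex, as described later in this article.
std::vector<float> sample_gamma_on_host(std::size_t totalThreads) {
    std::mt19937 rng(42);                               // fixed seed for reproducibility
    std::gamma_distribution<float> gamma(2.0f, 1.5f);   // not available on the IPU-Core
    std::vector<float> samples(totalThreads);
    for (auto &s : samples) s = gamma(rng);
    return samples;
}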
Another important consideration when writing the codelet is that the compiler must know the size of any array at compile time. This may require writing functions less generally, by explicitly stating the dimension an array is known to have rather than inferring it from another variable. If the size of an array is not known in advance, an easy solution is to define the array to be ‘large enough’ that it is practically impossible it would need to be any larger, and to deal with the unused space in a post-hoc manner, albeit at a cost in memory.
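As a small illustration (our own, not taken from the Poplar documentation), a helper written for a codelet might fix its dimension at compile time rather than accept it as a runtime argument:

// Sketch: the dimension is a compile-time constant rather than a runtime argument.
constexpr unsigned DIM = 3;
inline float dot_fixed(const float (&a)[DIM], const float (&b)[DIM]) {
    float s = 0.0f;
    for (unsigned i = 0; i < DIM; ++i) s += a[i] * b[i];
    return s;
}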
codelet.cpp
The codelet is the script that is executed on a single thread on an IPU-Core. It needs access to all relevant functions, objects, structures etc. used within its execution. Variables can be passed to the vertex from the host so that each thread does not have to run with the same parameters; this will be discussed in the main.cpp section. For now, we will assume we wish to run every thread with the same parameters, so they can be defined within the codelet.
Here, we are going to use the example of a simple death process, purely to illustrate the methods required for executing repeated computations on an IPU. The model simulates the scenario where, in a time step of size $\Delta t$, there is a probability, $p$, that one member of the population dies; otherwise the population remains unchanged. To forward simulate this model we first simulate whether a death occurs within $[0, \Delta t)$ and update the population accordingly, then simulate whether a death occurs within $[\Delta t, 2\Delta t)$ and update the population accordingly, and so on. This is repeated until the population reaches zero, at which point no more deaths can occur.
For the IPU to be able to run the codelet we must define a new class derived from the Poplar class poplar::Vertex; let’s call the class myVertex. All objects relevant to the simulation are defined inside the definition of myVertex, and nothing should be defined outside of it. Importantly, myVertex must contain a function called compute which returns a boolean value (returning true indicates success). This is akin to the main function in standard C++ programming. A codelet example can be seen in Listing 1, where a simple death process is simulated with an initial population of 100 and a probability of death within $[t, t+\Delta t)$ of 0.05.
#include <poplar/Vertex.hpp>
#include <ipu_builtins.h>
class myVertex : public poplar::Vertex
{
public:
float prob_death = 0.05;
unsigned int initial_pop = 100;
float rand_unif01()
{ /*
Returns a random number on the interval [0,1]
of type float.
*/
return float(__builtin_ipu_urand32()) / float(4294967295) ;
}
void sim_death_process(float prob_death, unsigned int init_pop, unsigned int* output_ptr, unsigned int max_length)
{ /*
This function simulates a discrete time death process.
output_ptr: is a pointer to the output array.
prob_death: defines the probability of a death in the chosen time step.
init_pop: is the initial population of the system.
max_length: is the size of the output array, so we never write past its end.
*/
unsigned int current_pop = init_pop ;
*(output_ptr) = current_pop ;
unsigned int t = 1;
while( current_pop > 0 && t < max_length ) {
if( rand_unif01() < prob_death )
{
current_pop-- ;
}
*(output_ptr+t) = current_pop ;
t++ ;
}
}
bool compute()
{
const unsigned int MAX_LENGTH = 5000 ; // large number
unsigned int output[MAX_LENGTH] = {0} ;
unsigned int* output_ptr = &output[0];
sim_death_process( prob_death, initial_pop, output_ptr, MAX_LENGTH );
return true;
}
};
Listing 1: An example codelet.cpp script.
The first things defined within the myVertex class are the parameters of the model: the probability of death within the time step and the initial population. We then define a function to generate a random number from a Uniform(0,1) distribution. This also shows an example of using a function provided by Graphcore for the IPU; the inclusion of the relevant header can be seen at the top of the script via the #include directive. The popLibs libraries do provide functionality to simulate from a uniform distribution, but here we define the uniform draw ourselves as another example of a function definition. The function, rand_unif01, draws a random unsigned int and standardises it by dividing by the maximum value of an unsigned integer, outputting a float in the range $[0,1]$. After this we use the function to simulate the model of interest, sim_death_process.
Lastly we have the compute function, which defines the output array and runs the simulation. Also defined here is the MAX_LENGTH variable: as previously mentioned, the IPU must know the exact amount of memory to allocate at compile time. The MAX_LENGTH variable defines the size of the output array and is chosen such that we would never expect the simulation output to be longer than MAX_LENGTH. Its value can be chosen by running a few simulations off the IPU and seeing what length of output to expect.
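For example, a quick, host-only sketch along the following lines (an illustration of ours, not part of the IPU pipeline itself) could be used to gauge a safe value for MAX_LENGTH before compiling the codelet:

#include <algorithm>
#include <iostream>
#include <random>

// Host-side sketch: run the death process on the CPU a number of times and
// report the longest trajectory, to inform the choice of MAX_LENGTH.
int main() {
    std::mt19937 rng(1);
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    const float prob_death = 0.05f;
    const unsigned initial_pop = 100;
    unsigned longest = 0;
    for (int rep = 0; rep < 1000; ++rep) {
        unsigned pop = initial_pop, steps = 1;   // step 0 stores the initial population
        while (pop > 0) {
            if (unif(rng) < prob_death) --pop;
            ++steps;
        }
        longest = std::max(longest, steps);
    }
    std::cout << "Longest of 1000 trajectories: " << longest << " steps\n";
    return 0;
}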
main.cpp
Unlike the codelet, the main script is executed on the host, which is a CPU, so it has the full functionality of C++, including dynamic memory allocation and all libraries. It is in this script that we build and execute the graph: we define and map variables to vertices, attach codelets to those vertices and retrieve output from the IPU.
Creating a Graph
Firstly, a graph object must be created and attached to an IPU. Listing 2 shows some boilerplate code to create a device and attach numberOfProcs GC200 processor(s) to it. A device is Poplar terminology for the IPU system the graph will be executed on. First a device manager is created, which searches for IPU devices that can be attached to. The final two lines take the target of the device we attached to and create an empty graph on that target. An empty graph is one without any variables or vertices; these will have to be added. The code snippet presented here also has several print statements which inform the user if there has been a problem when trying to attach to an IPU device.
// Create the DeviceManager which is used to discover devices (IPUs)
auto manager = DeviceManager::createDeviceManager();
// Attempt to attach to a numberOfProcs IPU(s):
auto devices = manager.getDevices(poplar::TargetType::IPU, numberOfProcs);
std::cout << "Trying to attach to IPU\n";
auto it = std::find_if(devices.begin(), devices.end(), [](Device &device) {
return device.attach();
});
if (it == devices.end()) {
std::cerr << "Error attaching to device\n";
return -1;
}
auto device = std::move(*it);
std::cout << "Attached to IPU " << device.getId() << std::endl;
// Target device so that any graph will be associated with device
Target target = device.getTarget();
// Create the Graph object
Graph graph(target);
Listing 2: Creating an IPU device in Poplar and defining an empty graph object attached to that device.
Before being able to execute or pass parameters to codelets, we need to add the codelet to the graph using the poplar::Graph::addCodelets function. Adding a codelet allows it to be associated with a vertex and later executed. This can be seen in Listing 3. Also seen in this snippet is the definition of a poplar::program::Sequence object. This is a series of control programs executed in order, used to do things such as copy data and execute compute sets. These sequences can be as complex as the task requires; here we will show a simple example, executing only one compute set. The last line of Listing 3 creates a compute set object and labels it "computeSet". In the next section we will see how to add vertices to this compute set and to the graph itself.
// Add codelets to the graph
graph.addCodelets("myCodelet.cpp");
// Create a control program that is a sequence of steps
poplar::program::Sequence prog;
// Define the compute set to add vertices to later
poplar::ComputeSet computeSet = graph.addComputeSet("computeSet");
Listing 3: Adding a codelet script to the graph, creating a control sequence object and defining an empty compute set.
Mapping Vertices to Tiles
To be able to execute a compute set it is necessary to first map vertices to specific tiles on which they will be executed. This is done explicitly by connecting a vertex on the graph to a tile index. A single GC200 processor has 1,472 tiles, each with a unique index in the range [0, 1,471]. It should not matter which tile a vertex is mapped to, as the tiles are identical and can share information with ease. Suppose we wish to map a vertex of type myVertex to every tile on a GC200. Adding the vertex to a compute set with poplar::Graph::addVertex informs Poplar which vertices can be executed in parallel; the function also returns a vertex reference which is used to map the vertex to a tile. Using this reference, the vertex is then mapped to a specific tile with the poplar::Graph::setTileMapping function, which takes the reference as its first parameter and the tile index as its second. By doing this repeatedly a vertex can be mapped to every tile on the chip, as seen in Listing 4.
const unsigned int numberOfTiles = 1472; // maximum is 1472
for( std::size_t i=0; i<numberOfTiles; ++i){
VertexRef vtx = graph.addVertex(computeSet, "myVertex");
graph.setTileMapping(vtx, i);
}
Listing 4: Mapping a vertex to every tile on a GC200.
However, mapping a single vertex to every tile on a GC200 would only use one sixth of the processor's computing power, as each tile can execute six tasks independently. It would therefore be good to be able to map multiple tasks to a single tile, so that they can be executed on different threads on that tile. This can be done easily by mapping six (the maximum number of threads) vertices to every tile; the IPU-Core will automatically split the vertices between threads. The first two lines inside the for loop of Listing 5 ensure that the tile index does not exceed the number of tiles, 1,472, and that each tile is mapped numberOfThreads times. The snippet maps a single vertex to every thread on the chip; it is also possible to execute more vertices than there are threads by mapping more than six vertices to a single tile.
const unsigned int numberOfTiles = 1472; // maximum is 1472
const unsigned int numberOfThreads = 6; // maximum is 6
const unsigned int totalThreads = numberOfThreads * numberOfTiles ;
for( std::size_t i=0; i<totalThreads; ++i){
int roundCount = i % int( numberOfTiles * numberOfThreads );
int tileInt = std::floor( float(roundCount) / float(numberOfThreads) );
VertexRef vtx = graph.addVertex(computeSet, "myVertex");
graph.setTileMapping(vtx, tileInt);
}
Listing 5: Mapping one vertex to every thread on every tile of one GC200.
In practice there will often be more than one GC200 processor available, as they come in IPU-PODs. We therefore wish to be able to map vertices to all available processors. When using numberOfProcs processors, each tile still has a unique index, now in the range $[0, 1472\times$numberOfProcs$-1]$. Listing 6 shows how to define the tileInt parameter to get the desired mapping. Six vertices are still being mapped to every tile, but now all IPU-Tiles within an IPU-POD4 system are being used.
const unsigned int numberOfTiles = 1472; // maximum is 1472
const unsigned int numberOfThreads = 6; // maximum is 6
const unsigned int numberOfProcs = 4; // maximum depends on POD size
const unsigned int totalThreads = numberOfThreads * numberOfTiles * numberOfProcs ;
for( std::size_t i=0; i<totalThreads; ++i){
int roundCount = i % int( numberOfTiles * numberOfThreads * numberOfProcs);
int tileInt = std::floor( float(roundCount) / float(numberOfThreads) );
VertexRef vtx = graph.addVertex(computeSet, "myVertex");
graph.setTileMapping(vtx, tileInt);
}
Listing 6: Mapping one vertex to every thread on every tile across multiple GC200 chips.
Passing Parameters
Being able to pass parameters to vertices allows calculations to be executed for different data or parameter values. It also allows, when one task depends on another, variables calculated in one compute set to be passed forward to the next compute set. Passing parameters to a vertex is very similar to mapping a vertex to a tile, except that here data is mapped to a tile and then connected to a variable within a codelet.
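As a rough sketch of the second case (the compute-set names csA, csB, "stageA" and "stageB" are hypothetical, and the vertex and tensor wiring is omitted), two compute sets added to the same control sequence are executed in order, so a tensor written by vertices in the first can be read by vertices in the second:

// Sketch: two compute sets executed one after the other in the control sequence.
poplar::ComputeSet csA = graph.addComputeSet("stageA");
poplar::ComputeSet csB = graph.addComputeSet("stageB");
// ... add vertices to csA and csB, and connect a shared tensor to both ...
prog.add(poplar::program::Execute(csA));
prog.add(poplar::program::Execute(csB));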
When passing a variable to a codelet, the variable type inside the vertex must reflect the fact that it will be passed in from the host. This is done simply by declaring the object with type poplar::Input<T>. Listing 7 shows an example of two variables to be mapped: an integer scalar and a vector of floating-point values.
class myVertex : public poplar::Vertex
{
public:
poplar::Input<int> myScalar ;
poplar::Input<poplar::Vector<float>> myVector ;
// ... the rest of the vertex object
};
Listing 7: Declaring that an object is to be passed to a codelet.
Scalars
A scalar is a single value (a single number). Passing scalars is relatively easy; however, the number of parameters usually increases with model complexity, so this is only practical for simple models. For larger models it is often more convenient to pass a vector of parameters, which may also be preferred if the variable is naturally described as a vector. The method shown here follows most of the tutorials on the Graphcore website; these can be found here for further information.
We present a general case where a different value can be passed to each tile. First, we define a one-dimensional array to store the values being passed; this is done in standard C++. Once this is created we add a poplar::Tensor to the graph, as only the poplar::Tensor type can be used when adding variables to a graph. The poplar::Tensor object is very similar to the C++ std::vector object. The documentation, covering all available functions and operators, can be found here.
Listing 8 shows the creation of a C++ array, myScalars, which is then added to the graph as a poplar::Tensor using the poplar::Graph::addConstant function. The function also informs Poplar of the type of the elements within the tensor as well as its size and shape. Here, the tensor is declared with elements of the poplar type FLOAT and is a one-dimensional tensor of length totalThreads.
float myScalars[totalThreads];
for( std::size_t i=0; i<totalThreads; ++i ){
myScalars[i] = i*0.001 ; // or something useful
}
poplar::Tensor myScalars_tensor = graph.addConstant<float>(FLOAT, {totalThreads}, myScalars);
Listing 8: Creating a Tensor
to be added to the graph as a constant.
Now, we need to define which element of the tensor is mapped to which tile. We do this using the poplar::Graph::setTileMapping function, whose second argument is the tile index we wish to map to. The last thing to do is to inform Poplar which variable inside the codelet the tensor element is connected to. A variable on the graph can ‘connect’ to a parameter in the codelet via the poplar::Graph::connect function. These steps can all be seen in Listing 9, where the calculation of tileInt has been omitted. Note that the mapping of the vertices is included in this snippet, as a vertex must be mapped before a variable can be connected to a parameter within it.
for( std::size_t i=0; i<totalThreads; ++i){
// ... Calculate tileInt
graph.setTileMapping(myScalars_tensor[i], tileInt);
VertexRef vtx = graph.addVertex(computeSet, "myVertex");
graph.setTileMapping(vtx, tileInt);
graph.connect(vtx["myScalar"], myScalars_tensor[i]);
}
Listing 9: Mapping a scalar to every thread to be used in a codelet.
This method is simple to implement but if the codelet requires vector or matrix input then each element would have to be mapped separately and the vector/matrix reconstructed within the codelet. This quickly becomes tedious to write and not easy to read.
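To see why, consider a (hypothetical) three-element parameter vector passed this way: each element needs its own scalar field in the codelet and its own connect call, after which the vector has to be reassembled inside the codelet. The tensor myParams_tensor and the field names param0, param1 and param2 below are our own illustration.

// Sketch: per-element mapping of a three-element parameter vector to one vertex.
graph.setTileMapping(myParams_tensor[3*i + 0], tileInt);
graph.setTileMapping(myParams_tensor[3*i + 1], tileInt);
graph.setTileMapping(myParams_tensor[3*i + 2], tileInt);
graph.connect(vtx["param0"], myParams_tensor[3*i + 0]);
graph.connect(vtx["param1"], myParams_tensor[3*i + 1]);
graph.connect(vtx["param2"], myParams_tensor[3*i + 2]);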
Vectors (1-Dimensional Arrays)
Here, we discuss how to stream a vector to tiles on the IPU: the vector is created and stored on the host before being passed to each tile via a data stream. This is not the only way to pass a vector to a vertex, but it is useful and it also illustrates another piece of functionality of an IPU-POD. An alternative is to store tensors on the tiles themselves, as part of the graph, rather than on the host; a tensor can be stored on a single tile or, if it is very large, spread across many. This method will not be covered here, although it is similar to the ones that have been covered. This short video tutorial gives a brief example of how this, along with graph creation and execution, is done.
Here we will show how to stream a single vector to every vertex, although this method can be extended to pass different vectors to each vertex. Similar to before, we start by defining a vector to be streamed and adding it to the graph. However, when streaming we use the poplar::Graph::addVariable function rather than addConstant. The three parameters of the function used here are the type of the elements which fill the poplar::Tensor, the size and shape of the tensor and, finally, the label. Although we can define a multidimensional tensor and add it to the graph, only a vector can be streamed to an IPU-Tile. This may seem restrictive but it is not a difficult problem to overcome: if desired, a multidimensional array can be flattened into a one-dimensional tensor and then re-shaped into the multidimensional array on the IPU-Tile. C++ stores multidimensional arrays in contiguous blocks, as though they were flattened, anyway.
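For instance, inside the codelet the (i, j) element of a matrix that was flattened row-major on the host can be recovered with simple index arithmetic; the helper below is our own sketch, with the number of columns assumed known at compile time.

// Sketch: recover element (i, j) of a matrix flattened row-major into a 1-D array.
constexpr unsigned N_COLS = 3;   // known at compile time, as the IPU-Core requires
inline float matrix_element(const float *flat, unsigned i, unsigned j) {
    return flat[i * N_COLS + j];
}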
const unsigned int myVector_size = 10;
float myVector[myVector_size];
for( std::size_t i=0; i<myVector_size; ++i )
myVector[i] = i*0.0001; // or something useful
poplar::Tensor myVector_tensor = graph.addVariable(FLOAT, {myVector_size}, "myVector");
Listing 10: Defining a Tensor and adding it to the graph as a variable.
After the definition we need, as before, to create the mapping to a specific tile and to a specific codelet vertex on that tile. This is done in exactly the same way as before.
// Map tensors to tiles
for( std::size_t i=0; i<totalThreads; ++i ){
// ... Calculate tileInt
graph.setTileMapping(myVector_tensor, tileInt);
VertexRef vtx = graph.addVertex(computeSet, "myVertex");
graph.setTileMapping(vtx, tileInt);
graph.connect(vtx["myVector"], myVector_tensor);
}
auto myVector_stream = graph.addHostToDeviceFIFO("write_myVector", FLOAT, myVector_size);
Listing 11: Mapping a vector to a vertex through stream.
To be able to stream information from host to tile the stream object
must be created, using the poplar::Graph::addHostToDeviceFIFO
function. The first argument of this function is the stream label then
the type of the elements being streamed and the number of elements.
There is one last thing that must be done for the tensor to be streamed
from host to tile.
Before this is done, we first introduce the poplar::Engine object. The engine represents the whole graph and execution sequence on a given device. Its creation and eventual execution will be covered later, but for now assume we have a poplar::Engine object called engine. The function poplar::Engine::connectStream is used to let the engine know there is a stream that needs to be serviced. Listing 12 shows an example of this. The first parameter is the stream label, as defined in Listing 11, and the other two parameters are the memory addresses of the first element of the array and of one past its last element.
// Attach the data stream to the engine so that the stream is executed
engine.connectStream("write_myVector", &myVector[0], &myVector[0]+myVector_size);
Listing 12: Adding the stream to be executed by the Poplar engine.
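One detail worth noting: in our experience with the Poplar tutorials, the control sequence also needs an explicit copy from the stream into the tensor for the transfer to actually take place when the sequence is run. A minimal sketch of that step, reusing the objects defined in Listings 10 and 11, would be:

// Sketch: copy the host-to-device stream into the tensor as part of the control program,
// before the compute set that reads myVector_tensor is executed.
prog.add(poplar::program::Copy(myVector_stream, myVector_tensor));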
Retrieving Output
Retrieving output from the codelet is very similar to the process of streaming vectors to the codelet. First we have to inform the codelet that we are expecting an output array to be streamed from the vertex; we do this using the type poplar::Output<T>, as shown in Listing 13. Doing this allows us to write the output directly into the poplar::Output object, so we do not have to define an output array within the compute function as was done in Listing 1.
class myVertex : public poplar::Vertex
{
public:
poplar::Output<poplar::Vector<float>> out;
// ...
};
Listing 13: Defining an object to be output from a vertex.
Now we must define an output array and map it to the tiles and the vertex, similarly to when streaming a vector to the vertex. Here, however, there is a convenient shortcut for streaming output from tiles to host: the function poplar::Graph::createHostRead creates and connects the streaming object for us. Listing 14 shows how to implement this. The variable output_size is the known size of the output vector; for the death process example this would be MAX_LENGTH.
poplar::Tensor output = graph.addVariable(FLOAT, {totalThreads, output_size}, "output");
for( std::size_t i = 0; i < totalThreads; ++i ){
// ... Calculate tileInt
graph.setTileMapping(output[i], tileInt);
VertexRef vtx = graph.addVertex(computeSet, "myVertex");
graph.setTileMapping(vtx, tileInt);
graph.connect(vtx["out"], output[i]);
}
graph.createHostRead("output-read", output);
Listing 14: Mapping an output vector from a tensor on the host to a vertex and creating a stream object using createHostRead.
A small thing to note is that the output tensor is multidimensional but only a single dimension is streamed from the codelet. As mentioned, we cannot stream multidimensional tensors directly to or from the codelet, but here is an example of how we can stream sections (slices) of a tensor from the vertex as long as the slice itself only has one dimension. This same method can be used when passing variables to a vertex.
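For example, a sketch of using the same slicing idea on the input side (our own illustration, reusing names from earlier listings; the tensor name params is hypothetical): a two-dimensional tensor holds one parameter vector per thread, and row i is connected to the vertex on thread i, so each vertex receives a different vector.

// Sketch: one row of a 2-D tensor per vertex, so each vertex gets its own vector.
poplar::Tensor params = graph.addVariable(FLOAT, {totalThreads, myVector_size}, "params");
for( std::size_t i=0; i<totalThreads; ++i ){
    // ... Calculate tileInt
    graph.setTileMapping(params[i], tileInt);
    VertexRef vtx = graph.addVertex(computeSet, "myVertex");
    graph.setTileMapping(vtx, tileInt);
    graph.connect(vtx["myVector"], params[i]);
}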
Once the graph has been executed (this will be discussed in the next section) it is likely we wish to read the output of the computations. To do this there is another helpful function, poplar::Engine::readTensor, whose first parameter is the name of the streaming object defined in the createHostRead function, with the following two parameters being the memory address where the output should start being written on the host and the address one past the end of that buffer. Listing 15 gives an example of this. The output does not have to be read into a standard C++ array; as we are reading the output on the host we have access to all C++ libraries, so the output can be read into other formats if desired, e.g. std::vector.
float output_array[totalThreads * output_size] ;
engine.readTensor("output-read", &output_array[0], &output_array[0] + totalThreads*output_size) ;
Listing 15: Reading a tensor from the IPU onto the host.
Graph Execution
Lastly, we need to execute the graph and compute set. Before this, however, we must let Poplar know in what order to execute the control programs; for this we add to prog, the poplar::program::Sequence object defined in Listing 3. For a simple graph with no data streams and one compute set this is relatively easy: we call the sequence's add function with, as its parameter, the compute set wrapped in a poplar::program::Execute program.
// Adding the execution of the compute set to the sequence of programs
prog.add(Execute(computeSet));
Listing 17: Adding an execute compute set command to the control sequence.
Hopefully, the last thing to do is to execute the entire graph in the order defined by prog. This requires the poplar::Engine object. The engine is the combination of everything we have done so far: the device, the graph, the sequence of control programs and the data streams. A poplar::Engine object is created by passing a poplar::Graph object and a poplar::program::Sequence object to the Engine constructor. The engine is then loaded onto a device and run, as seen in Listing 18. For illustrative purposes a stream connection has also been added to the engine in this snippet, although it is commented out.
// Create the engine
Engine engine(graph, prog);
engine.load(device);
// Add the host stream to the execution pipe
// engine.connectStream( ... );
engine.run();
Listing 18: Running the engine - executing the control sequence and anything else added to the engine.
Complete Example
To bring together what has been discussed in the previous sections, we will construct the code to pass two parameters to the death process codelet used earlier. First we must define the parameters in the codelet which are to be passed to the vertex; for this, we change the types of the prob_death and initial_pop variables. The codelet has also been updated to write output directly into the out vector, and comments have been removed.
#include <poplar/Vertex.hpp>
#include <ipu_builtins.h>
class myVertex : public poplar::Vertex
{
public:
poplar::Input<float> prob_death;
poplar::Input<int> initial_pop;
poplar::Output<poplar::Vector<int>> out;
unsigned int MAX_LENGTH = 5000;
float runif_01(){
return float(__builtin_ipu_urand32()) / float(4294967295) ;
}
void death_process(float prob_death, int init_pop){
int current_pop = init_pop ;
out[0] = current_pop ;
int t = 1;
while( current_pop > 0 && t < MAX_LENGTH ) {
if( runif_01() < prob_death ){
current_pop-- ;
}
out[t] = current_pop ;
t++ ;
}
}
bool compute () {
death_process(prob_death, initial_pop);
return true;
}
};
Listing 19: Codelet example.
Then, in the main.cpp file, we can define the variables, create the mappings of vertices and variables to specific tiles and connect those variables to the ones in the codelet. The mappings for all vertices and variables can be done within a single for loop, as was done in Listing 9. This example maps a single vertex to every thread on every tile of an IPU-POD4. The graph and control sequence are then executed, before the output is streamed back to the host and saved in a std::vector called cpu_vector.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>
#include <poplar/DeviceManager.hpp>
#include <poplar/Engine.hpp>
#include <poplar/Graph.hpp>
#include <poplar/Program.hpp>
using namespace poplar;
using namespace poplar::program;
int main () {
unsigned int numberOfThreads = 6;
unsigned int numberOfTiles = 1472;
unsigned int numberOfProcs = 4;
unsigned int totalThreads = numberOfThreads * numberOfTiles * numberOfProcs ;
// Create the DeviceManager which is used to discover devices (IPUs)
auto manager = DeviceManager::createDeviceManager();
// Attempt to attach to numberOfProcs IPU(s):
auto devices = manager.getDevices(poplar::TargetType::IPU, numberOfProcs);
std::cout << "Trying to attach to IPU\n";
auto it = std::find_if(devices.begin(), devices.end(), [](Device &device) {
return device.attach();
});
if (it == devices.end()) {
std::cerr << "Error attaching to device\n";
return -1;
}
auto device = std::move(*it);
std::cout << "Attached to IPU " << device.getId() << std::endl;
// target device so that any graph will be associated with device
Target target = device.getTarget();
// Create the Graph object
Graph graph(target);
// Add codelets to the graph
graph.addCodelets("myCodelet_simple.cpp");
// Create a control program that is a sequence of steps
poplar::program::Sequence prog;
const unsigned int MAX_LENGTH = 5000;
float prob_death[totalThreads];
int init_pop[totalThreads];
for( int i=0; i<totalThreads; ++i ){
prob_death[i] = 0.1 ;
init_pop[i] = 100 ;
}
poplar::Tensor prob = graph.addConstant<float>(FLOAT, {totalThreads}, prob_death) ;
poplar::Tensor init = graph.addConstant<int>(INT, {totalThreads}, init_pop) ;
poplar::Tensor output = graph.addVariable(INT, {totalThreads, MAX_LENGTH}, "output") ;
ComputeSet computeSet = graph.addComputeSet("computeSet");
// Map tensors to tiles
for(int i=0; i<totalThreads; ++i){
int roundCount = i % int( numberOfTiles * numberOfThreads * numberOfProcs);
int tileInt = std::floor( float(roundCount) / float(numberOfThreads) );
graph.setTileMapping(prob[i], tileInt);
graph.setTileMapping(init[i], tileInt);
graph.setTileMapping(output[i], tileInt);
// Add the codelet vertex to the graph
VertexRef vtx = graph.addVertex(computeSet, "myVertex");
// map the vertex to every tile
graph.setTileMapping(vtx, tileInt);
// Connect the parameter names in the codelet to the Tensors defined here
graph.connect(vtx["prob_death"], prob[i]);
graph.connect(vtx["initial_pop"], init[i]);
graph.connect(vtx["out"], output[i]);
}
graph.createHostRead("output-read", output);
prog.add(Execute(computeSet));
prog.add(PrintTensor("output", output));
// Create the engine
Engine engine(graph, prog);
engine.load(device);
engine.run();
// Read the output tensor onto the host as a std::vector object
std::vector<int> cpu_vector( totalThreads * MAX_LENGTH ) ;
engine.readTensor("output-read", cpu_vector.data(), cpu_vector.data()+cpu_vector.size());
return 0;
}
Listing 20: Main script example.
Conclusion
Although the development time needed to execute tasks on an IPU can be quite significant, the IPU can offer substantial computational savings where repeated independent tasks need to be executed, and hopefully this introduction will help with some of the initial hurdles. Graphcore have created many tutorials, which are available on their GitHub, although many of these are directed at using PyTorch and TensorFlow, whereas the pipeline described here allows the user to write their own bespoke codelets to be executed en masse.
The Poplar programming model and the IPU offer far more functionality than is presented here, and this article is only a small introduction to their use. For examples of IPU uses see the Graphcore blogs, and for the full documentation of the Poplar libraries see here.