StarPU Handbook
11. Tasks In StarPU

11.1 Task Granularity

Similar to other runtimes, StarPU introduces some overhead in managing tasks. This overhead, while not always negligible, is mitigated by its intelligent scheduling and data management capabilities. The typical order of magnitude for this overhead is a few microseconds, which is notably smaller than the inherent CUDA overhead. To ensure that this overhead remains insignificant, the work assigned to a task should be substantial enough.

Tasks should ideally be long enough to offset this overhead. It is advisable to check the offline performance feedback, which reports task lengths. Monitoring task lengths becomes crucial if you are encountering suboptimal performance.

To gauge the scalability potential as a function of task size, you can run the tests/microbenchs/tasks_size_overhead.sh script. It plots the speedup achievable with independent tasks of very small sizes.

This benchmark is installed in $STARPU_PATH/lib/starpu/examples/. It gives a glimpse into how long a task should be (in µs) for StarPU overhead to be low enough to keep efficiency. The script generates a plot illustrating the speedup trends for tasks of different sizes, correlated with the number of CPUs in use.

For example, in the figure below, for 128 µs tasks (the red line), StarPU overhead is low enough to guarantee a good speedup if the number of CPUs is not more than 36. But with the same number of CPUs, 64 µs tasks (the black line) cannot have a correct speedup. The number of CPUs must be decreased to about 17 in order to keep efficiency.
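
As a rough way to reason about numbers like these, one can assume that each task costs a fixed amount of runtime work that is serialized across workers. Under that assumption, tasks of a given length can only keep a bounded number of workers busy. The helper below is a back-of-envelope sketch of that model, not something StarPU provides or measures: the function name max_busy_workers() and the 3.5 µs overhead used in the note after it are hypothetical, picked only because "a few microseconds" is the order of magnitude given above.

```c
#include <assert.h>

/* Hypothetical helper, not part of StarPU.
 * Assumption: every task costs `overhead_us` microseconds of runtime work
 * that cannot overlap with the overhead of other tasks. Tasks lasting
 * `task_us` microseconds can then keep at most task_us / overhead_us
 * workers busy; beyond that, workers wait for tasks to be handed out. */
static int max_busy_workers(double task_us, double overhead_us)
{
    assert(task_us > 0.0 && overhead_us > 0.0);
    return (int)(task_us / overhead_us);
}
```

With an assumed overhead of 3.5 µs, max_busy_workers(128.0, 3.5) returns 36 and max_busy_workers(64.0, 3.5) returns 18, which is in the same range as the limits quoted for the 128 µs and 64 µs curves. The benchmark script remains the reference for a given machine and scheduler.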

To determine the task size your application is using, it is possible to use starpu_fxt_data_trace as explained in Data trace and tasks length.

The selection of a scheduler in StarPU also plays a significant role. Different schedulers have varying impacts on the overall execution. For example, the dmda scheduler may require additional time to make decisions, while the eager scheduler tends to be more immediate in its decisions.

To assess the impact of scheduler choice on your target machine, you can once again utilize the tasks_size_overhead.sh script. This script provides valuable insights into how different schedulers affect performance in conjunction with task sizes.

11.2 Task Submission

To enable StarPU to perform online optimizations effectively, it is recommended to submit tasks asynchronously whenever possible. The goal is to maximize the level of asynchronous submission, allowing StarPU to have more flexibility in optimizing the scheduling process. Ideally, all tasks should be submitted asynchronously, and the use of functions like starpu_task_wait_for_all() or starpu_data_unregister() should be limited to waiting for task completion.

StarPU will then be able to rework the whole schedule, overlap computation with communication, manage accelerator local memory usage, etc. A simple example is in the file examples/basic_examples/variable.c

11.3 Task Priorities

StarPU's default behavior considers tasks in the order they are submitted by the application. However, in scenarios where the application programmer possesses knowledge about certain tasks that should take priority due to their impact on performance (such as tasks whose output is crucial for subsequent tasks), the starpu_task::priority field can be utilized to convey this information to StarPU's scheduling process.

An example is provided in the application examples/heat/dw_factolu_tag.c.

11.4 Setting Many Data Handles For a Task

The maximum number of data that a task can manage is fixed by the macro STARPU_NMAXBUFS. This macro has a default value which can be customized through the configure option --enable-maxbuffers.

However, if you have specific cases where you need tasks to manage more data than the maximum allowed, you can use the field starpu_task::dyn_handles when defining a task, along with the field starpu_codelet::dyn_modes when defining the corresponding codelet.

This dynamic handle mechanism enables tasks to handle additional data beyond the usual limit imposed by STARPU_NMAXBUFS.

enum starpu_data_access_mode modes[STARPU_NMAXBUFS+1] =
{
    STARPU_R, STARPU_R, /* ... one access mode per handle ... */
};

struct starpu_codelet dummy_big_cl =
{
    .cuda_funcs = { dummy_big_kernel },
    .opencl_funcs = { dummy_big_kernel },
    .cpu_funcs = { dummy_big_kernel },
    .cpu_funcs_name = { "dummy_big_kernel" },
    .nbuffers = STARPU_NMAXBUFS+1,
    .dyn_modes = modes
};

struct starpu_task *task = starpu_task_create();
task->cl = &dummy_big_cl;
/* All handles go into the dynamically allocated array */
task->dyn_handles = malloc(task->cl->nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<task->cl->nbuffers ; i++)
{
    task->dyn_handles[i] = handle;
}
starpu_task_submit(task);

The same task can also be submitted with starpu_task_insert(), passing the handles through STARPU_DATA_ARRAY:

starpu_data_handle_t *handles = malloc(dummy_big_cl.nbuffers * sizeof(starpu_data_handle_t));
for(i=0 ; i<dummy_big_cl.nbuffers ; i++)
{
    handles[i] = handle;
}
starpu_task_insert(&dummy_big_cl,
                   STARPU_VALUE, &dummy_big_cl.nbuffers, sizeof(dummy_big_cl.nbuffers),
                   STARPU_DATA_ARRAY, handles, dummy_big_cl.nbuffers,
                   0);

The complete code for this example is available in the file examples/basic_examples/dynamic_handles.c.

11.5 Setting a Variable Number Of Data Handles For a Task

Normally, the number of data handles given to a task is set with starpu_codelet::nbuffers. This field can however be set to STARPU_VARIABLE_NBUFFERS, in which case starpu_task::nbuffers must be set, and starpu_task::modes (or starpu_task::dyn_modes, see Setting Many Data Handles For a Task) should be used to specify the modes for the handles. Examples in examples/basic_examples/dynamic_handles.c show how to implement it.

11.6 Insert Task Utility

StarPU provides the wrapper function starpu_task_insert() to ease the creation and submission of tasks.

Here is the implementation of a codelet:

void func_cpu(void *descr[], void *_args)
{
    int *x0 = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
    float *x1 = (float *)STARPU_VARIABLE_GET_PTR(descr[1]);
    int ifactor;
    float ffactor;

    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
    *x0 = *x0 * ifactor;
    *x1 = *x1 * ffactor;
}

struct starpu_codelet mycodelet =
{
    .cpu_funcs = { func_cpu },
    .cpu_funcs_name = { "func_cpu" },
    .nbuffers = 2,
    .modes = { STARPU_RW, STARPU_RW }
};

And the call to starpu_task_insert():

starpu_task_insert(&mycodelet,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   STARPU_RW, data_handles[0],
                   STARPU_RW, data_handles[1],
                   0);

The call to starpu_task_insert() is equivalent to the following code:

struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
/* Pack the two scalar arguments into a single buffer */
void *arg_buffer;
size_t arg_buffer_size;
starpu_codelet_pack_args(&arg_buffer, &arg_buffer_size,
                         STARPU_VALUE, &ifactor, sizeof(ifactor),
                         STARPU_VALUE, &ffactor, sizeof(ffactor),
                         0);
task->cl_arg = arg_buffer;
task->cl_arg_size = arg_buffer_size;
int ret = starpu_task_submit(task);

The example file tests/main/insert_task_value.c uses both ways to create and submit tasks.

Instead of calling starpu_codelet_pack_args(), one can also call starpu_codelet_pack_arg_init(), then starpu_codelet_pack_arg() for each argument, then starpu_codelet_pack_arg_fini() as follows:

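struct starpu_task *task = starpu_task_create();
task->cl = &mycodelet;
task->handles[0] = data_handles[0];
task->handles[1] = data_handles[1];
struct starpu_codelet_pack_arg_data state;
starpu_codelet_pack_arg_init(&state);
starpu_codelet_pack_arg(&state, &ifactor, sizeof(ifactor));
starpu_codelet_pack_arg(&state, &ffactor, sizeof(ffactor));
starpu_codelet_pack_arg_fini(&state, &task->cl_arg, &task->cl_arg_size);
int ret = starpu_task_submit(task);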

A full code example is in file tests/main/pack.c.

Here is a similar call using STARPU_DATA_ARRAY:

starpu_task_insert(&mycodelet,
                   STARPU_DATA_ARRAY, data_handles, 2,
                   STARPU_VALUE, &ifactor, sizeof(ifactor),
                   STARPU_VALUE, &ffactor, sizeof(ffactor),
                   0);

If some part of the task insertion depends on the value of some computation, the macro STARPU_DATA_ACQUIRE_CB can be very convenient. For instance, assuming that the index variable i was registered as handle i_handle:

/* Compute which portion we will work on, e.g. pivot */
starpu_task_insert(&which_index, STARPU_W, i_handle, 0);

/* And submit the corresponding task once i can be read */
STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));

The macro STARPU_DATA_ACQUIRE_CB submits an asynchronous request to acquire the data i for the main application, and executes the code given as its third parameter once the data is acquired. In other words, as soon as the value of i computed by the codelet which_index can be read, the code passed as the third parameter of STARPU_DATA_ACQUIRE_CB is executed, and it is allowed to read i, e.g. to use it as an index. Note that this macro is only available when StarPU is compiled with gcc. The example file tests/datawizard/acquire_cb_insert.c uses this macro.

StarPU also provides a utility function starpu_codelet_unpack_args() to retrieve the STARPU_VALUE arguments passed to the task. There are several ways of calling starpu_codelet_unpack_args(). The full code examples are available in the file tests/main/insert_task_value.c.

/* Variant 1: unpack all arguments in a single call */
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;

    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}

/* Variant 2: unpack only the first argument (the 0 ends the list), then all of them */
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;

    starpu_codelet_unpack_args(_args, &ifactor, 0);
    starpu_codelet_unpack_args(_args, &ifactor, &ffactor);
}

/* Variant 3: unpack the first argument, copy what is left into buffer, then unpack from buffer */
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;
    char buffer[100];

    starpu_codelet_unpack_args_and_copyleft(_args, buffer, 100, &ifactor, 0);
    starpu_codelet_unpack_args(buffer, &ffactor);
}

Instead of calling starpu_codelet_unpack_args(), one can also call starpu_codelet_unpack_arg_init(), then starpu_codelet_unpack_arg() or starpu_codelet_dup_arg() or starpu_codelet_pick_arg() for each argument, then starpu_codelet_unpack_arg_fini() as follows:

/* Variant 1: starpu_codelet_unpack_arg() copies each argument into a variable of known size */
void func_cpu(void *descr[], void *_args)
{
    int ifactor;
    float ffactor;
    struct starpu_codelet_pack_arg_data state;
    size_t size = sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float);

    starpu_codelet_unpack_arg_init(&state, _args, size);
    starpu_codelet_unpack_arg(&state, &ifactor, sizeof(ifactor));
    starpu_codelet_unpack_arg(&state, &ffactor, sizeof(ffactor));
    starpu_codelet_unpack_arg_fini(&state);
}

/* Variant 2: starpu_codelet_dup_arg() returns a pointer to each argument and its size */
void func_cpu(void *descr[], void *_args)
{
    int *ifactor;
    float *ffactor;
    struct starpu_codelet_pack_arg_data state;
    size_t size;
    size_t psize = sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float);

    starpu_codelet_unpack_arg_init(&state, _args, psize);
    starpu_codelet_dup_arg(&state, (void**)&ifactor, &size);
    assert(size == sizeof(*ifactor));
    starpu_codelet_dup_arg(&state, (void**)&ffactor, &size);
    assert(size == sizeof(*ffactor));
    starpu_codelet_unpack_arg_fini(&state);
}

/* Variant 3: same calling pattern with starpu_codelet_pick_arg() */
void func_cpu(void *descr[], void *_args)
{
    int *ifactor;
    float *ffactor;
    struct starpu_codelet_pack_arg_data state;
    size_t size;
    size_t psize = sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float);

    starpu_codelet_unpack_arg_init(&state, _args, psize);
    starpu_codelet_pick_arg(&state, (void**)&ifactor, &size);
    assert(size == sizeof(*ifactor));
    starpu_codelet_pick_arg(&state, (void**)&ffactor, &size);
    assert(size == sizeof(*ffactor));
    starpu_codelet_unpack_arg_fini(&state);
}

During unpacking, one can also call starpu_codelet_unpack_discard_arg() to skip an argument without saving it.

A full code example is in file tests/main/pack.c.
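
The size expression passed to starpu_codelet_unpack_arg_init() in the examples above, sizeof(int) + 2*sizeof(size_t) + sizeof(int) + sizeof(float), suggests a buffer made of an argument count followed by one (size, value) pair per argument. The sketch below mimics that layout in plain C to make the arithmetic concrete. It is an illustration under that assumption, not StarPU's implementation: toy_put(), toy_pack_two() and toy_get() are hypothetical names that do not exist in StarPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: append one (size, value) pair at offset `off`
 * and return the offset just past it. */
static size_t toy_put(char *buf, size_t off, const void *ptr, size_t size)
{
    memcpy(buf + off, &size, sizeof(size));
    memcpy(buf + off + sizeof(size), ptr, size);
    return off + sizeof(size) + size;
}

/* Hypothetical helper: pack an int and a float as
 * [int nargs][size_t][int value][size_t][float value].
 * Returns a malloc'ed buffer (NULL on failure) and stores its length
 * in *total. The caller frees the buffer. */
static char *toy_pack_two(int ifactor, float ffactor, size_t *total)
{
    int nargs = 2;
    size_t off = sizeof(nargs);
    char *buf = malloc(sizeof(int) + 2 * sizeof(size_t) + sizeof(int) + sizeof(float));

    if (buf == NULL)
        return NULL;
    memcpy(buf, &nargs, sizeof(nargs));
    off = toy_put(buf, off, &ifactor, sizeof(ifactor));
    off = toy_put(buf, off, &ffactor, sizeof(ffactor));
    *total = off;
    return buf;
}

/* Hypothetical helper: copy the pair stored at *off into dst, which can
 * hold dst_size bytes, advance *off past it, and return the stored size. */
static size_t toy_get(const char *buf, size_t *off, void *dst, size_t dst_size)
{
    size_t size;

    memcpy(&size, buf + *off, sizeof(size));
    assert(size <= dst_size);
    memcpy(dst, buf + *off + sizeof(size), size);
    *off += sizeof(size) + size;
    return size;
}
```

Packing an int and a float this way yields a buffer whose length equals the size expression above. Reading starts at offset sizeof(int), just past the argument count, and each toy_get() call returns the stored size, which is the kind of check the assert(size == ...) lines perform in the examples.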

11.7 Other Task Utility Functions

Here is a list of other functions to help with task management:

  • The function starpu_task_dup() creates a duplicate of an existing task. The new task is identical to the original task in terms of its parameters, dependencies, and execution characteristics.
  • The function starpu_task_set() is used to set the parameters of a task before it is executed, while starpu_task_build() is used to create a task with the specified parameters.

StarPU provides several functions to help insert data into a task. The function starpu_task_insert_data_make_room() allocates the memory needed for the data structure that holds the data inserted into a task. It is called before inserting any data handle into a task, and ensures that enough memory is available to store the data. Once memory is allocated, data handles can be inserted into the task using the following functions:

  • starpu_task_insert_data_process_arg() processes a scalar argument of a task and inserts it into the task's data structure. This function also performs any necessary data allocation and transfer operations.
  • starpu_task_insert_data_process_array_arg() processes an array argument of a task and inserts it into the task's data structure. This function handles the allocation and transfer of the array data, as well as setting up the appropriate metadata to describe the array.
  • starpu_task_insert_data_process_mode_array_arg() processes a mode array argument of a task and inserts it into the task's data structure. This function handles the allocation and transfer of the mode array data, as well as setting up the appropriate metadata to describe the mode array. Additionally, this function also computes the necessary sizes and strides for the data associated with the mode array argument.