Wednesday, January 10, 2007

Moving to Cell SDK 2.0

If you have enjoyed hacking with the Cell SDK 1.1, it is perhaps time to port your code to SDK 2.0 and experiment with its new features. In this and upcoming posts I will cover some of what is new in SDK 2.0, as well as some of the issues you will face when porting code from SDK 1.1 to SDK 2.0.

The first thing worth mentioning is that SDK 2.0 has upgraded to GCC 4.1.1 and XLC 8.1. These compilers should provide better performance than those shipped with SDK 1.1. My first few measurements with GCC confirm this.

Beware that the directory structure and the include paths have been restructured and cleaned up. If you have Makefiles or configuration files that rely on the SDK 1.1 layout, you will have to work out some changes.

Another difference in SDK 2.0 is that it is no longer necessary to explicitly import libc.a when compiling SPU code. You will therefore have to remove any $(SDKLIB_spu)/libc.a entry from the IMPORT variable in your makefile.
Beware that in SDK 2.0 you always have to include the header cbe_mfc.h before the header math.h. If you do it the other way around, you will see compilation errors on the function __fabs.

Wednesday, November 15, 2006

PPE-SPE Synchronization

Cell provides mailboxes to synchronize the SPEs and the PPE conveniently. However, there is something that programmers who care about performance should take into account. When an SPE uses mailboxes to communicate with the PPE, the latter performs a DMA transaction for each attempted read of the mailbox. The end result is a lot of bus traffic, which might adversely impact performance. Another approach, less user friendly but better for performance, is to rely on spinlocks and DMA transfers. The basic idea is to have the PPE and the SPE coordinate by writing to agreed areas of main memory and local store respectively. The code below shows how you might achieve this. The sample code also lets you measure the performance of this spinlock synchronization, which you can easily compare with mailbox-based synchronization.

---------- PPE Code ------------
#include <stdio.h>
#include <stdlib.h>

#include <libspe.h>
#include <cbe_mfc.h>

// -- Common Include --
#include "cbench/common.h"

extern spe_program_handle_t cbench_spinlock_spu;

volatile unsigned long long spinlock     __attribute__((aligned(128)));
unsigned long long          spinlock_spu __attribute__((aligned(128)));

int main(int argc, char* argv[])
{
  speid_t speid;
  int     status;
  int     tagid = 1;

  unsigned int run;
  unsigned int i;

  char TEST_ID[32] = "SPINLOCK:PPU>";
  if (argc < 2) {
      printf("USAGE:\n\tspinlock_spu <sync-num>\n");
      return 1;
  }

  run = atoi(argv[1]);
  if (run == 0) {
      run = 1;
  }

  printf("%s PPU Spinlock at [%p]\n", TEST_ID, (void *)&spinlock);
  spinlock = 0;

  /* argp carries the address of the PPU spinlock, envp the iteration count. */
  speid = spe_create_thread( 0,
                 &cbench_spinlock_spu,
                 (unsigned long long*)&spinlock,
                 (unsigned long long*)run,
                 -1,
                 0 );
  if (speid == 0) {
    perror("Unable to create SPE thread");
    return -1;
  }

  for (i = 0; i < run; ++i) {
      /* Spin until the SPU has written into the PPU spinlock. */
      while (spinlock == 0) { }
      spinlock_spu = spinlock;
      spinlock = 0;

      /*
       * spinlock_spu now holds the LS address of the SPU spinlock; the PPU
       * releases the SPU by setting that LS word to a nonzero value.
       */
  }

  spe_wait(speid, &status, 0);
  printf("%s DONE!\n", TEST_ID);
  return 0;
}

--------- SPE Code --------------
#include <libsim.h>
#include <sim_printf.h>
#include <spu_intrinsics.h>
#include <spu_mfcio.h>
#include <cbe_mfc.h>
#include <profile.h>

volatile unsigned long long spinlock     __attribute__ ((aligned (128)));

/*
 * This has to be an unsigned long long, just because we need to write
 * this much on the destination spinlock.
 */
unsigned long long spinlock_spu_ls __attribute__ ((aligned (128)));
unsigned int       spinlock_ppu_ea __attribute__ ((aligned (128)));

int main(unsigned long long spuid,
     addr64 argp,
     addr64 envp)
{
  int tag_id = 0;

  char TEST_ID[32] = "SPINLOCK:SPU>";

  spinlock_ppu_ea = argp.ui[1];
  spinlock_spu_ls = (unsigned int)&spinlock;

  spinlock = 0;

  unsigned int run = envp.ui[1];

  sim_printf("%s TEST STARTED\n", TEST_ID);
  sim_printf("%s Performing %u Spinlock measurements\n", TEST_ID, run);
  sim_printf("%s PPU Spinlock at  [0x%x]\n", TEST_ID, spinlock_ppu_ea);
  sim_printf("%s SPU Spinlock at  [0x%llx]\n", TEST_ID, spinlock_spu_ls);
  unsigned int i;
  for (i = 0; i < run; ++i) {
      /*
       * Write the SPU spinlock LS address into the PPU spinlock.
       */
      mfc_put(&spinlock_spu_ls,
           (unsigned int)spinlock_ppu_ea,
           sizeof(spinlock_spu_ls),
           tag_id, 0, 0);

      /*
       * Wait for the PPU to activate an SPU initiated DMA to set the spinlock.
       */
      while (spinlock == 0) { }

      sim_printf("======================[%u]=========================\n", i);
      spinlock = 0;
  }
  sim_printf("%s TEST COMPLETED\n", TEST_ID);

  return 0;
}

Sunday, November 12, 2006

Cell BE Workshop

A few weeks ago there was a summit on Software and Algorithms for the Cell Processor. The material presented at the summit can be found here. Several interesting slide sets are available; I would recommend that Cell programmers take a look at Cell BE Programming Gotchas. This presentation covers a few things you always want to keep in mind when programming Cell. I would add a few things to that list, and will probably write some posts contributing more gotchas.

Another presentation worth looking at is Experience Programming Cell. It gives good hints on the kinds of computation that Cell enjoys and those it does not. To many people this should not come as a surprise, but it is always good to set the stage for everyone. After that, you might want to look at this presentation on the Roadrunner Supercomputer.

Undocumented Corner

SDK-1.1 fft_1d_r2

If you try to use the SDK 1.1 fft_1d_r2 implementation, be aware that it only works for vectors of complex numbers with a minimum of 32 elements and a maximum of 8K elements.

Cell Broadband Engine

In the last few weeks I've been working on a cool new processor,
called the Cell Broadband Engine. This is a heterogeneous multi-core processor
which will be powering the upcoming Sony PlayStation 3. The
architecture of the processor is rather neat, and programming it feels
like doing distributed computing, which is one of the things I am most
used to doing.

Cell BE Blog

I used to post Cell BE related entries in my online collection
of random thoughts. I thought that it would be better to
have a blog dedicated to Cell, so I went ahead and created it. I'll
be moving all the Cell related blog entries over here.