Berkeley UPC User's Guide version 2.16.0
This version of Berkeley UPC includes
    #include <upc_relaxed.h>
    #include <stdio.h>

    int main() {
        printf("Hello from thread %i/%i\n", MYTHREAD, THREADS);
        upc_barrier;
        return 0;
    }

This program prints a message once from each thread (in some arbitrary interleaving), executes a barrier (optional), and exits.
For more involved examples of UPC code, see the UPC Language Tutorials from the UPC Language Community website and the 'upc-examples' directory of the Berkeley UPC runtime distribution. The official UPC language specification is a useful reference, and contains a description of the standard libraries.
    upcc -o light particle.upc wave.c -lgrottymath

Note that 'wave.c' can contain either UPC code or regular C code, and the 'grottymath' library that is linked into the application can be a regular C library: Berkeley UPC is fully interoperable with regular C source, object, and library files (note: if you compile with the -pthreads flag, any C libraries you use must be thread-safe). Berkeley UPC 2.0 also adds support for linking C++/FORTRAN/MPI objects into a UPC executable: see Mixing C/C++/MPI/FORTRAN with UPC.
upcc recognizes most commonly used C compiler flags (-D, -I, etc.). It also uses a number of its own flags for the choice of network API your program will run over, for compiling your UPC code for a static number of threads, and other UPC-specific options. See the upcc man page for details.
Name | Description |
lapi | LAPI API for IBM SP networks |
gm | GM API for Myrinet networks |
elan | elan API for Quadrics networks |
ibv | OpenFabrics (aka OpenIB) InfiniBand Verbs for most InfiniBand networks |
mxm | MXM API for Mellanox InfiniBand HCAs |
vapi | (older) Verbs API for Mellanox-based InfiniBand networks |
sci | SISCI API for Dolphin-based SCI networks (EXPERIMENTAL- currently requires the Linux BigPhysMem kernel patch in order to get more than 1MB of shared heap space) |
shmem | SHMEM API for SGI Altix systems and the Cray X1. Other systems providing a SHMEM API may also work, but have not been tested. |
gemini (beta) | GNI API for Cray XE systems running CNL. (BETA- performance of current implementation is untuned) |
portals | Portals API for Cray XT systems running Catamount or CNL. |
dcmf | Deep Computing Messaging Framework for IBM BlueGene/P systems |
udp | UDP: works on any system with a standard TCP/IP stack, but is typically slower than using one of the native network types. Generally the fastest option for systems with only Ethernet hardware (notably faster than MPI-over-TCP). |
mpi | MPI: works on any system with MPI installed, but is typically slower than using one of the other network types. |
smp | "Symmetric multiprocessor (SMP)" mode: uses no network. Currently runs with only a single process unless your runtime has been configured with --enable-pshm (currently default only on Linux). Otherwise, you must pass -pthreads to upcc to run smp-conduit with multiple UPC threads. |
Note that you can only compile for a given network type if your Berkeley UPC runtime was configured to support it at build/installation time. To see which APIs are supported in your installation, and to see which is used by default, use 'upcc --version'.
An executable compiled for a fixed number of UPC threads will fail at startup if you try to run it with a different number of threads. However, fixing the number of threads allows optimization on certain operations (such as shared pointer arithmetic), especially when the number of threads is a power of 2.
Name | Value | Description | Standard |
__UPC__ | 1 | Defined by any UPC implementation | UPC language |
__UPC_VERSION__ | Monotonically increasing positive integer constant | UPC specification supported: value is YYYYMM date of that version's ratification (ex: '200310L') | UPC language |
__UPC_STATIC_THREADS__ | 1 if static threads: else undefined | Set to 1 if the '-T' flag was passed to upcc | UPC language |
__UPC_DYNAMIC_THREADS__ | 1 if dynamic threads: else undefined | Set to 1 unless the '-T' flag was passed to upcc | UPC language |
__BERKELEY_UPC__ | Monotonically increasing positive integer constant | The major version number of the Berkeley UPC release. Example: '1' for release '1.0.3'. | Berkeley UPC only |
__BERKELEY_UPC_MINOR__ | An integer constant | The minor version number of the Berkeley UPC release. Example: '0' for release '1.0.3'. | Berkeley UPC only |
__BERKELEY_UPC_PATCHLEVEL__ | An integer constant | The patch version number of the Berkeley UPC release. Example: '3' for release '1.0.3'. | Berkeley UPC only |
__BERKELEY_UPC_<NETWORK>_CONDUIT__ | 1, or undefined | Identifies the network API used. Example: if 'upcc -network=mpi' is used, '__BERKELEY_UPC_MPI_CONDUIT__' will be defined, with the value of 1 | Berkeley UPC only |
__BERKELEY_UPC_PTHREADS__ | 1, or undefined | Defined to 1 if and only if the '-pthreads' flag is used | Berkeley UPC only |
__BERKELEY_UPC_RUNTIME__ | 1, or undefined | Defined to 1 if and only if the Berkeley UPC runtime is used, regardless of whether the Berkeley UPC translator or GCCUPC is used | Berkeley UPC and GCCUPC+UPCR |
__BERKELEY_UPC_RUNTIME_DEBUG__ | 1, or undefined | Defined to 1 if and only if a debugging runtime is used (i.e. '-g' passed to upcc). | Berkeley UPC and GCCUPC+UPCR |
__BERKELEY_UPC_RUNTIME_RELEASE__ | An integer constant, or undefined | The major version number of the Berkeley UPC Runtime library. Example: '2' for release '2.12.0'. | Berkeley UPC and GCCUPC+UPCR. Undefined prior to release 2.12.0 |
__BERKELEY_UPC_RUNTIME_RELEASE_MINOR__ | An integer constant, or undefined | The minor version number of the Berkeley UPC Runtime library. Example: '12' for release '2.12.0'. | Berkeley UPC and GCCUPC+UPCR. Undefined prior to release 2.12.0 |
__BERKELEY_UPC_RUNTIME_RELEASE_PATCHLEVEL__ | An integer constant, or undefined | The patch version number of the Berkeley UPC Runtime library. Example: '0' for release '2.12.0'. | Berkeley UPC and GCCUPC+UPCR. Undefined prior to release 2.12.0 |
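As a minimal sketch of how the macros in the table above might be used, the following C fragment selects code paths at compile time. The function names 'thread_mode' and 'runtime_at_least_2_12' are hypothetical and are not part of Berkeley UPC; only the macro names come from the table. Because the macros are simply undefined under an ordinary C compiler, the fragment also builds as plain C.

```c
/* Describes the threading mode chosen at compile time. */
const char *thread_mode(void) {
#if defined(__UPC_STATIC_THREADS__)
    return "static threads";
#elif defined(__UPC_DYNAMIC_THREADS__)
    return "dynamic threads";
#else
    return "not compiled as UPC";
#endif
}

/* Returns 1 if the Berkeley UPC runtime release is known to be 2.12 or newer.
 * Returns 0 when the release macros are undefined, which per the table is the
 * case before release 2.12.0 (and when no Berkeley UPC runtime is in use). */
int runtime_at_least_2_12(void) {
#if defined(__BERKELEY_UPC_RUNTIME_RELEASE__) && \
    defined(__BERKELEY_UPC_RUNTIME_RELEASE_MINOR__)
    return (__BERKELEY_UPC_RUNTIME_RELEASE__ > 2) ||
           (__BERKELEY_UPC_RUNTIME_RELEASE__ == 2 &&
            __BERKELEY_UPC_RUNTIME_RELEASE_MINOR__ >= 12);
#else
    return 0;
#endif
}
```

Testing for the macro's presence before using its value matters here, since an undefined macro silently evaluates to 0 inside '#if'.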
A remote translator can be used over either HTTP or SSH. To use HTTP, the 'upcc.cgi' CGI script (located in the 'contrib' directory of the runtime distribution) must be installed and configured with a web server on the remote host. Simply set the 'translator' parameter in your user configuration file (or the global 'upcc.conf') to the URL for the CGI script. To use SSH, you must be able to log in to the remote host using SSH, and the 'translator' parameter must be set to 'remote_host:/path/to/translator'. You will want to use key-based authentication and 'ssh-agent' to avoid entering your password each time you compile. See our SSH Agent Tutorial.
When using an HTTP-based remote translator, upcc also includes support for use of an HTTP proxy. Set the 'http_proxy' parameter in your user configuration file (or the global 'upcc.conf') to the proxy URL. The upcc front end does not currently support HTTPS or SOCKS proxies, nor HTTP proxies that require authentication (HTTP error 407).
If you wish to create a reusable set of compiled code, you must currently keep the files in *.o format. So, instead of the traditional C format, where you'd create 'libmyupc.a', and then link with something like
    upcc myprogram.o -L/libpath -lmyupc

you must instead do something like

    upcc myprogram.o /libpath/libmyupc/*.o

Note that beginning with Berkeley UPC 2.12.0 it is possible to link together static threads and dynamic threads objects, with the result being a static threads executable. In many cases this allows use of a dynamic threads object in the role of a library, which can be linked to an executable with any dynamic or static thread setting.
Berkeley UPC executables should be run the same way as any other parallel program on your system that uses the same underlying network API. So, for instance, a program compiled with '--network=mpi' is run on many systems via 'mpirun -np <number of processes> a.out'. Other systems may use other invocations, such as 'prun' or 'poe', especially when APIs other than MPI are used. Consult your system's documentation for details.
    upcrun -n 4 parboil

This example runs the UPC executable 'parboil' with 4 UPC threads.
An additional benefit of using upcrun is that it provides consistent support for propagating environment variables to all threads of your UPC program. If you use upcrun, any environment variable beginning with either 'UPC_' or 'GASNET_' is guaranteed to be propagated to all threads. (Support for propagating all environment variables is planned). If you do not use upcrun, environment propagation will only work to the extent that the parallel job launcher you use provides it normally.
You can see how upcrun thinks your job should be run without actually running it by passing the '-t' flag to it. Also, 'upcrun -i <executable>' will provide information about a Berkeley UPC executable, such as the network API that it was built against, and the number of fixed threads (if any) that it was compiled for.
See 'upcrun --help' or the upcrun man page for more information.
The default amount of shared memory to reserve per UPC thread on a system is chosen at configure time (see the INSTALL.TXT document in the runtime distribution for details), but you can override that value for a particular application either at compile time, or at startup. Generally this is only needed if you observe that your application is running out of either shared or regular C memory.
To embed a different default amount of shared memory into your application, simply pass '-shared-heap=144MB' for instance (to get 144 megabytes per UPC thread). You can also use 'GB' for gigabyte amounts (if neither 'MB' nor 'GB' is used, megabytes are assumed). To override the embedded default amount of shared memory at application startup, set the UPC_SHARED_HEAP_SIZE environment variable to whatever value you want ('2GB', etc.), or pass '-shared-heap' to upcrun.
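The unit rule described above (an 'MB' or 'GB' suffix, with megabytes assumed when no suffix is given) can be sketched in plain C as follows. This is only an illustration of the rule, not the parser used by upcc or upcrun: the function name 'heap_size_mb' is hypothetical, and the sketch assumes that 1 GB means 1024 MB and that suffixes are written in upper case.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts "144MB", "2GB", or "144" to megabytes.
 * Returns -1 for malformed or non-positive input.
 * Assumption: 1 GB is treated as 1024 MB. */
long heap_size_mb(const char *s) {
    char *end;
    long n = strtol(s, &end, 10);
    if (end == s || n <= 0)
        return -1;                        /* no digits, or not positive */
    if (*end == '\0' || strcmp(end, "MB") == 0)
        return n;                         /* megabytes are the default unit */
    if (strcmp(end, "GB") == 0)
        return n * 1024;
    return -1;                            /* unrecognized suffix */
}
```

Under these assumptions, 'heap_size_mb("2GB")' and 'heap_size_mb("2048")' describe the same amount of shared memory per UPC thread.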
While it is tempting to simply grab an extremely large shared memory segment, be aware that this is not always a good idea, or even possible. Since the shared address space range cannot be used for regular malloc allocations, creating too large a shared space can cause the amount of regular heap memory available to your application to become small (causing malloc to eventually return NULL when you request more memory). Also, the shared memory space is reserved via an mmap() call, and while this does not generally cause any physical memory pages to be allocated, certain operating systems (for instance, Linux) will not allow more memory to be reserved by applications than the OS can guarantee is available, and so allocating a shared region larger than the physical memory (plus swap space) may fail.
The default amount of shared memory per UPC thread can be changed system-wide by modifying the 'shared_heap' parameter in the installation's upcc.conf file. You can override the system-wide default for your own applications by setting shared_heap in your user configuration file.
The upcc.conf file also provides a 'heap_offset' parameter (and upcc provides a '-heap-offset' flag) that affects where the address region for shared memory is located in your program. However, at present it is not useful on any of our supported systems, and so we do not recommend its use.
The '-pthreads' flag must be passed consistently at all stages of compilation and linking. Also, when pthreads are used, upcc needs to delay much of the compilation of your code until link time, so if you split code generation into separate compilation and linking steps (i.e., 'upcc -c foo.upc', followed by 'upcc foo.o bar.o'), you need to pass any macro and/or include path directives (ex: '-DFOO=bar -I/usr/local/include') to upcc in both the compilation and link commands.
Any C libraries that your code links against must be thread-safe in order to be used with -pthreads. If one or more of your libraries is not thread-safe, you must compile without pthreads, and run separate processes on the same machine to exploit an SMP system. Currently, such processes will not use any shared memory optimizations, and will communicate with each other via the network API. Support for shared memory between non-pthreaded Berkeley UPC processes will be provided in the near future.
When you link an application with '-pthreads', a subdirectory named <executable_name>_pthread-link will be created in the current directory. This directory exists in order to speed up further linking commands of the same program. If you link the same application again with the same object file names, and none of the global static unshared variables in your program have changed name or size, recompilation of all the files in your application can be avoided, which can make a significant difference in build time for programs with many source files. You may delete the temporary directory at any time without any side effects (other than possibly longer link times).
Unless otherwise specified, pthreaded UPC applications use a default number of pthreads per process (run 'upcc --version' to see the default for your system). This number is set in the upcc.conf configuration file, and can be changed there (or in your user configuration file). It can also be overridden in several ways. Compiling with 'upcc -pthreads=<NUMBER>' changes the default number of pthreads per UPC process for an executable to NUMBER. If the 'UPC_PTHREADS_PER_PROC' environment variable is set to a nonzero integer when you run a UPC program, it will override any default value. Finally, upcrun is smart about pthreads in several ways. First, if you run a pthreaded parallel job with 'upcrun -n <NUMBER> ...', the number of processes actually launched will be divided by the number of pthreads, so that exactly NUMBER UPC threads are used. Second, if you use -network=smp (which generates an executable that will run only a single process), 'upcrun -n NUMBER' will automatically set the number of pthreads to NUMBER.
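The override order and the process-count arithmetic described above can be sketched in plain C. Both function names are hypothetical, and the sketch is not taken from upcrun: in particular, it assumes that only a positive 'UPC_PTHREADS_PER_PROC' value counts as an override, and it reports an error when the thread count is not a multiple of the pthread count, a case the text above does not cover.

```c
#include <stdlib.h>

/* Picks the pthreads-per-process count: a positive UPC_PTHREADS_PER_PROC
 * value (passed in as env_value, which may be NULL) overrides the default
 * that was compiled into the executable. */
int pthreads_per_proc(const char *env_value, int compiled_default) {
    if (env_value != NULL) {
        int n = atoi(env_value);
        if (n > 0)
            return n;
    }
    return compiled_default;
}

/* Number of processes to launch so that processes * pthreads equals the
 * requested number of UPC threads. Returns -1 if the division is not exact
 * (assumption: the guide does not say what happens in that case). */
int processes_to_launch(int upc_threads, int pthreads) {
    if (pthreads <= 0 || upc_threads <= 0 || upc_threads % pthreads != 0)
        return -1;
    return upc_threads / pthreads;
}
```

For example, with 4 pthreads per process, a request for 8 UPC threads corresponds to 2 processes in this model.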
Berkeley UPC programs can now be debugged (with support for UPC-specific constructs) by the "TotalView" debugger produced by Rogue Wave Software. TotalView version 7.0.1 or greater is required, and support is currently only provided on x86 architectures, using either MPI or Quadrics/elan for the network. See our tutorial on using Berkeley UPC with TotalView for details.
If you do not have TotalView, you can also use a regular C debugger and get partial debugging support. Berkeley UPC provides several mechanisms for attaching a regular C debugger to one or more of your UPC application's threads at various points during execution. While this does not provide a fully normal debugging environment (the debugger will show the C code emitted by our translator, rather than your UPC code), it can still allow you to see program stack traces and other important information. This can be very useful if you wish to submit a helpful bug report to us. See Attaching a regular C debugger to Berkeley UPC programs for details.
Berkeley UPC also supports automatically generating backtraces if a fatal error occurs in your program. This will allow you to see a stack trace of the function calls that your program was in at the time it crashed. To use auto-backtracing, run with upcrun -backtrace or set GASNET_BACKTRACE=1 in your environment. The level of backtracing support available depends on the back-end C compiler and operating system, and so not all systems are equally functional, and some systems will not provide backtraces. See gasnet/README for more information on backtracing.
Examining tracing information is one of the best ways to go about optimizing your UPC programs. It provides a way for you to see which lines of your code are generating the most network traffic (and the size of the network messages used). From this you may be able to determine how to either avoid some of this traffic, or change your code to use fewer, larger messages (for instance, by replacing sets of individual reads/writes with bulk memory movement calls like 'upc_memget()', etc.), which is typically more efficient. Examining barrier wait times can also let you know if your computations are imbalanced across threads, and/or if you could profit by using split-phase barriers, moving computation in between 'upc_notify' and 'upc_wait'.
Note that running with tracing may slow down your application considerably: the exact amount depends on your filesystem, and the ratio of communication/computation in your program. If you are only interested in a subset of trace information, consider setting 'GASNET_TRACEMASK' as described below.
ID | Feature |
G | Network 'gets'. These include both bulk gets (from upc_memget, etc.), and network get operations caused by reading shared memory via shared variables/pointers. The 'G' mask does not include 'local' gets (i.e. reads from shared memory which has affinity to the reading UPC thread), as these do not result in network traffic. Use 'H' to trace local gets. |
P | Network 'puts'. These include both bulk puts (from upc_memput, etc.), and put operations caused by writing to shared memory via variables/pointers. The 'P' mask does not include 'local' puts (i.e. writes to shared memory which has affinity to the writing UPC thread), as these do not result in network traffic. Use 'H' to trace local puts. |
B | Barriers, including both blocking (upc_barrier) and non-blocking (upc_notify followed by upc_wait: a pair of these count as a single barrier). |
N | Line number information from UPC source files. The "N" and "H" flags must always be among those set for upc_trace to work! |
H | Miscellaneous UPC information. The "N" and "H" flags must always be among those set for upc_trace to work! Passing this flag causes the following things to be traced: |
To trace only a subset of these features, set the 'GASNET_TRACEMASK' environment variable to a string containing the IDs of the features you wish to trace. Note that the "N" and "H" flags must always be among those set for 'upc_trace' to work (if you intend to examine the trace file manually, they do not need to be set).
So, for instance, if you are trying to perform an analysis that does not require get/put information, you are highly advised to set 'GASNET_TRACEMASK' to "BHN" and 'GASNET_TRACELOCAL' to "no" (or "0"). This will turn off tracing for all get and put operations. Since gets/puts are typically the majority of items in a full trace file, this will probably result in much faster program execution, a much smaller trace file, and faster analysis by 'upc_trace'.
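Since a mask that omits "N" or "H" produces a trace that 'upc_trace' cannot use, it can be worth checking a mask string before a long run. The following plain C sketch does so; the function name 'mask_ok_for_upc_trace' is hypothetical and is not part of Berkeley UPC or GASNet, and the check assumes the IDs are given in upper case as in the table above.

```c
#include <string.h>

/* Returns 1 if 'mask' contains both 'N' and 'H', which the guide says
 * upc_trace requires, and 0 otherwise (including for NULL or empty masks). */
int mask_ok_for_upc_trace(const char *mask) {
    return mask != NULL &&
           strchr(mask, 'N') != NULL &&
           strchr(mask, 'H') != NULL;
}
```

With this check, "BHN" and "PGHN" are accepted, while "PG" is rejected.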
    extern void bupc_trace_setmask (const char *newmask);
    extern const char * bupc_trace_getmask ();
    extern int bupc_trace_gettracelocal ();
    extern void bupc_trace_settracelocal (int val);
    void bupc_trace_printf ((const char *msg, ...));

'bupc_trace_getmask' and 'bupc_trace_setmask' allow programmatic retrieval and modification of the trace masks in effect for the calling thread. The initial values are determined by the 'GASNET_TRACEMASK' environment variable, and the input and output to the mask manipulation functions have the same format as 'GASNET_TRACEMASK' values. Note that whenever any tracing is enabled (i.e. unless you are temporarily turning off tracing by passing an empty string), the "N" and "H" flags must always be among those set for 'upc_trace' to work.
Ex:

    bupc_trace_setmask("PGHN");     // trace everything
    bupc_trace_settracelocal(1);    // include local puts and gets
    // do something...
    bupc_trace_setmask("");         // stop tracing
The 'bupc_trace_printf' utility outputs a message into the trace file, if it exists. Note that two sets of parentheses are required when invoking this operation, in order to allow it to compile away completely for non-tracing builds.
Ex:

    double A[4] = ...;
    int i = ...;
    bupc_trace_printf(("the value of A[%i] is: %f", i, A[i]));
To generate statistics, simply set the 'GASNET_STATSFILE' environment variable to a file name, into which statistics will be written at the end of your program's run. (Note: by default, only debug executables support statistics generation, as it incurs a performance penalty: if you wish to have non-debug UPC executables generate statistics, you must rebuild your UPC runtime system, passing '--with-multiconf=+opt_trace' to configure, then build your application with 'upcc -trace'.) You may generate both stats and tracing info for the same program run if you wish.
Just as with tracing, you may set a mask to control what types of events are included in the statistics, by setting the 'GASNET_STATSMASK' environment variable, and/or by calling the following functions:
    extern void bupc_stats_setmask (const char *newmask);
    extern const char * bupc_stats_getmask ();

The same mask IDs are used by the tracing and statistics masks, i.e., calling 'bupc_stats_setmask("BP")' would cause execution to gather statistics only for barriers and puts. See the table in the tracing documentation for the list of IDs.
    upcc -pg foo.c
    upcrun -n 2 a.out
    gprof a.out gmon.out.0/gmon.out gmon.out.1/gmon.out | less

Note that 'gprof' provides timings and statistics for processor usage: it does not include time during which the process has been put to sleep waiting for I/O (including network reads/writes). However, since Berkeley UPC uses spin-locks in many cases to wait for network events, rather than blocking system calls, you may see that certain 'gasnet*' functions consume large amounts of CPU time. This generally means that your program is spending most of that time waiting for network communication to complete (some fraction is the software overhead inherent in sending/receiving the network traffic).

If your program spends a lot of time waiting for network operations to complete, you may be suffering from an imbalanced load across threads (so that some take longer to "catch up" to a barrier, for instance). Restructuring your application may avoid these waiting periods. Or you may be able to use some of this "spare" time for computation (or other network traffic) by switching to non-blocking barriers (i.e., 'upc_notify/upc_wait'), and/or our nonblocking memcpy extensions to UPC. Replace blocking network constructs (such as 'upc_barrier', 'upc_memcpy', and reads/writes to shared variables) with non-blocking equivalents, and insert unrelated computation (and/or network traffic) in between the initialization and completion calls. Of course, you must be able to find unrelated computation/communication for this to work, and the degree to which this is possible will depend on your application.
The full interface is described in our Proposal for Extending the UPC Memory Copy Library Functions. See that document for details on the functions and their usage.
The full interface is described in our Proposal for Extending the UPC Libraries with Explicit Point-to-Point Synchronization Support. See that document for details on the functions and their usage.
    int bupc_dump_shared(shared const void *ptr, char *buf, int maxlen);

Any pointer to a shared type may be passed to this function. The 'maxlen' parameter gives the length of the buffer pointed to by 'buf', and this length must be at least BUPC_DUMP_MIN_LENGTH, or else -1 is returned, and errno set to EINVAL. On success, the function returns 0. The buffer will contain either "<NULL>" if the pointer to shared == NULL, or a string of the form

    "<address=0x1234 (addrfield=0x1234), thread=4, phase=1>"

The 'address' field provides the virtual address for the pointer, while the 'addrfield' contains the actual contents of the shared pointer's address bits. On some configurations these values may be the same (if the full address of the pointer can be fit into the address bits), while on others they may be quite different (if the address bits store an offset from a base initial address that may differ from thread to thread).
Both bupc_dump_shared() and BUPC_DUMP_MIN_LENGTH are visible when any of the standard UPC headers (upc.h, upc_relaxed.h, or upc_strict.h) are #included.
The 'bupc_ptradd()' function provides support for performing pointer-to-shared arithmetic with general blocksize, which need not be compile-time constant.
    shared void * bupc_ptradd(shared void *p, size_t blockelems, size_t elemsz, ptrdiff_t elemincr);

    - 'p': the base pointer
    - 'blockelems': the block size (number of elements in a block)
    - 'elemsz': the element size (usually sizeof(*p))
    - 'elemincr': the positive or negative offset from the base pointer
The following call:

    bupc_ptradd(p, blockelems, sizeof(T), elemincr);

returns a value q as if it had been computed:

    shared [blockelems] T *q = p;
    q += elemincr;

However, the blockelems argument is not required to be a compile-time constant. Blockelems must be non-negative, but may be zero to indicate an indefinite blocking factor. Here's an example of indexing into a dynamically-allocated array whose block size is not known until run time.
    int blockelems = ...;   // choose some arbitrary block size
    // allocate an array of doubles with that blocksize
    shared void *myarr = upc_all_alloc(..., blockelems*sizeof(double));
    // access element 14
    double d = *(shared double *)bupc_ptradd(myarr, blockelems, sizeof(double), 14);
It's worth noting that in some cases bupc_ptradd() may be less efficient than regular pointer-to-shared addition, because the compile-time constant blocksize of the pointer referent type generally makes the latter more amenable to compiler optimization of the addition operation and surrounding code. This is especially true in the case of indefinitely blocked or cyclically blocked pointers-to-shared. However, the cost may be worth the added convenience in non-performance-critical code.
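To make the arithmetic behind blocked pointer-to-shared addition concrete, the following plain C sketch models where an element of a blocked shared array lives, following the standard UPC layout rule (blocks are dealt out to threads round-robin). It is a model of the layout only, not the implementation of bupc_ptradd(): the 'elem_pos' type and 'locate' function are hypothetical, the base pointer is assumed to sit at thread 0 with phase 0, and 'threads' must be greater than zero.

```c
#include <stddef.h>

typedef struct {
    size_t thread;  /* thread with affinity to the element */
    size_t phase;   /* position of the element within its block */
    size_t local;   /* element offset within that thread's portion */
} elem_pos;

/* Locates element 'index' of an array with block size 'blockelems'
 * distributed over 'threads' threads. */
elem_pos locate(size_t index, size_t blockelems, size_t threads) {
    elem_pos pos;
    if (blockelems == 0) {
        /* Indefinite block size: everything stays on the base thread. */
        pos.thread = 0;
        pos.phase = 0;
        pos.local = index;
        return pos;
    }
    size_t block = index / blockelems;       /* which block overall */
    pos.thread = block % threads;            /* blocks are dealt round-robin */
    pos.phase = index % blockelems;
    pos.local = (block / threads) * blockelems + pos.phase;
    return pos;
}
```

For instance, with a block size of 4 and 3 threads, element 14 falls in block 3, which wraps around to thread 0 at phase 2.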
You will normally not need to call this function, as the runtime will automagically perform checks for incoming network requests whenever your UPC code causes network activity to be performed, and this usually occurs fairly frequently in a UPC application. However, if you are writing your own 'spin lock' style synchronization, you may need to use this function to avoid deadlock. Here is an example:
    shared strict int flag[THREADS];
    ...
    if (MYTHREAD % 2) {
        while (flag[MYTHREAD] == 0)
            bupc_poll();
    } else {
        ... some calculation ...
        flag[MYTHREAD + 1] = 1;
    }

Here the 'even' UPC threads are performing some calculation, then informing the 'odd' threads that the result is ready by setting a per-thread flag (the example assumes THREADS is even, so that each even thread has an odd neighbor at MYTHREAD + 1). If the 'bupc_poll()' were omitted, the 'odd' threads might (on certain platforms/networks) consume all of the CPU forever in the 'while' test, never checking for the incoming network message that would set flag[MYTHREAD].
If a program contains computationally intensive sections in which no remote accesses are performed for a long time, it is also possible that performance may be improved by intermittently calling bupc_poll, particularly if other threads are likely to be performing remote accesses (or memory allocation requests) during this time.
If 'expr' has a static type which is identical to 'type', does nothing. Otherwise, prints a non-fatal warning containing the line number and a description of the two differing types.
NOTICE: See the end of this section for information on new standardized interfaces which will replace these Berkeley-specific ones in a future release.
    typedef bupc_tick_t; /* 64-bit integral type */
    #define BUPC_TICK_MAX
    #define BUPC_TICK_MIN
    bupc_tick_t bupc_ticks_now ();
    uint64_t bupc_ticks_to_us (bupc_tick_t ticks);
    uint64_t bupc_ticks_to_ns (bupc_tick_t ticks);
    double bupc_ticks_granularityus ();
    double bupc_ticks_overheadus ();

The 'bupc_tick_t' type and associated functions provide portable support for querying high-precision system timers for obtaining wall-clock timings of sections of code. Most CPU hardware offers access to high-performance timers with a handful of instructions, providing timer precision and overhead that can be several orders of magnitude better than can be obtained through the use of the gettimeofday() system call.
The 'bupc_tick_t' type represents an integral quantity of abstract timer ticks, whose ratio to real time is system-dependent and thread-dependent. bupc_ticks_now() returns the current value of the tick timer for the calling thread, using the fastest mechanism available. bupc_ticks_to_us() and bupc_ticks_to_ns() convert a difference in bupc_tick_t values obtained by the calling thread into microseconds or nanoseconds, respectively. The bupc_ticks_to_{us,ns}() conversion calls can be significantly more expensive than the bupc_ticks_now() tick query, so for timing short intervals it's recommended to keep timing results in units of ticks until final output. BUPC_TICK_MAX and BUPC_TICK_MIN provide tick values which are respectively larger and smaller than any possible tick value. bupc_ticks_granularityus() and bupc_ticks_overheadus() respectively report the estimated microsecond granularity (minimum time between distinct ticks) and microsecond overhead (time it takes to read a single tick value, not including conversion) for the timer facility.
Example:

    bupc_tick_t start = bupc_ticks_now();
    compute_foo(); /* do something that needs to be timed */
    bupc_tick_t end = bupc_ticks_now();

    printf("Time was: %d microseconds\n", (int)bupc_ticks_to_us(end-start));
    printf("Timer granularity: <= %.3f us, overhead: ~ %.3f us\n",
           bupc_ticks_granularityus(), bupc_ticks_overheadus());
    printf("Estimated error: +- %.3f %%\n",
           100.0*(bupc_ticks_granularityus()+bupc_ticks_overheadus()) /
           bupc_ticks_to_us(end-start));

It's important to keep in mind that raw bupc_tick_t values are thread-specific quantities with a thread-specific interpretation (e.g. they might represent a hardware cycle count on a particular CPU, starting at some arbitrary time in the past). More specifically, raw ticks do NOT provide a globally-synchronized timer (i.e. the simultaneous absolute tick values may differ across threads), and furthermore the tick-to-wallclock conversion ratio might also differ across threads (e.g. on a cluster with heterogeneous CPU clock rates, the raw tick values may advance at different rates for different threads). Therefore as a rule of thumb, raw bupc_tick_t values and bupc_tick_t intervals obtained by different threads should never be directly compared or arithmetically combined, without first converting the relevant tick intervals to wall time intervals.
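The error estimate printed in the example above divides by the measured interval in microseconds, which can be zero when the timed code finishes in under a microsecond. A minimal plain C sketch of the same formula with that case guarded is shown below. The function name 'timing_error_percent' is hypothetical; in UPC code its arguments would presumably come from bupc_ticks_granularityus(), bupc_ticks_overheadus(), and bupc_ticks_to_us().

```c
/* Estimated relative timing error in percent:
 *   100 * (granularity + overhead) / elapsed
 * Returns -1.0 if elapsed_us is not positive, i.e. the interval was too
 * short to measure at microsecond resolution. */
double timing_error_percent(double granularity_us, double overhead_us,
                            double elapsed_us) {
    if (elapsed_us <= 0.0)
        return -1.0;
    return 100.0 * (granularity_us + overhead_us) / elapsed_us;
}
```

For very short intervals, converting with bupc_ticks_to_ns() instead and scaling the result is one way to keep the denominator from collapsing to zero.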
NOTICE: The following are slated for inclusion in the upcoming 1.3 revision of the UPC specification, with the same semantics as described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. Once revision 1.3 of the UPC spec is final, all Berkeley-specific timer interfaces will become deprecated.
    #define __UPC_TICK__ 1   // predefined feature macro
    #include <upc_tick.h>    // defines the following:

    typedef upc_tick_t;
    #define UPC_TICK_MAX
    #define UPC_TICK_MIN
    upc_tick_t upc_ticks_now (void);
    uint64_t upc_ticks_to_ns (upc_tick_t ticks);
    unsigned int bupc_thread_distance(int threadX, int threadY);
    #define BUPC_THREADS_SAME
    #define BUPC_THREADS_VERYNEAR
    #define BUPC_THREADS_NEAR
    #define BUPC_THREADS_FAR
    #define BUPC_THREADS_VERYFAR

bupc_thread_distance takes two thread identifiers (whose values must be in 0..THREADS-1, otherwise behavior is undefined), and returns an unsigned integral value which represents an approximation of the abstract 'distance' between the hardware entity which hosts the first thread, and the hardware entity which hosts the memory with affinity to the second thread. In this context 'distance' is intended to provide an approximate and relative measure of expected best-case access time between the two entities in question. Several abstract 'levels' of distance are provided as pre-defined constants for user convenience. In order of monotonically non-decreasing 'distance', they are BUPC_THREADS_SAME, BUPC_THREADS_VERYNEAR, BUPC_THREADS_NEAR, BUPC_THREADS_FAR, and BUPC_THREADS_VERYFAR.
These constants have implementation-defined integral values which are monotonically increasing in the order given above. Implementations may add further intermediate levels with values between BUPC_THREADS_VERYNEAR and BUPC_THREADS_VERYFAR (with no corresponding define) to represent deeper hierarchies, so users should test against the constants using <= or >= instead of ==.
The intent of the interface is for users to not rely on the physical significance of any particular level and simply test the differences to discover which threads are relatively closer than others. Implementations are encouraged to document the physical significance of the various levels whenever possible. However, any code based on assuming exactly N levels of hierarchy or a fixed significance for a particular level will probably not be performance portable to different implementations or machines.
The relation is symmetric, i.e.:
bupc_thread_distance(X,Y) == bupc_thread_distance(Y,X)
Furthermore, the value of bupc_thread_distance(X,Y) is guaranteed to be unchanged over the span of a single program execution, and the same value is returned regardless of the thread invoking the query.
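The advice to compare distances relatively rather than against fixed levels can be illustrated with a small search for the closest peer. This is a sketch in plain C, written under stated assumptions: nearest_other_thread, distance_fn and mock_distance are hypothetical names introduced here, and the distance function is passed as a pointer so that the code compiles without the Berkeley UPC runtime.

```c
/* Signature matching bupc_thread_distance(); a function pointer is used so
 * this sketch compiles as plain C without the Berkeley UPC runtime. */
typedef unsigned int (*distance_fn)(int threadX, int threadY);

/* Hypothetical helper: return the thread (other than `me`) with the smallest
 * distance from `me`, or -1 if there is no other thread.
 * Only relative comparisons (<) are used; no level is tested with ==. */
int nearest_other_thread(int me, int nthreads, distance_fn dist)
{
    int best = -1;
    unsigned int best_d = 0;
    for (int t = 0; t < nthreads; t++) {
        if (t == me)
            continue;
        unsigned int d = dist(me, t);
        if (best < 0 || d < best_d) {
            best = t;
            best_d = d;
        }
    }
    return best;
}

/* Stand-in distance function for illustration only: pretends that threads
 * are packed four per node and uses made-up numeric levels. */
unsigned int mock_distance(int x, int y)
{
    if (x == y)
        return 0;
    return (x / 4 == y / 4) ? 1 : 2;
}
```

In a UPC program the call would presumably look like nearest_other_thread(MYTHREAD, THREADS, bupc_thread_distance), assuming bupc_thread_distance can be used through a function pointer; if it cannot, a thin wrapper function would be needed. Since the distance is guaranteed not to change during a run, the result can be computed once and cached.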
    int bupc_castable(shared void *ptr);
    int bupc_thread_castable(unsigned int threadnum);
    void * bupc_cast(shared void *ptr);
This family of functions implements a UPC language extension proposed by Brian Wibecan of HP. Their purpose is to allow a UPC programmer to take advantage of UPC implementations in which some or all of the data of a given UPC thread can be directly addressed by other UPC threads.
We use the term 'castable' to denote that the UPC implementation is able to represent a given pointer-to-shared using a pointer-to-private in a given thread. Any pointer-to-shared with affinity to a thread is also castable by that same thread. However, in general shared storage with affinity to one thread is not castable by other threads. Depending on the UPC implementation, it is possible that for a given pair of threads either all, none, or only some of the shared address space with affinity to the first may be castable by the second.
bupc_castable() takes a shared pointer as argument and returns non-zero if and only if the argument is castable by the calling thread. It is guaranteed that a call to bupc_castable() with an argument having affinity to the calling thread will always return non-zero.
bupc_thread_castable() takes a UPC thread number as argument and returns non-zero if and only if every pointer-to-shared with affinity to the argument thread is castable by the calling thread. It is guaranteed that bupc_thread_castable(MYTHREAD) is always non-zero.
bupc_cast() takes a shared pointer as argument and returns a pointer-to-private. The returned pointer may be used to reference the same object as the argument only if the argument pointer is castable by the calling thread, as may be determined by bupc_castable() or bupc_thread_castable(). Otherwise the returned pointer is undefined (in particular there is no guarantee that the return value is NULL).
In addition to the three functions above, which have been proposed (with upc_ prefixes) for inclusion in the UPC language specification, Berkeley UPC 2.14.2 and later implement an 'inverse cast' function:
    shared void * bupc_inverse_cast(void *ptr);

This function takes a pointer-to-private argument and returns a pointer-to-shared referencing the same location if and only if the argument references a shared object. If the argument is NULL or references a location not in the shared space, then the return value is (shared void *)NULL.
    type bupc_atomicX_read_RS(shared void *ptr);
    void bupc_atomicX_set_RS(shared void *ptr, type val);
    type bupc_atomicX_swap_RS(shared void *ptr, type val);
    type bupc_atomicX_mswap_RS(shared void *ptr, type mask, type val);
    type bupc_atomicX_cswap_RS(shared void *ptr, type oldval, type newval);
    type bupc_atomicX_fetchadd_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchand_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchor_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchxor_RS(shared void *ptr, type op);
    type bupc_atomicX_fetchnot_RS(shared void *ptr);

Where type and X take on the values of each pair from the following table, and RS is either 'strict' or 'relaxed'.
Type | X |
int | I |
unsigned int | UI |
long | L |
unsigned long | UL |
int64_t | I64 |
uint64_t | U64 |
int32_t | I32 |
uint32_t | U32 |
This family of functions provides atomic read, write and read-modify-write of the indicated data types. When these functions are used to access a memory location in a given synchronization phase, atomicity is guaranteed if and only if no other mechanisms are used to access the same memory location in the same synchronization phase. Memory accesses are relaxed or strict as indicated by the function names.
The swap functions set the location given by the first argument to the value of the second argument while atomically returning the prior value. The mswap (masked swap) functions atomically update the location given by the first argument to a value obtained by replacing those bits set in mask with the corresponding values from val, while returning the prior value.
The cswap (conditional swap) functions atomically set the location given by the first argument to the value newval only if the current value is equal to oldval, but return the prior value regardless of whether the write was performed.
The fetchadd functions atomically add the second argument to the location given by the first argument and return the value prior to the addition. Similarly, the fetchand, fetchor and fetchxor functions atomically perform the appropriate bit-wise operation and return the value prior to the operation. The fetchnot functions atomically perform a bit-wise negation of the location given by the argument and return the value prior to the negation.
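The update rules described above can be written out as ordinary sequential C to make the stored and returned values explicit. The functions below are reference models only: the model_ names are hypothetical, the code is NOT atomic, and it is not the Berkeley UPC implementation. It operates on a private uint32_t purely to show the arithmetic.

```c
#include <stdint.h>

/* Sequential reference models of the update rules only. These are NOT
 * atomic and are NOT the Berkeley UPC implementation. */

/* mswap: replace the bits selected by mask with the matching bits of val,
 * returning the prior value. */
uint32_t model_mswap(uint32_t *loc, uint32_t mask, uint32_t val)
{
    uint32_t prior = *loc;
    *loc = (prior & ~mask) | (val & mask);
    return prior;
}

/* cswap: store newval only if the current value equals oldval. */
uint32_t model_cswap(uint32_t *loc, uint32_t oldval, uint32_t newval)
{
    uint32_t prior = *loc;
    if (prior == oldval)
        *loc = newval;
    return prior; /* returned whether or not the store happened */
}

/* fetchadd: add op and return the value held before the addition. */
uint32_t model_fetchadd(uint32_t *loc, uint32_t op)
{
    uint32_t prior = *loc;
    *loc = prior + op;
    return prior;
}
```

With cswap, a caller can tell whether the write took place by comparing the returned value with oldval: the two are equal exactly when the store was performed.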
In addition to the relaxed and strict atomic operations on shared data, the following are available to operate on private pointers, including pointers to shared data with local affinity. Other than the type of the first argument, these functions operate identically to the relaxed atomics above.
    type bupc_atomicX_read_private(void *ptr);
    void bupc_atomicX_set_private(void *ptr, type val);
    type bupc_atomicX_swap_private(void *ptr, type val);
    type bupc_atomicX_mswap_private(void *ptr, type mask, type val);
    type bupc_atomicX_cswap_private(void *ptr, type oldval, type newval);
    type bupc_atomicX_fetchadd_private(void *ptr, type op);
    type bupc_atomicX_fetchand_private(void *ptr, type op);
    type bupc_atomicX_fetchor_private(void *ptr, type op);
    type bupc_atomicX_fetchxor_private(void *ptr, type op);
    type bupc_atomicX_fetchnot_private(void *ptr);
Support for additional data types (e.g. short, char and floating point) and operations is expected to appear in future releases.
NOTICE: See the end of this section for information on new standardized interfaces which will replace these Berkeley-specific ones in a future release.
    void bupc_all_free(shared void *ptr);
    void bupc_all_lock_free(upc_lock_t *lockptr);
These two functions implement collective alternatives to the standard functions upc_free() and upc_lock_free(), as a convenience to the programmer. Both functions must be called collectively by all threads with the same argument. The object referenced by the argument is guaranteed to remain valid until all threads have entered the collective deallocation call, but the function does not otherwise guarantee any synchronization or strict reference. In all other respects the semantics of these functions and constraints on their usage are identical to their non-collective variants.
NOTICE: The following are slated for inclusion in the upcoming 1.3 revision of the UPC specification, with the same semantics as described above, and are available in Berkeley UPC beginning with the 2.16 release. Users are strongly encouraged to develop new codes using the standardized interfaces and to migrate existing code to them. Once revision 1.3 of the UPC spec is final, the Berkeley-specific versions of these interfaces will become deprecated.
    void upc_all_free(shared void *ptr);
    void upc_all_lock_free(upc_lock_t *lockptr);
    The value of errno is zero at program startup, but is never set to zero by any library function. The value of errno may be set to nonzero by a library function call whether or not there is an error, provided the use of errno is not documented in the description of the function in this International Standard.

These semantics are actually somewhat weaker than one might hope. Specifically, they allow library calls which succeed to change errno to a non-zero value. In practice many C/POSIX library implementations actually do this.
The problem in the context of Berkeley UPC and its source-to-source translation is that there is one copy of errno per UPC thread which is shared by both the generated code representing translated UPC code, and all the runtime libraries running underneath it (including UPCR, GASNet, vendor network libs, etc.). Furthermore, many actions in UPC which do not qualify as library calls at UPC level (e.g. dereferencing a pointer-to-shared) result in library calls within the generated code. Consequently, the value of errno set by a failed library call invoked at the UPC source level may be subsequently overwritten by any of these implicit library calls.
While one could imagine the Berkeley UPC compiler and runtime taking action to preserve the value of errno across all the implicit library calls, doing so would adversely affect performance and we do not currently take this approach. This means that a UPC user who wants to inspect the value of errno after a failed library call they make must do so immediately - not just before the next UPC-level library call, but also before taking any action that might possibly invoke implicit library calls in the generated source code.
Basically, the only 100% safe way for a UPC program to read errno when using Berkeley UPC is to copy it into a local variable immediately after the failed library call returns. This is the "recommended practice" for using errno with Berkeley UPC.
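The recommended practice above amounts to reading errno into a local variable on the very next line after the call. A minimal sketch in plain C follows; the helper name open_capturing_errno is hypothetical, and the example assumes a POSIX system, where fopen() sets errno on failure.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: try to open a file for reading and, on failure,
 * capture errno into *saved_errno before anything else can overwrite it.
 * Returns the FILE pointer, or NULL on failure. On success *saved_errno
 * is set to 0. Assumes fopen() sets errno on failure (true on POSIX). */
FILE *open_capturing_errno(const char *path, int *saved_errno)
{
    errno = 0;
    FILE *fp = fopen(path, "r");
    int err = errno; /* copy immediately after the call returns */
    *saved_errno = (fp == NULL) ? err : 0;
    return fp;
}
```

The caller can then perform any UPC operations it likes, including shared accesses, before inspecting the saved value, for instance by passing it to strerror().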
    #define NDEBUG
    #include <assert.h>

will not work as expected if the NDEBUG definition modifies the behavior of assert.h (which, in this example, it does: this NDEBUG/assert.h case is the most common case where users run into this issue with our compiler).
There is a simple workaround: if you need to define a macro that affects the behavior of #included files, define it on the command line to upcc:
    upcc -DNDEBUG myprogram.upc
Berkeley UPC guarantees that 'getenv()' allows retrieval of certain environment variable values that were present when the job was launched. At present this function is only guaranteed to retrieve these values for all threads if the environment variable's name begins with 'UPC_' or 'GASNET_'. On some platforms all environment variables seen by the job launcher may be propagated, but it is not portable to rely on this.
The 'setenv()' and 'unsetenv()' functions are not guaranteed to work in a Berkeley UPC runtime environment, and should be avoided.
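One way to stay within the portable subset described above is to route application settings through a small wrapper that only consults variables with a guaranteed prefix. The sketch below is plain C; env_name_is_portable and portable_getenv are hypothetical helpers, not part of Berkeley UPC.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `name` starts with one of the prefixes
 * that are guaranteed to be visible to all threads, else 0. */
int env_name_is_portable(const char *name)
{
    return strncmp(name, "UPC_", 4) == 0 ||
           strncmp(name, "GASNET_", 7) == 0;
}

/* Hypothetical helper: look up a setting, returning `fallback` when the
 * variable is unset or its name does not carry a guaranteed prefix. */
const char *portable_getenv(const char *name, const char *fallback)
{
    if (!env_name_is_portable(name))
        return fallback;
    const char *value = getenv(name);
    return value ? value : fallback;
}
```

For example, an application might read a setting with portable_getenv("UPC_MYAPP_VERBOSE", "0"), where the variable name is made up for illustration. Refusing non-prefixed names outright is a deliberately strict choice that surfaces non-portable lookups early.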
If you still believe you are encountering this issue, there are several recommended workarounds:
A future version of BUPC will include a restructuring of the way shared-local accesses are performed at the C code level. This restructuring is motivated by performance concerns, but we expect that as a side-effect it will also work around this gcc optimizer bug.
    UPC Runtime error: pthread_create: Invalid argument

Users encountering this error are advised to work around it by either using the BUPC translator (which does not exhibit the problem), or reworking their program to use less statically-allocated private data.
    upcc -v -Wl,-bmaxdata:0x80000000 foo.c       # 'large'
    upcc -v -Wl,-bmaxdata:0x80000000/dsa foo.c   # 'very large'
Users wishing to know more details about this issue are invited to search for AIX "large program support" on Google.
In general, for performance reasons (such as a faster shared pointer implementation), Berkeley UPC users are encouraged to build 64-bit UPC programs on IBM SP systems, in which case these memory segment limits are not an issue. Instructions on how to build a 64-bit Berkeley UPC compiler on the IBM SP are available in the INSTALL.TXT document that comes with the distribution. To see if the copy of upcc you're running is 32/64 bit, check the 'Binary interface' field in the output of 'upcc -version'.
Thank you for using Berkeley UPC!