Cross-Iteration Dependencies
The static approach to automatic (loop) parallelization
has been to prove at compile time that executing the iterations of a loop
in parallel always gives the same result as the sequential execution
of that loop.
This invariant is typically modeled via dependence analysis:
the compiler attempts to resolve, or to prove the absence of,
dependencies that are carried across iterations.
Dependencies are of three kinds:
- true/flow dependency, a.k.a. read after write (RAW):
a memory location is written in an iteration i and read
in a following iteration j, i.e., i < j.
True dependencies correspond to a producer-consumer relation
between iterations, and eliminating them typically requires
algorithmic changes that are beyond the ability of the compiler.
- anti dependency, a.k.a. write after read (WAR): the (same) memory location is read in an iteration i and is updated in a following iteration j, i.e., i < j.
- output dependency, a.k.a. write after write (WAW): the same memory location is written in two different iterations.
a(1) = 0
DO i = 2, M
  a(i) = a(i-1) + i
  b(i) = ... b(i+1) ...
  DO j = 1, N
    c(j) = ...
  ENDDO
  ...
  DO j = 1, N
    d(j,i) = ... c(j) ...
  ENDDO
ENDDO
Consider the loop above:
- the access pattern of array a produces cross-iteration
true dependencies because, for example, iteration
i=2 writes a(2) and iteration i=3 reads a(2).
The computation of a can nevertheless be parallelized because it has
the semantics of a prefix-sum computation with the associative operator +,
i.e., scan(op +, 0, [2..i]).
This is however an algorithmic change that a compiler
will typically not be able to perform in a reliable fashion.
- the access pattern of array b produces cross-iteration
anti dependencies because, for example, iteration
i=2 reads b(3) and iteration i=3
writes b(3). If executed out of order,
iteration 2 might consume a value produced
``in the future'' by iteration 3, which violates the
sequential semantics. This dependency can be resolved via
renaming: copy array b to b_copy
and replace the reads from b with reads from b_copy.
- the access pattern of array c produces cross-iteration
output dependencies, because every iteration of the
outer loop writes (in the first inner loop) the first
N elements of array c, i.e., c[1:N].
Furthermore, the second inner loop reads c[1:N] in
every iteration of the outermost loop, which gives rise to all
kinds of dependencies. At first view it seems difficult to
resolve the many dependencies on c, but this can
be achieved relatively simply via a compiler transformation called
privatization: all the reads of an outermost iteration
are covered by writes in the same outermost iteration, so
making each iteration operate on its own private copy
of c preserves the sequential semantics and
resolves all dependencies on c. Since all iterations
update the same subset of indices in c, the last
iteration also updates the global storage, i.e., static last value
(a sketch of this transformation follows).
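For concreteness, here is a minimal C++ sketch of the privatized loop (illustrative only: the Fortran arrays become zero-based C++ vectors, and the elided right-hand sides are stubbed):

#include <vector>

// Privatization with static last value: each outer iteration i works on
// its own private copy of c, so all cross-iteration dependencies on c
// disappear; the sequentially-last iteration publishes its copy.
void outer_loop_privatized(int M, int N,
                           std::vector<float>& c_global,          // global c, size N+1
                           std::vector<std::vector<float>>& d) {  // d, (N+1) x (M+1)
    // The iterations below are now independent w.r.t. c and could be
    // distributed across threads, e.g., via "#pragma omp parallel for".
    for (int i = 2; i <= M; i++) {
        std::vector<float> c(N + 1);      // private copy of c
        for (int j = 1; j <= N; j++)
            c[j] = 0.0f;                  // stands for "c(j) = ..."
        for (int j = 1; j <= N; j++)
            d[j][i] = c[j];               // stands for "d(j,i) = ... c(j) ..."
        if (i == M)
            c_global = c;                 // static last value
    }
}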
If a loop exhibits no cross-iteration dependencies (or all cross-iteration
dependencies can be fixed), then the loop is called "independent".
Thread-Level Speculation (TLS) Motivation
In practice, many factors significantly hinder the ability
of static analysis to prove loop independence: symbolic constants, complex
control flow, quadratic or indirect-array indexing, object/pointer aliasing, etc.
// X   -> indirect array of size N with elements in {0..N-1}
// mat -> an NxM matrix, i.e., 2-dimensional (full) array of size N*M
void computeArr2D(int* X, float* mat) {
  for (int i = 0; i < N; i++) {
    float* row = &mat[ X[i]*M ];
    float sum = 0.0;
    for (int j = 0; j < M; j++) {
      if (j < M-32) {
        sum += (row[j] + row[j + 32]) / (row[j + 32] + 1.0);
      } else {
        sum += (row[j] + row[j - 32]) / (row[j - 32] + 1.0);
      }
    }
    row[0] += sum;
  }
}
The contrived code example above uses the indirect array X, which
contains integral elements in {0..N-1}, to select a row of an NxM matrix,
denoted mat, and updates the first element of that row.
The first-element update depends only on the elements of that (same) row
and is implemented by the inner loop. The content of the indirect array
X, which we assume statically unknown, dictates the
pattern of cross-iteration dependencies of the outer loop:
- if X is a permutation of {0..N-1}, then no cross-iteration
dependencies exist;
- if X contains only one or two distinct values, then frequent
cross-iteration dependencies will manifest during execution,
because multiple iterations will work on the same row;
- if X contains random numbers in {0..N-1}, then only occasional
cross-iteration dependencies exist.
Safe parallel execution of the loop above can be guarded by sufficient
conditions that guarantee the absence of cross-iteration dependencies,
for example that the elements of X are:
- strictly monotonic, or
- distinct, i.e., X is a permutation of {0,..,N-1}.
While compiler approaches for extracting such
sufficient-independence predicates
exist [7,8], they have not been adopted in the repertoire of commercial compilers.
(For example, one can verify in parallel that the values of X
are strictly increasing, and if so the outer loop can be safely parallelized;
a sketch of such a check is shown below.)
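A minimal sketch of such a runtime check (illustrative; the function name and the use of OpenMP are our assumptions, not taken from the cited work):

#include <atomic>

// Sufficient-independence test: if X is strictly increasing then every
// outer iteration touches a distinct row of mat, so the outer loop is
// independent. The check itself is embarrassingly parallel.
bool strictlyIncreasing(const int* X, int N) {
    std::atomic<bool> ok(true);
    #pragma omp parallel for
    for (int i = 1; i < N; i++)
        if (X[i] <= X[i - 1])
            ok.store(false, std::memory_order_relaxed);
    return ok.load();
}

// Usage: if (strictlyIncreasing(X, N)) { /* run loop in parallel */ }
//        else                          { /* run sequentially (or under TLS) */ }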
However, if the elements of X are random in {0..N-1}, then
the outer loop will exhibit a significant amount of parallelism, albeit the
occasional cross-iteration dependency violation may still occur.
This example suggests that there are cases in which (static) compiler analysis
is too conservative and a more dynamic treatment of dependencies could be
beneficial, i.e., exploiting ``partial'' parallelism.
Thread-Level Speculation (TLS) Introduction
Thread-Level Speculation (TLS) is a parallelization technique
that allows loop iterations to be executed (optimistically) in parallel even
in the presence of cross-iteration dependencies,
which are tracked and fixed at runtime.
Rauchwerger and Padua pioneered the TLS domain in the seminal
work that proposed the LRPD test [1].
TLS exploits code regions that dynamically expose a good amount of
parallelism but for which static analysis fails to guarantee safety.
Under TLS, threads execute out of order (but are well-localized)
and use software/hardware structures, referred to as "speculative storage",
to record the information necessary to track inter-thread dependencies and
to revert to a "safe point" and restart the computation upon the occurrence
of a dependency violation (a.k.a. "rollback recovery").
Hardware TLS solutions use an extended cache-coherency protocol to support
dependency tracking and the rollback mechanism. They are typically more effective
than software solutions, but (i) their limited speculative storage may restrict
their ability to exploit coarser levels of parallelism, and (ii) they are
not mainstream.
With software TLS, accesses to variables that cannot be statically disambiguated -- i.e.,
that may violate the sequential dependencies -- are replaced with calls to functions
that simulate the cache-coherence protocol of hardware TLS solutions.
For example, the assignment X = Y is replaced with
specST(&X, specLD(&Y, i), i), where:
- &X denotes the address of variable X,
- specST(&X, val, i) updates variable X to the value val
if this does not cause a dependency violation,
and otherwise throws an exception,
- specLD(&Y, i) returns the (possibly speculative) value of Y, and
- i denotes the iteration in which the update/read takes
place and is used for bookkeeping.
Software-TLS implementations typically isolate the speculative state from
the global state: each thread buffers its write accesses and commits them
only when speculation up to that thread is guaranteed to have succeeded.
It follows that WAR and WAW dependencies are implicitly enforced.
When does the commit take place?
The thread executing the lowest-numbered iteration is known as the "master"
thread. The "master" thread encapsulates both the correct sequential state
and control flow. It follows that a thread waits to become "master", at which
point it is safe to commit its speculative updates to global/shared memory.
Since the master thread always commits successfully, progress is guaranteed.
In addition, the "master" thread typically services the rollback procedure
when it discovers that a successor thread has violated a dependency.
The rollback procedure typically involves clearing the speculative metadata
and restarting all (successor) speculative threads.
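The following C++ sketch illustrates this serial-commit discipline (all names are hypothetical, i.e., they do not belong to any of the libraries discussed here):

#include <atomic>
#include <thread>
#include <utility>
#include <vector>

struct WriteBuffer {                           // hypothetical per-thread buffer
    std::vector<std::pair<float*, float>> pending;
    void flushToGlobal() {                     // publish buffered writes in order
        for (auto& [addr, val] : pending) *addr = val;
        pending.clear();
    }
};

std::atomic<int> masterIt{0};                  // lowest uncommitted iteration

// A thread that finished iteration myIt waits to become "master"; only
// then is it safe to commit its speculative updates to global memory.
void commitWhenMaster(int myIt, WriteBuffer& buf) {
    while (masterIt.load(std::memory_order_acquire) != myIt)
        std::this_thread::yield();             // predecessors still in flight
    buf.flushToGlobal();                       // no predecessor can conflict now
    masterIt.store(myIt + 1, std::memory_order_release);
}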
Naive Software-TLS Implementation
For explanation purposes we use a very simplified/naive
variant of Rundberg and Stenstrom's
all-software TLS implementation [2].
First, reads and writes to data structures (arrays) that may cause dependency violations
under parallel execution are protected with speculative support, i.e., a read/write from
array a is rewritten via specLD/specST calls:
DO i = 1, N                      DO i = 1, N
  a(b(i)) = a(c(i))      <->       val = specLD(c(i), i)
ENDDO                              specST(b(i), val, i)
                                 ENDDO
WORD specLD(int ind, int itNr) {
  LdVct[ind][itNr] = 1;
  int i = highest index marked in StVct[ind] that is <= itNr;
  if (i >= 0) return ShVct[ind][i];   // forward the latest speculative value
  else        return orig;            // read from the original (global) array
}

void specST(int ind, WORD val, int itNr) {
  ShVct[ind][itNr] = val;             // buffer the update in shadow storage
  StVct[ind][itNr] = 1;
  if (exists i > itNr with LdVct[ind][i] == 1)
    Mark_Dep_Exc(i);                  // a successor already read => RAW violation
}
Here the original array a has been replaced with the following data structure:
each element of the original a corresponds to a load vector (LdVct),
a store vector (StVct), and a shadow vector (ShVct) of size N,
where N denotes the number of loop iterations.
A speculative read (specLD) from a(ind):
- records that the current iteration (itNr) has read from
position ind, i.e., LdVct[ind][itNr] = 1, and
- searches the store vector to find the highest iteration lower than
or equal to itself that has written to a(ind);
- if such an iteration exists then the corresponding value is returned,
otherwise the data is read from global memory (orig).
A speculative write (specST) to a(ind):
- records the to-be-written value in private (shadow) storage,
i.e., ShVct[ind][itNr] = val, then
- records the information that the current iteration has written to
a(ind), i.e., StVct[ind][itNr] = 1, and finally
- inspects the load vector (LdVct[ind]) to check that no successor
thread (i.e., one executing an iteration greater than itNr) has read
from a(ind). If the latter does not hold, a successor
iteration has read an incorrect value and a RAW dependency has been violated => ROLLBACK.
Upon rollback, when the current thread becomes master, it commits its updates to
non-speculative storage (i.e., the orig field) and clears/resets the
speculative metadata of all successor threads.
Note that if the current thread does not become master, then there must be a
predecessor thread that becomes master and services the rollback.
The naive implementation above exhibits a large (speculative) memory overhead:
O(N^2) memory is used to speculate over an array of size N. The typical optimization is to
have only a window of iterations executed concurrently (a small multiple of P, the number of processors).
This reduces the memory overhead to O(P*N) and the complexity of specLD/specST to O(P), since
a speculative load (store) needs to check at most O(P) entries of StVct (LdVct) to
find the to-be-returned value (or to identify a potential RAW dependency violation).
Such an implementation has the advantage of being accurate, in that it exhibits no
false-positive violations, and WAW and WAR dependencies are inherently enforced, i.e., they are
never the source of dependency violations/rollbacks.
There are however several significant shortcomings:
First, the significant memory overhead,
proportional to P*N, where N is the array size and P is the number of processors,
may pressure the memory hierarchy and degrade performance.
Second, the commit phase is serial, since a thread needs to
become master in order to commit its updates to non-speculative storage;
the consequence is that speedup will not scale beyond a fixed number of processors.
Third, at least the speculative store operation has worst-case complexity
O(P), which means, for our example, that we have O(N*P) parallel work and O(N*P/P) = O(N) depth,
i.e., asymptotically we do not improve over sequential execution.
Lightweight Software-TLS Implementation
Our work in this context has taken the perspective that, rather than employing
one over-arching (TLS) implementation for all loops, one can
exploit the flexibility of software TLS to engineer a
family of lightweight (TLS) implementations that
- can be composed to parallelize the target loop, and
- can be tuned to take advantage of
specific access patterns of the target loop.
For example, a loop in which the elements of an array are mostly read, e.g., written
only in case of error, can be effectively protected by the simple
TLS implementation in which specST always signals that a dependency
violation has occurred and specLD simply returns the corresponding
value from global memory. Note that when speculation succeeds,
i.e., no write accesses occur at runtime, this implementation exhibits zero
memory-space and time overhead, as no speculative storage is required and no
extra operations are performed. We name this implementation SpecRO; a minimal sketch follows.
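The whole model fits in a few lines; the sketch below is illustrative (a simplified, hypothetical interface, not the library's actual code):

#include <stdexcept>

struct Dep_Exc : std::runtime_error {          // dependency-violation signal
    explicit Dep_Exc(int it) : std::runtime_error("violation"), iter(it) {}
    int iter;
};

// SpecRO: reads go straight to global memory (zero overhead); any
// dynamic write contradicts the read-only assumption => rollback.
template <typename T>
struct SpecRO {
    T specLD(const volatile T* addr, int /*itNr*/) const {
        return *addr;                          // no speculative bookkeeping
    }
    void specST(volatile T*, T, int itNr) const {
        throw Dep_Exc(itNr - 1);               // always signals a violation
    }
};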
Similarly, as with the loop below, it might happen that a statically
unanalysable array exhibits a regular access pattern at runtime, e.g.,
when `b(i)' and `c(i)' are the identity function in the loop below.
In such cases, maintaining speculative storage
proportional to N*P certainly feels excessive.
For example, if thread `p ∈ {0,..,P-1}' executes the iterations in
'{p, p+P, p+2*P, ..}', then one can map the indexes of `a' into `P'
equivalence classes, e.g., equivalence class p contains the indexes in
'{p, p+P, p+2*P, ..}'. This suggests that a speculative memory
of size proportional to (only) `P', rather than `N*P', suffices
to efficiently parallelize the targeted loop.
DO i = 1, N                 b(i) = i      DO i = 1, N
  a(b(i)) = ...           ------------>     a(i) = ...
  ... = ... a(c(i)) ...     c(i) = i        ... = ... a(i) ...
ENDDO                                     ENDDO
In essence, while a lot of (important) previous work has looked at how to optimize
the original code/loop for a given TLS implementation, we took the orthogonal
perspective of exploring the equally important direction of tuning the TLS
implementation to take advantage of the access patterns exhibited by the targeted loop.
The immediate question is: how does one design a tunable TLS implementation?
The simple answer is that we introduce a ``hash'' function that splits the
global (non-speculative) memory space into equivalence classes, and the TLS
implementation uniformly handles these equivalence classes rather than
individual memory locations, i.e., a read/write of a memory location is
conservatively interpreted as a read/write of each memory location in the
corresponding equivalence class.
It follows that our implementation trades the (hopefully) occasional false-positive
violation for a small speculative storage exhibiting a memory-friendly cache layout.
The idea is that the latter properties allow the speculative load/store operations
to execute at a speed close to that of arithmetic operations, which makes the TLS
time overhead relatively small in comparison with the "original" code, which typically
"suffers" negative memory-hierarchy effects, such as cache misses, etc.
We demonstrate this perspective by discussing such a lightweight, in-place
TLS implementation, named SpLIP; the full details can be found in the
SPAA'09 paper [3].
The speculative-storage layout is organized as follows:
- The ``hash'' function maps global memory locations to indexes of the
speculative-memory load and store vectors.
- To fix the asymptotic time behavior, we maintain only one
element in the load/store vectors per equivalence
class, rather than P as in previous work.
- Each thread operates directly on the global memory locations,
i.e., in-place updates, but prior to an update it stores
the held value in thread-private storage ShBuff[p].
- W denotes the maximal number of per-loop-iteration
writes, and ShBuff[p] can be implemented as a
vector (of dynamic size).
atomic WORD specLD( volatile WORD* addr, int itNr ) {
  int i = hash(addr);
  if( LdVct[i] < itNr )
    LdVct[i] = itNr;                 // record the highest reading iteration
  WORD val = *addr;
  if( StVct[i] <= itNr )
    return val;
  else
    throw Dep_Exc(itNr-1);           // read of a "future" write => violation
}

atomic void specST( volatile WORD* addr, WORD new_val, int itNr ) {
  int i = hash(addr);
  if( StVct[i] > itNr )
    throw Dep_Exc(itNr-1);           // writes out of program order => WAW violation
  StVct[i] = itNr;                   // record the highest writing iteration
  save( addr, *addr, StampVct[i]++ );  // back up the old value in ShBuff
  *addr = new_val;                   // in-place update of global memory
  WORD load = LdVct[i];
  if( load > itNr )
    throw Dep_Exc(itNr-1);           // a successor already read => RAW violation
}
The load (specLD) operation receives as parameters the original
to-be-read memory location, i.e., addr, and the iteration that
is performing the read operation, i.e., itNr.
The store (specST) operation receives in addition the value new_val
that the memory location addr is to be updated to.
The intuition behind the implementation is:
- For each equivalence class i, LdVct[i]/StVct[i]
holds the maximal iteration number that has read/written an address
belonging to equivalence class i, i.e., with hash(addr) = i. It follows that:
- For a speculative-store operation: if the value inscribed in
StVct[i] is greater than the current iteration performing the write (itNr),
then two write operations have not occurred in program
order, i.e., a WAW dependency violation has been discovered.
- For a speculative-store operation: if the value inscribed in
LdVct[i] is greater than the current iteration performing the write (itNr),
then a successor iteration has read an incorrect value, i.e., the successor
iteration should have read the value produced by the current iteration,
hence a RAW dependency violation has been discovered.
- Otherwise, the value held in addr is saved in the thread's private
storage (ShBuff), the target address is updated to hold
the new value new_val, and specST succeeds.
- If a RAW, WAW, or WAR violation is discovered, then the master thread will
service a rollback procedure, in which the updates of all threads
(up to and including the master) are rolled back based on the
ShBuff information, and speculation is restarted.
Principal observations are:
- The memory-space overhead is proportional to the cardinality of
the ``hash'' function. Under regular accesses, the memory overhead
can be as small as P. For example, with the loop above,
hash(a) = a mod P, of cardinality P,
would not introduce any false-positive dependencies
if iterations {p, p+P, p+2*P, ...} are executed on processor `p', and
would provide ideal cache behavior if the elements of LdVct/StVct
do not share cache lines (see the padding sketch after this list).
- The time overhead of specLD/specST is constant, i.e., it does not depend on P.
- While, for simplicity, the pseudo-code requires the specLD/specST operations to
execute atomically, the full paper presents an implementation that, for X86
processors, is both lock free, i.e., uses no CAS instructions, and
sequentially consistent, i.e., needs no memory fences.
- Obviously, the in-place implementation does not exhibit the
non-scalable behavior observed sometimes for implementations that
use a serial-commit phase.
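The cache-line remark above can be addressed directly in the metadata layout; a hedged sketch (illustrative names and sizes, not the library's code):

// Pad each load/store-vector entry to a full cache line so that distinct
// equivalence classes never share a line (avoids false sharing among the
// processors that update them).
constexpr unsigned P = 8;             // number of processors (example value)

struct alignas(64) PaddedEntry {      // 64 bytes = common x86 cache line
    volatile int it;                  // highest iteration recorded so far
};

PaddedEntry LdVct[P];                 // one entry per equivalence class
PaddedEntry StVct[P];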
In essence, the above in-place implementation enables more scalable behavior
(in the number of processors that contribute to speedup) and features constant-time
load/store operations (i.e., overhead not proportional to 'P'), but all RAW, WAR, and WAW
dependencies may cause rollbacks. Furthermore, the use of the 'hash' function trades
potential false-positive violations for a small speculative storage with a friendly cache
layout. For example, even under non-regular accesses, e.g., Barnes-Hut, reducing the
cardinality of the hash function from 2^{21} to 2^{14} increases the speedup significantly,
from 2.84x to 5.29x on eight processors, where the optimal speedup,
i.e., without TLS support, is 7.38x.
Ideas about how to implement a dynamic analysis that computes an effective
hash function are reported in [5].
We have observed that the in-place implementation is not suitable for all
cases. For example, if the loop exhibits WAW dependencies, then a serial-commit
model is more useful, due to its ability to naturally handle the WAW dependencies.
If the array is only occasionally written, then the SpecRO implementation
is likely the most efficient one.
Moreover, one array used in the loop may be best protected under one TLS implementation,
e.g., SpLIP or SpecRO, while another array may be best protected under a serial-commit
implementation, e.g., SpLSC. The next section demonstrates the use
of the PolyLibTLS library, which is designed to enable programs to be parallelized via
such a compositional use of TLS-implementation instances.
Using the PolyLibTLS Library to Parallelize the Running Example
The implementation of the library, together with some applications parallelized
with it, can be found at
git@github.com:coancea/PolyLibTLS.git; the corresponding papers
are [3-6] in the Reference Section.
PolyLibTLS makes heavy use of template meta-programming to allow a
reasonably convenient way of writing TLS code without compromising
performance. The implementation of the library is available in the folder
'PolyLibTLS/LibTLS', and a demonstrative example, explained in this section,
is located in the folder 'PolyLibTLS/Demo'.
This section shows how to apply thread-level speculation to our motivating
example. We have already discussed its (dataset-sensitive, cross-iteration)
dependency patterns in a previous section.
// X   -> indirect array of size N with elements in {0..N-1}
// mat -> an NxM matrix, i.e., 2-dimensional (full) array of size N*M
for (int i = 0; i < N; i++) {
  float* row = &mat[ X[i]*M ];
  float sum = 0.0;
  for (int j = 0; j < M; j++) {
    if (j < M-32) {
      sum += (row[j] + row[j + 32]) / (row[j + 32] + 1.0);
    } else {
      sum += (row[j] + row[j - 32]) / (row[j - 32] + 1.0);
    }
  }
  row[0] += sum;
}
Remember that row corresponds to a row of matrix mat,
as dictated by the indirect array X;
hence mat and row are aliased.
We also assume that the compiler could not prove that X
and mat do not alias, and as such TLS needs to be
applied to the accesses of X as well.
Translating a loop for speculative execution requires four steps:
- The memory is partitioned according to the variables that
are used, and dependencies on addresses in each partition are
tracked via a TLS implementation that is tuned
to the access patterns of the corresponding variables.
- A speculative-thread manager is declared.
Its task is to coordinate the execution of threads and to implement
the rollback-recovery procedure.
- The programmer then defines a customized
speculative thread that executes
a set of consecutive iterations of the original loop,
where the problematic accesses are disambiguated via TLS.
- Finally, the original loop is replaced with glue code
that creates the speculative threads, registers them with
the thread manager, spawns them, and awaits the termination
of the speculative code.
Unified Speculation over a Set of Memory Partitions
The plan is to protect the accesses to mat (and row)
with an in-place TLS instance, i.e., SpLIP, and the accesses to X with
a read-only TLS instance, i.e., SpecRO.
To apply speculation (by hand) with PolyLibTLS, we start by constructing the
two TLS instances as below:
const unsigned ENTRY_SZ   = 1;
const unsigned LDV_SZ     = logN;
const unsigned SHIFT_FACT = logM;

// 1. HashFunction: hash(a) = ( (a-mat) DIV (2^(SHIFT_FACT+2)) ) MOD (2^LDV_SZ)
typedef HashSeqPow2 Hash;
Hash hash(LDV_SZ, SHIFT_FACT);

// 2. SpLIP protects `mat' and `row'
typedef SpMod_IPcore<Hash> IPm_core;
IPm_core m_core(hash);
typedef SpMod_IP<Hash, &m_core, float> IPm;
IPm ip_m(ENTRY_SZ);
ip_m.setIntervalAddr(mat, mat+N*M);

// 3. SpecRO protects all the other addresses
typedef SpMod_ReadOnly<int> SpRO;
SpRO ro_m;

// 4. Construct the Unified Speculative Memory
typedef USpModel< IPm, &ip_m, IP_ATTR,
                  USpModel<SpRO, &ro_m, 0, USpModelFP> > UMeg;
UMeg umm;
- The idea is to put all addresses corresponding to one row
into the same equivalence class, because an outer iteration
accesses elements of a single row. It follows that a
speculative read/write of any element of the
row uses the same entry of the speculative storage, e.g.,
of the load/store/shadow vectors. The cardinality
of the ``hash'' function is N, i.e., the number of rows:
hash : [mat, mat+N*M) -> [0,N-1],
hash(a) = ( (a-mat) div (4*M) ) mod N
where `div' and `mod' stand for integer division and modulo,
respectively, mat corresponds to the start
address of the matrix, and the factor 4 is the size of a float in bytes.
Class HashSeqPow2 uses fast power-of-two arithmetic, hence the use of
logarithms in the code.
- We use the previous step to build a SpLIP instance that is
tuned to the access patterns of mat:
- IPm_core exports the basic implementation
of the speculative load and store operations, and its
instance m_core uses the hash function to
cluster accesses;
- IPm extends the core functionality
with support for private buffering and rollback recovery,
e.g., ENTRY_SZ=1 corresponds to the size of
the thread-private write buffer, which is an overestimate
of the number of updates performed by an iteration;
- finally, ip_m.setIntervalAddr(...) sets the
lower and upper bounds of the global-memory
partition protected by the current TLS instance.
- Similarly, an instance of the SpecRO model, named ro_m,
is created. The intent is that it will protect any address that
does not fall in the address range of the matrix mat.
- Finally, a ``unified-speculative instance'' is built, i.e., UMeg.
Combining the TLS instances is achieved by creating a chain
of USpModel instantiations, in which USpModelFP
denotes the end of the chain, and the innermost TLS instance preceding it
is the default one. By ``default'' we mean that it protects the memory
space that was not covered by the previous partitions.
- If a 'specLD/ST' is invoked directly on a TLS instance and the address is not within
the corresponding memory-partition bounds, then it causes a rollback.
- If a 'specLD/ST' is invoked on the "unified" instance 'umm', then
the memory partitions are checked in the order in which they have been
defined in 'USpModel' until the right one is found, where the last
instance is the default one, i.e., it covers the rest of the memory.
Important Note: the library implementation assumes
that all memory partitions are disjoint, i.e., the intervals
of addresses they protect do not overlap. This is the
responsibility of the programmer!
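Since the library does not verify this contract, a small debug helper (hypothetical, not part of PolyLibTLS) can catch overlap mistakes early:

#include <algorithm>
#include <cassert>
#include <vector>

struct Interval { const char* lo; const char* hi; };   // protects [lo, hi)

// Sort the declared partitions by start address and assert that each one
// ends before the next begins; call once, after all setIntervalAddr calls.
void assertDisjoint(std::vector<Interval> parts) {
    std::sort(parts.begin(), parts.end(),
              [](const Interval& a, const Interval& b) { return a.lo < b.lo; });
    for (size_t k = 1; k < parts.size(); k++)
        assert(parts[k-1].hi <= parts[k].lo && "memory partitions overlap");
}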
Thread Manager
The next step is to declare the thread manager, which coordinates the
execution of the speculative (and master) threads and services the
rollback-recovery procedure.
In the code below, this is achieved by instantiating the abstract class
Master_AbstractTM on several (type) parameters:
- The first type parameter is the ``self'' (currently declared) type,
i.e., ThManager, and allows specializing (if needed) the default
implementation of Master_AbstractTM.
- The second and third parameters are the type and instance of the
unified-speculative-memory partitions, i.e., umm of type UMeg.
They are needed for the implementation of the
rollback-recovery procedure.
- The fourth parameter is the number of speculative threads
that are used for loop parallelization; the fifth
parameter is not currently used, hence 0.
- The last type parameter provides additional instructions about
how to parallelize the code. For example,
NormalizedChunks<ThManager, UNROLL>
means: stripmine the original loop by an UNROLL
factor, where the resulting big iterations are dynamically
scheduled for execution on the speculative threads.
In our case UNROLL is 1, because the outer iteration
has enough granularity, i.e., it executes an inner loop of count M.
const int NUM_THREADS = 6;
const int UNROLL      = 1;

class ThManager : public Master_AbstractTM< ThManager, UMeg, &umm, NUM_THREADS, 0,
                                            NormalizedChunks<ThManager, UNROLL> > {
  public:
    inline ThManager() {
        Master_AbstractTM< ThManager, UMeg, &umm, NUM_THREADS, 0,
                           NormalizedChunks<ThManager, UNROLL> >::init();
    }
};
ThManager ttmm;
Speculative Thread Interface
The next step is to implement the speculative-thread class that models the
execution of a speculative iteration. With the code below, this comes
down to implementing the class Thread_Max, which extends the
speculative-thread skeleton SpecMasterThread. The latter
receives all the information necessary to establish a consistent use of
the thread instance, i.e., the self class Thread_Max,
the thread manager ThManager, and the
unified speculative-memory partitions UMeg.
The following skeleton functions require user implementation:
- the constructor Thread_Max
(which should also be hidden from direct use by the programmer);
- initialization of the variables used inside the loop,
e.g., initVars initializes j to 0,
where j denotes the outermost loop index;
- the test for the end-of-loop condition, i.e.,
end_condition returns true if the outermost loop index j
becomes greater than or equal to the loop count N;
- updateInductionVars computes the outermost loop index
this->j corresponding to the start of a strip-mined
(unrolled) iteration this->id. Note that the code
is a bit awkward because, for obscure reasons, the numbering of
big iterations starts from NUM_THREADS rather than
from 0, and as such one has to take the difference
id - ttmm.firstID() when computing j;
- iteration_body implements the body of the original
loop, where the ``normal'' loads and stores from/to global memory
have been replaced with calls to the high-level functions
specLD/specST that implement TLS's dependency tracking;
- finally, non_spec_iteration_body is used by the thread
manager when servicing a rollback, and it is supposed to be
identical to iteration_body except that
it uses non-speculative loads and stores, i.e., direct accesses
to global memory.
class Thread_Max : public SpecMasterThread< Thread_Max, UMeg, ThManager > {
  private:
    typedef SpecMasterThread< Thread_Max, UMeg, ThManager > MASTER_TH;
    unsigned long j;
  public:
    inline Thread_Max(const unsigned long it, unsigned long* dummy) : MASTER_TH(UNROLL,it,dummy) { }
    inline void initVars () { this->j = 0; }
    inline int  end_condition() const { return (this->j >= N); }
    inline void updateInductionVars() { this->j = (this->id - ttmm.firstID()) * UNROLL; }
    inline int  iteration_body() {
        unsigned row_nr = umm.specLDslow(&X[this->j], this);
        float* row = &mat[row_nr*M];
        float sum = 0.0;
        for(int i=0; i<M; i++) {
            sum += (i < M-32) ?
                ( this->specLD<IPm,&ip_m>(&row[i]) + this->specLD<IPm,&ip_m>(&row[i + 32]) ) /
                ( this->specLD<IPm,&ip_m>(&row[i + 32]) + 1.0 ) :
                ( this->specLD<IPm,&ip_m>(&row[i]) + this->specLD<IPm,&ip_m>(&row[i - 32]) ) /
                ( this->specLD<IPm,&ip_m>(&row[i - 32]) + 1.0 );
        }
        // row[0] += sum, expressed with a speculative load and store:
        this->specST<IPm,&ip_m>(&row[0], this->specLD<IPm,&ip_m>(&row[0]) + sum);
        this->j++;
        return 0; // return value assumed unused by the skeleton
    }
    inline int  non_spec_iteration_body() {
        // Same as iteration_body, but with plain (non-speculative) accesses:
        float* row = &mat[ X[this->j]*M ];
        float sum = 0.0;
        for(int i=0; i<M; i++) {
            sum += (i < M-32) ?
                (row[i] + row[i + 32]) / (row[i + 32] + 1.0) :
                (row[i] + row[i - 32]) / (row[i - 32] + 1.0);
        }
        row[0] += sum;
        this->j++;
        return 0;
    }
};
The code above shows that there are several ways in which a speculative
access can be written. For example:
- Reading X[j] is performed via a call to the unified
speculative-memory model umm, i.e.,
umm.specLDslow(&X[this->j], this).
This call checks the ranges of all memory partitions until
it finds one that fits, and services the speculative load
operation via the TLS instance associated with
that memory partition; in our case that is the SpecRO implementation.
- Read and write accesses to the matrix elements are performed
via specLD/specST operations that are implemented
at the speculative-thread level and are optimistically
dispatched to the TLS implementation indicated by the type
parameters of the call, e.g., specLD<IPm, &ip_m>(&row[i]).
This optimizes the dispatch time, but if the accessed address
does not belong to the specified partition, then a
false-positive dependency violation will be signaled.
Starting and Terminating Speculative Execution
Finally, to run the program in speculative mode, one has to:
- create the speculative threads, i.e.,
allocateThread<Thread_Max>(i, 64),
where i stands for the initial thread identifier and 64
denotes the padding size;
- register the speculative threads with the thread manager, i.e.,
ttmm.registerSpecThread(thr, i); and
- ask the thread manager to execute the speculative code,
i.e., ttmm.speculate<Thread_Max>();
for (int i = 0; i < NUM_THREADS; i++) {
    Thread_Max* thr = allocateThread<Thread_Max>(i, 64);
    thr->initVars();
    ttmm.registerSpecThread(thr, i);
}
gettimeofday(&start_time, NULL);
ttmm.speculate<Thread_Max>();
gettimeofday(&end_time, NULL);
running_time = DiffTime(&end_time, &start_time);
Performance Discussion
This section discusses several issues related to the
performance of PolyLibTLS. These are demonstrated
in the context of our running/motivating example,
which was compiled with g++ version 4.4.3 and
run using 6 speculative threads on a quad-core Intel system
with hyper-threading support, i.e., an
Intel(R) Core(TM) i7-2820QM CPU @ 2.30GHz.
The implementation of the speculative load/store operations
can be tuned via three macros:
- If MODEL_LOCK is defined, then specLD/specST
use compare-and-swap (CAS) instructions to implement load/store
atomicity (spinlock).
- If MODEL_FENCE is defined, then specLD/specST
use a CAS-free implementation, i.e., no locks, but
memory fences are required.
- If MACHINE_X86_64 is defined, then specLD/specST
use neither CAS nor memory fences, but may introduce the
occasional false-positive dependency. Note that correctness
is guaranteed only for X86 machines, which implement a strong
memory model that is close to sequential consistency.
When rollbacks are rare, e.g., when the elements of X are
random in {0..N-1}, the MACHINE_X86_64 version is about
4x faster than either the MODEL_LOCK or the MODEL_FENCE
version, but produces about 2 extra false-positive rollbacks.
The results demonstrate that using CAS locking or memory fences (mfence)
on the critical path of the TLS implementation carries significant,
if not prohibitive, overhead.
When the indirect array X contains only two distinct elements,
the speculative execution exhibits many rollbacks and the runtime is about
400x slower than the sequential execution (but the result is correct).
When the elements of X are random in {0..N-1},
the speculative execution exhibits the occasional (false-positive) rollback.
The best speculative execution in this case,
i.e., with #define MACHINE_X86_64, is only about 1.3x faster
than the original (sequential) runtime
(and about 4x faster on a 32-core machine).
However, the speculative code seems to scale well with the
hardware parallelism:
- the non-speculative sequential time is 1787946 ns,
- the speculative time on 1 thread is 6156013 ns, and
- the speculative time on 6 threads is 1368638 ns.
This shows reasonable scalability, i.e., about 4.5x faster
on 6 threads than on 1 thread, especially since the system has only
4 physical cores available (with hyper-threading).
References
[1] L. Rauchwerger and D. Padua.
"The LRPD Test: Speculative Run-Time Parallelization of
Loops with Privatization and Reduction Parallelization",
IEEE Trans. on Parallel and Distributed Systems, 10(2),
pp 160-199, Feb 1999.
[2] P. Rundberg and P. Stenstrom.
"An All-Software Thread-Level Data Dependence
Speculation System for Multiprocessors",
Journal of Instruction-Level Parallelism, 1999.
[3] Cosmin E. Oancea, Alan Mycroft and Tim Harris.
"A Lightweight In-Place Implementation for
Software Thread-Level Speculation",
21st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA'09),
August 2009, Calgary, Canada.
[4] Cosmin E. Oancea, Alan Mycroft and Stephen M. Watt.
"A New Approach to Parallelising Tracing Algorithms",
International Symposium on Memory Management (ISMM'09),
June 2009, Dublin, Ireland.
[5] Cosmin E. Oancea and Alan Mycroft.
"Set-Congruence Dynamic Analysis for Software Thread-Level Speculation",
21st Int. Workshop on Languages and Compilers for Parallel Computing (LCPC'08),
August 2008, Edmonton, Canada.
[6] Cosmin E. Oancea and Alan Mycroft.
"Software Thread-Level Speculation: an Optimistic Library Implementation",
International Workshop on Multicore Software Engineering (IWMSE'08),
pp 23-32 (ACM Digital Library), May 2008, Leipzig, Germany.
[7] Cosmin E. Oancea and Lawrence Rauchwerger.
"A Hybrid Approach to Proving Memory Reference Monotonicity",
24th Int. Workshop on Languages and Compilers for Parallel Computing (LCPC'11),
LNCS, Vol 7146, pp 61-75, Sept 2013.
[8] Cosmin E. Oancea and Lawrence Rauchwerger.
"Logical Inference Techniques for Loop Parallelization",
33rd ACM-SIGPLAN Conf. on Prog. Lang. Design and Implem. (PLDI'12),
pp 509-520, June 2012.