Harnessing Multi-Core Power: A Complete Guide to Parallel Programming in C

The era of ever-increasing single-core clock speeds has ended. Today's processors pack multiple cores, and fully utilizing them requires parallel programming techniques. C, with its direct access to hardware and low-level control, provides powerful tools for parallel computing. This comprehensive guide explores the landscape of parallel programming in C, from pthreads to OpenMP, MPI to GPU computing.

The Parallel Computing Landscape

┌─────────────────────────────────────────────────────────────────┐
│                     Parallel Programming Models                   │
├─────────────────────────────────────────────────────────────────┤
│  Shared Memory    │  Distributed Memory  │  Accelerator/GPU     │
│  ───────────────  │  ──────────────────  │  ────────────────    │
│  • pthreads       │  • MPI               │  • CUDA              │
│  • OpenMP         │  • PGAS              │  • OpenCL            │
│  • C11 threads    │  • UPC               │  • SYCL              │
└─────────────────────────────────────────────────────────────────┘

1. Shared Memory Parallelism with Pthreads

Pthreads (POSIX threads) is the foundation of shared-memory parallel programming in C.

Basic Thread Creation

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define N_ELEMENTS 1000000

typedef struct {
    int id;
    int start;
    int end;
    double* data;
    double result;
} ThreadData;

void* worker(void* arg) {
    ThreadData* data = (ThreadData*)arg;
    double sum = 0.0;
    for (int i = data->start; i < data->end; i++) {
        sum += data->data[i];
    }
    data->result = sum;
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    ThreadData thread_data[NUM_THREADS];
    // Allocate on the heap: an 8 MB local array would overflow a typical stack
    double* data = malloc(N_ELEMENTS * sizeof(double));
    int chunk_size = N_ELEMENTS / NUM_THREADS;
    // Initialize data
    for (int i = 0; i < N_ELEMENTS; i++) {
        data[i] = i * 1.0;
    }
    // Create threads, each summing its own contiguous chunk
    for (int i = 0; i < NUM_THREADS; i++) {
        thread_data[i].id = i;
        thread_data[i].start = i * chunk_size;
        thread_data[i].end = (i == NUM_THREADS - 1) ? N_ELEMENTS : (i + 1) * chunk_size;
        thread_data[i].data = data;
        pthread_create(&threads[i], NULL, worker, &thread_data[i]);
    }
    // Wait for completion and combine the per-thread partial sums
    double total = 0.0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        total += thread_data[i].result;
    }
    printf("Total: %f\n", total);
    free(data);
    return 0;
}

Parallel Reduction with Atomic Operations

#include <pthread.h>
#include <stdatomic.h>

atomic_int global_counter = 0;

void* atomic_worker(void* arg) {
    int* local_data = (int*)arg;
    for (int i = 0; i < 10000; i++) {
        atomic_fetch_add(&global_counter, local_data[i]);
    }
    return NULL;
}

// Using mutex for complex operations
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
double global_sum = 0.0;

void* mutex_worker(void* arg) {
    double* local_data = (double*)arg;
    double local_sum = 0.0;
    // Compute local sum without any locking
    for (int i = 0; i < 10000; i++) {
        local_sum += local_data[i];
    }
    // Critical section: one short update per thread
    pthread_mutex_lock(&mutex);
    global_sum += local_sum;
    pthread_mutex_unlock(&mutex);
    return NULL;
}

2. OpenMP: High-Level Shared Memory Parallelism

OpenMP provides compiler directives that make parallel programming much simpler.

Basic OpenMP Example

#include <omp.h>
#include <stdio.h>

int main() {
    #pragma omp parallel num_threads(4)
    {
        int thread_id = omp_get_thread_num();
        printf("Hello from thread %d\n", thread_id);
    }
    return 0;
}

Parallel For Loop

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

void vector_add(double* a, double* b, double* c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

void matrix_multiply(int n, double* a, double* b, double* c) {
    // collapse(2) parallelizes the combined i/j iteration space
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

Parallel Reduction

double compute_sum(double* array, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += array[i];
    }
    return sum;
}

// Finding maximum with reduction
double find_max(double* array, int n) {
    double max_val = array[0];
    #pragma omp parallel for reduction(max:max_val)
    for (int i = 0; i < n; i++) {
        if (array[i] > max_val) {
            max_val = array[i];
        }
    }
    return max_val;
}

Scheduling Strategies

void process(double x);  // placeholder for per-element work, defined elsewhere

void parallel_scheduling_examples(double* array, int n) {
    // Static scheduling - iterations divided into fixed 1000-element chunks
    #pragma omp parallel for schedule(static, 1000)
    for (int i = 0; i < n; i++) {
        process(array[i]);
    }
    // Dynamic scheduling - chunks handed out as threads become free
    #pragma omp parallel for schedule(dynamic, 100)
    for (int i = 0; i < n; i++) {
        process(array[i]);
    }
    // Guided scheduling - chunk sizes shrink as the loop progresses
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; i++) {
        process(array[i]);
    }
    // Runtime - determined by the OMP_SCHEDULE environment variable
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++) {
        process(array[i]);
    }
}
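
With `schedule(runtime)`, the strategy is chosen at launch time through the `OMP_SCHEDULE` environment variable (value is `kind[,chunk]`), for example:

```shell
OMP_SCHEDULE="dynamic,64" ./program
```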

Sections for Task Parallelism

void parallel_sections() {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            printf("Task 1 running on thread %d\n", omp_get_thread_num());
            compute_function1();
        }
        #pragma omp section
        {
            printf("Task 2 running on thread %d\n", omp_get_thread_num());
            compute_function2();
        }
        #pragma omp section
        {
            printf("Task 3 running on thread %d\n", omp_get_thread_num());
            compute_function3();
        }
    }
}

Task-Based Parallelism

#include <omp.h>

// Illustrative only: real code adds a serial cutoff (e.g. fall back to a
// plain recursive call for small n), because per-task overhead dwarfs the
// work near the leaves of the recursion.
void fibonacci_task(int n, int* result) {
    if (n <= 1) {
        *result = n;
        return;
    }
    int a, b;
    #pragma omp task shared(a)
    fibonacci_task(n - 1, &a);
    #pragma omp task shared(b)
    fibonacci_task(n - 2, &b);
    #pragma omp taskwait  // a and b are valid only after both child tasks finish
    *result = a + b;
}

int fibonacci(int n) {
    int result;
    #pragma omp parallel
    {
        #pragma omp single  // one thread seeds the task tree; all threads execute it
        fibonacci_task(n, &result);
    }
    return result;
}

3. Distributed Memory Parallelism with MPI

MPI (Message Passing Interface) is the standard for distributed memory programming across multiple nodes.

Basic MPI Example

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Point-to-Point Communication

#include <mpi.h>
#include <stdio.h>

void ping_pong() {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int token = 0;
    if (rank == 0) {
        // Send to process 1
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("Process 0 sent token %d\n", token);
        // Receive from process 1
        MPI_Recv(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 0 received token %d\n", token);
    } else if (rank == 1) {
        // Receive from process 0
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received token %d\n", token);
        // Increment and send back
        token++;
        MPI_Send(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        printf("Process 1 sent token %d\n", token);
    }
}

Collective Communication

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

void collective_example() {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int local_data = rank;
    int global_sum;
    int global_max;
    // Reduce: sum across all processes; the result lands only on rank 0
    MPI_Reduce(&local_data, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    // Allreduce: every process gets the maximum
    MPI_Allreduce(&local_data, &global_max, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
    // Gather data from all processes onto rank 0
    int* all_data = NULL;
    if (rank == 0) {
        all_data = malloc(size * sizeof(int));
    }
    MPI_Gather(&local_data, 1, MPI_INT, all_data, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // Broadcast from root
    int broadcast_value;
    if (rank == 0) {
        broadcast_value = 42;
    }
    MPI_Bcast(&broadcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // Scatter array: each process receives one element
    int* scatter_data = NULL;
    int recv_data;
    if (rank == 0) {
        scatter_data = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) {
            scatter_data[i] = i * 10;
        }
    }
    MPI_Scatter(scatter_data, 1, MPI_INT, &recv_data, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        printf("Global sum: %d\n", global_sum);
        printf("Global max: %d\n", global_max);
        printf("All data: ");
        for (int i = 0; i < size; i++) {
            printf("%d ", all_data[i]);
        }
        printf("\n");
        free(all_data);
        free(scatter_data);
    }
    printf("Process %d: broadcast received %d, scatter received %d\n",
           rank, broadcast_value, recv_data);
}

Parallel Matrix Multiplication with MPI

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000

void parallel_matrix_multiply() {
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // For simplicity, this assumes N is evenly divisible by the process count
    int rows_per_proc = N / size;
    int* a = NULL;
    int* b = NULL;
    int* c_local = malloc(rows_per_proc * N * sizeof(int));
    if (rank == 0) {
        // Initialize matrices on the root process
        a = malloc(N * N * sizeof(int));
        b = malloc(N * N * sizeof(int));
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                a[i * N + j] = i + j;
                b[i * N + j] = i - j;
            }
        }
    }
    // Scatter rows of A to all processes
    int* a_local = malloc(rows_per_proc * N * sizeof(int));
    MPI_Scatter(a, rows_per_proc * N, MPI_INT,
                a_local, rows_per_proc * N, MPI_INT,
                0, MPI_COMM_WORLD);
    // Broadcast B to all processes
    if (rank != 0) {
        b = malloc(N * N * sizeof(int));
    }
    MPI_Bcast(b, N * N, MPI_INT, 0, MPI_COMM_WORLD);
    // Local computation: each process multiplies its rows of A against B
    for (int i = 0; i < rows_per_proc; i++) {
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++) {
                sum += a_local[i * N + k] * b[k * N + j];
            }
            c_local[i * N + j] = sum;
        }
    }
    // Gather results
    int* c = NULL;
    if (rank == 0) {
        c = malloc(N * N * sizeof(int));
    }
    MPI_Gather(c_local, rows_per_proc * N, MPI_INT,
               c, rows_per_proc * N, MPI_INT,
               0, MPI_COMM_WORLD);
    // Cleanup
    free(a_local);
    free(c_local);
    free(b);
    if (rank == 0) {
        free(a);
        free(c);
    }
}

4. GPU Computing with CUDA

CUDA enables parallel computing on NVIDIA GPUs.

Basic CUDA Kernel

#include <cuda_runtime.h>
#include <stdio.h>

// Kernel function - runs on the GPU, one thread per element
__global__ void vector_add(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    int n = 1000000;
    size_t bytes = n * sizeof(float);
    // Host arrays
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    float* h_c = (float*)malloc(bytes);
    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = i;
        h_b[i] = i * 2;
    }
    // Device arrays
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    // Copy to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    // Launch kernel: enough blocks to cover all n elements
    int threads_per_block = 256;
    int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
    vector_add<<<blocks_per_grid, threads_per_block>>>(d_a, d_b, d_c, n);
    // Copy result back (this synchronous copy also waits for the kernel)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    // Cleanup
    free(h_a); free(h_b); free(h_c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Shared Memory Optimization

// Matrix multiplication with shared memory
__global__ void matrix_multiply_shared(float* a, float* b, float* c,
                                       int n, int m, int k) {
    extern __shared__ float shared[];
    float* shared_a = shared;
    // shared_a occupies one blockDim.y x blockDim.x tile; shared_b follows it
    float* shared_b = &shared[blockDim.y * blockDim.x];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int tile = 0; tile < (k + blockDim.x - 1) / blockDim.x; tile++) {
        // Load into shared memory, zero-padding past the matrix edges
        if (row < n && (tile * blockDim.x + threadIdx.x) < k) {
            shared_a[threadIdx.y * blockDim.x + threadIdx.x] =
                a[row * k + tile * blockDim.x + threadIdx.x];
        } else {
            shared_a[threadIdx.y * blockDim.x + threadIdx.x] = 0.0f;
        }
        if (col < m && (tile * blockDim.x + threadIdx.y) < k) {
            shared_b[threadIdx.y * blockDim.x + threadIdx.x] =
                b[(tile * blockDim.x + threadIdx.y) * m + col];
        } else {
            shared_b[threadIdx.y * blockDim.x + threadIdx.x] = 0.0f;
        }
        __syncthreads();
        // Compute partial sum for this tile
        for (int i = 0; i < blockDim.x; i++) {
            sum += shared_a[threadIdx.y * blockDim.x + i] *
                   shared_b[i * blockDim.x + threadIdx.x];
        }
        __syncthreads();
    }
    if (row < n && col < m) {
        c[row * m + col] = sum;
    }
}

Streams for Concurrent Execution

void concurrent_execution() {
    int n = 1000000;
    size_t bytes = n * sizeof(float);
    int threads_per_block = 256;
    int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    float *h_a1, *h_a2, *d_a1, *d_a2, *d_b1, *d_b2, *h_c1, *h_c2;
    // Pinned host memory is required for truly asynchronous copies
    cudaMallocHost(&h_a1, bytes);
    cudaMallocHost(&h_a2, bytes);
    cudaMallocHost(&h_c1, bytes);
    cudaMallocHost(&h_c2, bytes);
    cudaMalloc(&d_a1, bytes);
    cudaMalloc(&d_a2, bytes);
    cudaMalloc(&d_b1, bytes);
    cudaMalloc(&d_b2, bytes);
    // Initialize data
    for (int i = 0; i < n; i++) {
        h_a1[i] = i;
        h_a2[i] = i * 2;
    }
    // Give the second operands defined values before the kernels read them
    cudaMemset(d_b1, 0, bytes);
    cudaMemset(d_b2, 0, bytes);
    // Asynchronous operations on stream1
    cudaMemcpyAsync(d_a1, h_a1, bytes, cudaMemcpyHostToDevice, stream1);
    vector_add<<<blocks_per_grid, threads_per_block, 0, stream1>>>(d_a1, d_b1, d_a1, n);
    cudaMemcpyAsync(h_c1, d_a1, bytes, cudaMemcpyDeviceToHost, stream1);
    // Asynchronous operations on stream2 (may overlap with stream1)
    cudaMemcpyAsync(d_a2, h_a2, bytes, cudaMemcpyHostToDevice, stream2);
    vector_add<<<blocks_per_grid, threads_per_block, 0, stream2>>>(d_a2, d_b2, d_a2, n);
    cudaMemcpyAsync(h_c2, d_a2, bytes, cudaMemcpyDeviceToHost, stream2);
    // Wait for both streams
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);
    // Cleanup
    cudaFreeHost(h_a1); cudaFreeHost(h_a2); cudaFreeHost(h_c1); cudaFreeHost(h_c2);
    cudaFree(d_a1); cudaFree(d_a2); cudaFree(d_b1); cudaFree(d_b2);
    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
}

5. Task-Based Parallelism with Intel TBB

Intel Threading Building Blocks (TBB, now oneTBB) provides high-level task-based parallelism. Note that TBB exposes a C++ API, so the examples below are C++ rather than C.

#include <tbb/task_group.h>
#include <tbb/parallel_for.h>
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

using namespace tbb;

// Parallel for
void parallel_for_example(float* a, float* b, float* c, int n) {
    parallel_for(blocked_range<int>(0, n),
        [=](const blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); i++) {
                c[i] = a[i] + b[i];
            }
        }
    );
}

// Parallel reduce
float parallel_sum(float* array, int n) {
    return parallel_reduce(
        blocked_range<int>(0, n),
        0.0f,
        [=](const blocked_range<int>& r, float init) -> float {
            float sum = init;
            for (int i = r.begin(); i != r.end(); i++) {
                sum += array[i];
            }
            return sum;
        },
        [](float x, float y) { return x + y; }
    );
}

// Task graph
void task_graph_example() {
    task_group g;
    int a = 0, b = 0, c = 0;
    g.run([&] { a = compute_a(); });
    g.run([&] { b = compute_b(); });
    g.wait();   // a and b are guaranteed complete only after the group drains
    c = a + b;  // (run_and_wait would NOT wait for a and b before running this)
    g.run([&] { compute_d(c); });
    g.wait();
}

6. Performance Optimization Techniques

False Sharing Prevention

// Bad: false sharing
struct BadData {
    int counter1;
    int counter2;  // Likely in the same cache line as counter1
};
// Good: padded to avoid false sharing
struct GoodData {
    int counter1;
    char padding[64];  // Cache line size typically 64 bytes
    int counter2;
};
// Or use alignment: in C11, alignas applies to a member or a variable,
// not to the struct declaration itself
#include <stdalign.h>
struct PaddedData {
    alignas(64) int counter;  // each PaddedData starts on its own cache line
};
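
A quick way to verify that the padding actually separated the two counters is to inspect the struct layout with `offsetof`. A minimal sketch, assuming 64-byte cache lines (the helper name is illustrative):

```c
#include <stddef.h>

struct BadData  { int counter1; int counter2; };                   // counters adjacent
struct GoodData { int counter1; char padding[64]; int counter2; }; // forced apart

// Returns 1 when two member offsets fall in different 64-byte cache lines
int different_cache_lines(size_t off1, size_t off2) {
    return (off1 / 64) != (off2 / 64);
}
```

For `BadData`, `counter2` sits at offset 4, in the same line as `counter1`; in `GoodData` it sits past offset 64, so two threads hammering the two counters no longer invalidate each other's cache line.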

Work Stealing Scheduler

#include <pthread.h>
#include <stdlib.h>

typedef struct Task {
    void (*func)(void*);
    void* arg;
} Task;

typedef struct {
    Task** queues;          // queues[t]: array of tasks owned by thread t
    int* queue_heads;       // index of the next task to run in each queue
    int* queue_tails;       // one past the last queued task in each queue
    pthread_mutex_t* locks; // one lock per queue; production schedulers use
                            // a lock-free deque (e.g. Chase-Lev) instead
    int num_threads;
} WorkStealingQueue;

// Worker argument: carries both the thread id and the shared scheduler state
typedef struct {
    int tid;
    WorkStealingQueue* ws;
} WorkerArg;

// Pop the next task from queue q, or return NULL if it is empty
static Task* try_pop(WorkStealingQueue* ws, int q) {
    Task* task = NULL;
    pthread_mutex_lock(&ws->locks[q]);
    if (ws->queue_heads[q] < ws->queue_tails[q]) {
        task = &ws->queues[q][ws->queue_heads[q]++];
    }
    pthread_mutex_unlock(&ws->locks[q]);
    return task;
}

// Work stealing loop: prefer the local queue, steal when it runs dry
void* worker_loop(void* arg) {
    WorkerArg* wa = (WorkerArg*)arg;
    WorkStealingQueue* ws = wa->ws;
    int tid = wa->tid;
    while (1) {
        Task* task = try_pop(ws, tid);
        if (!task) {
            // Local queue empty: try to steal from the other threads
            for (int victim = 0; victim < ws->num_threads; victim++) {
                if (victim == tid) continue;
                if ((task = try_pop(ws, victim)) != NULL) break;
            }
        }
        if (task) {
            task->func(task->arg);
        } else {
            break;  // no work anywhere: exit (real schedulers park and retry)
        }
    }
    return NULL;
}

7. Debugging Parallel Programs

Using ThreadSanitizer

# Compile with ThreadSanitizer
gcc -fsanitize=thread -g -pthread program.c -o program
# Run
./program
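
ThreadSanitizer needs a race to report. This sketch (function names are illustrative) deliberately races two unsynchronized increments on one shared counter, the canonical bug TSan flags; build it with `-fsanitize=thread` as above and TSan prints a data-race report:

```c
#include <pthread.h>

// Deliberately racy: both threads do an unsynchronized
// read-modify-write on the same counter.
static long counter;

static void* racy_increment(void* arg) {
    long iters = *(long*)arg;
    for (long i = 0; i < iters; i++) {
        counter++;  // not atomic: separate load, add, store
    }
    return NULL;
}

// Runs two racing threads and returns the final counter value; because
// increments are lost, the result is often LESS than 2 * iters
long run_race(long iters) {
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, racy_increment, &iters);
    pthread_create(&t2, NULL, racy_increment, &iters);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```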

Valgrind Helgrind

valgrind --tool=helgrind ./program

Debugging with GDB

# Set breakpoints on specific threads
(gdb) break main thread 2
(gdb) thread apply all bt
(gdb) thread 3
(gdb) info threads

8. Complete Example: Parallel Mandelbrot Set

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <math.h>

#define WIDTH 800
#define HEIGHT 600
#define MAX_ITER 1000
#define NUM_THREADS 4

typedef struct {
    int start_row;
    int end_row;
    double x_min;
    double x_max;
    double y_min;
    double y_max;
    int* output;
} MandelbrotData;

int mandelbrot(double x0, double y0) {
    double x = 0.0, y = 0.0;
    int iter = 0;
    while (x*x + y*y <= 4.0 && iter < MAX_ITER) {
        double xtemp = x*x - y*y + x0;
        y = 2*x*y + y0;
        x = xtemp;
        iter++;
    }
    return iter;
}

void* compute_mandelbrot(void* arg) {
    MandelbrotData* data = (MandelbrotData*)arg;
    for (int row = data->start_row; row < data->end_row; row++) {
        double y = data->y_min + (double)row / HEIGHT * (data->y_max - data->y_min);
        for (int col = 0; col < WIDTH; col++) {
            double x = data->x_min + (double)col / WIDTH * (data->x_max - data->x_min);
            int iter = mandelbrot(x, y);
            // Convert iteration count to color (grayscale)
            int color = (int)((double)iter / MAX_ITER * 255);
            data->output[row * WIDTH + col] = color;
        }
    }
    return NULL;
}

int main() {
    int* image = malloc(WIDTH * HEIGHT * sizeof(int));
    MandelbrotData data[NUM_THREADS];
    pthread_t threads[NUM_THREADS];
    // Mandelbrot set range
    double x_min = -2.0, x_max = 1.0;
    double y_min = -1.5, y_max = 1.5;
    int rows_per_thread = HEIGHT / NUM_THREADS;
    // Create threads, each owning a horizontal band of rows
    for (int i = 0; i < NUM_THREADS; i++) {
        data[i].start_row = i * rows_per_thread;
        data[i].end_row = (i == NUM_THREADS - 1) ? HEIGHT : (i + 1) * rows_per_thread;
        data[i].x_min = x_min;
        data[i].x_max = x_max;
        data[i].y_min = y_min;
        data[i].y_max = y_max;
        data[i].output = image;
        pthread_create(&threads[i], NULL, compute_mandelbrot, &data[i]);
    }
    // Wait for completion
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    // Save image (ASCII PGM format)
    FILE* f = fopen("mandelbrot.pgm", "w");
    fprintf(f, "P2\n%d %d\n255\n", WIDTH, HEIGHT);
    for (int i = 0; i < WIDTH * HEIGHT; i++) {
        fprintf(f, "%d ", image[i]);
        if ((i + 1) % WIDTH == 0) fprintf(f, "\n");
    }
    fclose(f);
    free(image);
    printf("Mandelbrot image saved to mandelbrot.pgm\n");
    return 0;
}

Best Practices Summary

  1. Identify parallelism: Use Amdahl's law to understand potential speedup
  2. Minimize synchronization: Reduce critical sections and lock contention
  3. Use appropriate granularity: Too fine-grained adds overhead, too coarse underutilizes cores
  4. Consider data locality: Keep data close to the threads that use it
  5. Profile and measure: Use tools to find bottlenecks
  6. Test for race conditions: Use ThreadSanitizer and Helgrind
  7. Scale gracefully: Ensure performance scales with core count
  8. Document assumptions: Thread-safety, data sharing, synchronization
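
Point 1 can be made concrete. Amdahl's law says that if a fraction p of the runtime parallelizes perfectly over n cores, speedup is capped at 1 / ((1 - p) + p/n). A small sketch (the function name is illustrative):

```c
// Amdahl's law: maximum speedup when fraction p of the work
// parallelizes perfectly across n cores
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

Even at p = 0.95, sixteen cores yield only about a 9x speedup, and no core count can push it past 1/(1 - p) = 20x; measuring p before parallelizing tells you whether the effort is worth it.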

Common Parallel Patterns

Pattern      Description                       Use Case
Map          Apply function to each element    Vector operations
Reduce       Combine elements                  Sum, max, min
Scan         Prefix operations                 Cumulative sums
Stencil      Compute based on neighbors        Image processing
Fork-Join    Split and merge tasks             Recursive algorithms
Pipeline     Stream processing                 Signal processing
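
Of these patterns, scan is the only one this article has not shown in code. One common scheme is a blocked parallel inclusive prefix sum in three phases: independent per-chunk scans, a short serial scan of the chunk totals, then a parallel offset fix-up. A sketch with OpenMP (names are illustrative):

```c
#include <stdlib.h>

// Blocked inclusive prefix sum: out[i] = in[0] + ... + in[i].
// Phases 1 and 3 parallelize across chunks; phase 2 is a short serial scan.
void inclusive_scan(const double* in, double* out, int n, int nchunks) {
    double* chunk_sum = malloc(nchunks * sizeof(double));
    int chunk = (n + nchunks - 1) / nchunks;

    // Phase 1: independent local scan of each chunk
    #pragma omp parallel for
    for (int c = 0; c < nchunks; c++) {
        int lo = c * chunk;
        int hi = (c + 1) * chunk < n ? (c + 1) * chunk : n;
        double running = 0.0;
        for (int i = lo; i < hi; i++) {
            running += in[i];
            out[i] = running;
        }
        chunk_sum[c] = running;
    }

    // Phase 2: serial exclusive scan of the chunk totals
    double offset = 0.0;
    for (int c = 0; c < nchunks; c++) {
        double s = chunk_sum[c];
        chunk_sum[c] = offset;
        offset += s;
    }

    // Phase 3: add each chunk's starting offset to its elements
    #pragma omp parallel for
    for (int c = 0; c < nchunks; c++) {
        int lo = c * chunk;
        int hi = (c + 1) * chunk < n ? (c + 1) * chunk : n;
        for (int i = lo; i < hi; i++) {
            out[i] += chunk_sum[c];
        }
    }
    free(chunk_sum);
}
```

The array is read twice and written twice, so the parallel version wins only when n is large enough to amortize the extra pass.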

Conclusion

Parallel programming in C offers tremendous performance gains but requires careful design and implementation. The tools and techniques presented—from pthreads and OpenMP for shared memory to MPI for distributed systems and CUDA for GPU acceleration—provide a comprehensive toolkit for tackling parallel problems.

Key takeaways:

  • Choose the right model: Shared memory for single node, distributed for clusters
  • Start simple: Use OpenMP for shared memory, then optimize
  • Think about data: Data decomposition is as important as task decomposition
  • Measure performance: Speedup isn't guaranteed—profile and verify
  • Handle complexity: Parallel bugs are subtle—use tools to detect them

With the end of Dennard scaling and the rise of multi-core processors, parallel programming is no longer optional for performance-critical applications. By mastering these techniques, you can write C programs that fully harness the power of modern hardware.
