
About C: write syscall gets considerably slower when called from another thread

Question Detail

If THR is defined, the code does the same job, but in another thread. I only measured the time spent in the write call.

Running the code with ./some-file-name > /dev/null, these are the results I get: the accumulated clock cycles, reported once per 100000 iterations.

THR not defined

   1    48930106
   2    43946464
   3    44669126
   4    45918011
   5    44108477
   6    43608789
   7    45104427
   8    49676889
   9    44682305
  10    47516931

THR defined

   1   108347418
   2   101670307
   3   101726085
   4   100531554
   5   100137343
   6    85837022
   7   105556754
   8   104681843
   9   110303338
  10   104666783

Why is write when called from another thread so much slower?

The system is Fedora Linux.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <pthread.h>
#include <semaphore.h>
#include <fcntl.h>
#include <unistd.h>
#include <immintrin.h>
#ifdef __cplusplus
#include <atomic>
using namespace std;
#else
#include <stdatomic.h>
#endif

#define SIZE 0x100000

static unsigned long long rdtscp() {
    unsigned _;
    return __rdtscp(&_);
}

static char b[SIZE];
static atomic_ullong oc;

#ifdef THR
static sem_t s[2];

void *out(void *_) {
    for (;;) {
        sem_wait(s);
        unsigned long long c = rdtscp();
        write(1, b, SIZE);
        oc += rdtscp() - c;
        sem_post(s + 1);
    }
    return _;
}
#endif

int main() {
    memset(b, 'a', SIZE);
#ifdef THR
    sem_init(s, false, 0);
    sem_init(s + 1, false, 0);
    pthread_t t;
    pthread_create(&t, NULL, out, NULL);
#endif
    for (int i = 1;; ++i) {
#ifdef THR
        sem_post(s);
        sem_wait(s + 1);
#else
        unsigned long long c = rdtscp();
        write(1, b, SIZE);
        oc += rdtscp() - c;
#endif
        const int d = 100000;
        if (!(i % d)) {
            unsigned long long _oc = atomic_exchange(&oc, 0);
            fprintf(stderr, "%4d%12llu\n", i / d, _oc);
        }
    }
}

Not sure if this is okay, but I made the code compile as both C and C++ so I could add the C++ tag. I will roll back if this is inappropriate.

Question Answer

In one case, the data to be written is hot in the writing core's cache, and there is no synchronization or dispatch overhead.

In the other case, the data to be written sits modified in some other core's cache. There is both synchronization overhead (the semaphore round trip) and dispatch overhead (waking the other thread).

It's not surprising that it takes more clock cycles to process the data when caches are cold, cores have to synchronize, and a waiting thread has to be dispatched rather than an existing thread simply continuing.

A common anti-pattern is to convey data from thread to thread rather than processing it to completion in a single thread.
