
Running a limited number of child processes in parallel in bash [duplicate]

Question Detail

I have a large set of files for which some heavy processing needs to be done.
This processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job), and takes a few minutes to run.
My current use case is to start a Hadoop job on the input data, but I’ve had this same problem in other cases before.

In order to fully utilize the available CPU power, I want to be able to run several of those tasks in parallel.

However, a very simple example shell script like this will thrash the system due to excessive load and swapping, because it starts a process for every file at once:

find . -type f | while read -r name
do
   some_heavy_processing_command "${name}" &
done

So what I want is essentially similar to what “gmake -j4” does.

I know bash supports the “wait” command, but that only waits until all child processes have completed. In the past I’ve written scripts that run ‘ps’ and then grep the child processes out by name (yes, I know … ugly).

What is the simplest/cleanest/best solution to do what I want?


Edit: Thanks to Frederik: yes, indeed, this is a duplicate of “How to limit number of threads/sub-processes used in a function in bash”.
The “xargs --max-procs=4” approach works like a charm.
(So I voted to close my own question.)

Question Answer

I know I’m late to the party with this answer but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 & 5 to be appropriate for your scenario.)

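function max2 {
   # block while two or more background jobs are still running
   while [ "$(jobs -r | wc -l)" -ge 2 ]
   do
      sleep 5
   done
}

# feed the loop via process substitution so it runs in the current shell;
# with "find | while", the jobs belong to a subshell and the final wait returns at once
while IFS= read -r name
do
   max2; some_heavy_processing_command "${name}" &
done < <(find . -type f)
wait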
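
The polling loop above wakes up on a fixed interval. On bash 4.3 or newer, `wait -n` blocks until any single background job exits, which avoids the sleep. The following is only a sketch under that assumption: `heavy_task` and `run_throttled` are hypothetical names standing in for some_heavy_processing_command and the driver loop.

```shell
#!/usr/bin/env bash
# Sketch: keep at most max_jobs background jobs alive using `wait -n`
# (assumes bash 4.3 or newer). heavy_task is a hypothetical placeholder
# for the real command.
heavy_task() {
    sleep 0.2
    echo "finished $1"
}

run_throttled() {
    local max_jobs=$1
    shift
    local item
    for item in "$@"; do
        # If the limit is reached, block until any one job exits.
        while (( $(jobs -r -p | wc -l) >= max_jobs )); do
            wait -n
        done
        heavy_task "$item" &
    done
    wait   # wait for the jobs that are still running
}

# Usage: process five items, at most two at a time.
run_throttled 2 a b c d e
```

The lines come out in completion order, so the order of the output is not guaranteed.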

#! /usr/bin/env bash

set -o monitor
# enables job control: background jobs run in their own process groups
trap add_next_job CHLD
# execute add_next_job each time a child process exits

# places find's output into an array; the unquoted expansion splits on
# whitespace, so this assumes file names contain no spaces
todo_array=($(find . -type f))

index=0
max_jobs=2

function add_next_job {
    # if still jobs to do then add one
    if [[ $index -lt ${#todo_array[*]} ]]
    # note: the hash in ${#todo_array[*]} is not a comment;
    # it is bash's (awkward) syntax for the length of the array
    then
        echo adding job "${todo_array[$index]}"
        do_job "${todo_array[$index]}" &
        # replace the line above with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"

Having said that Fredrik makes the excellent point that xargs does exactly what you want…
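
As a rough sketch of that xargs approach (my own illustration, not copied from the linked duplicate): `-P` is the short form of GNU xargs' `--max-procs`, and `-print0` with `-0` keeps file names containing spaces intact. `run_limited` and the demo directory are made-up names, and `echo processed` stands in for some_heavy_processing_command.

```shell
#!/usr/bin/env bash
# Sketch: cap the number of simultaneous processes with xargs -P.
# run_limited is a hypothetical helper, not part of any tool.
run_limited() {
    # $1 = directory to scan, $2 = maximum number of parallel processes,
    # remaining arguments = command to run once per file
    local dir=$1 max_procs=$2
    shift 2
    find "$dir" -type f -print0 | xargs -0 -n 1 -P "$max_procs" "$@"
}

# Usage: a throwaway directory with three files, two processes at a time.
demo_dir=$(mktemp -d)
touch "$demo_dir/a file" "$demo_dir/b" "$demo_dir/c"
run_limited "$demo_dir" 2 echo processed
rm -rf "$demo_dir"
```

Each file is passed as the last argument of its own command invocation; the output order depends on which process finishes first.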

With GNU Parallel it becomes simpler:

find . -type f | parallel some_heavy_processing_command {}

Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

I think I found a handier solution using make:

#!/usr/bin/make -f

THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)

.PHONY: all $(TARGETS)

all: $(TARGETS)

$(TARGETS):
	some_heavy_processing_command $@

$(THIS): ; # Prevent make from trying to remake this makefile

Save it as, e.g., ‘test.mak’ and make it executable. If you run ./test.mak, it calls some_heavy_processing_command on the files one by one. If you run ./test.mak -j 4, it runs four subprocesses at once. You can also use it in a more sophisticated way: with ./test.mak -j 5 -l 1.5 it runs at most 5 subprocesses while the system load is below 1.5, and it starts no additional ones while the load is above 1.5.

It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.

This code worked quite well for me.

I noticed one issue, though: if max_jobs is greater than the number of elements in the array, the script never ends. The start-up loop keeps calling add_next_job, which stops incrementing index once the array is exhausted, so index never reaches max_jobs.

To prevent this, I added the following right after the “max_jobs” declaration.

if [ $max_jobs -gt ${#todo_array[*]} ]
then
    # max_jobs exceeds the number of array elements, so cap it at the array length
    max_jobs=${#todo_array[*]}
fi
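
An alternative to capping max_jobs, sketched here under my own assumptions, is to let the start-up loop stop at whichever bound it reaches first. The add_next_job below is a simplified stand-in that only records what it would start (the `started` array is hypothetical), so the loop logic can be tried in isolation.

```shell
#!/usr/bin/env bash
# Sketch: start-up loop that cannot spin forever when max_jobs exceeds
# the number of queued items. add_next_job here is a simplified stand-in
# that records items instead of launching real work.
todo_array=(a b)
max_jobs=5
index=0
started=()

add_next_job() {
    if [[ $index -lt ${#todo_array[*]} ]]; then
        started+=("${todo_array[$index]}")
        index=$((index + 1))
    fi
}

# Stop at max_jobs or at the end of the array, whichever comes first.
while [[ $index -lt $max_jobs && $index -lt ${#todo_array[*]} ]]
do
    add_next_job
done

echo "started ${#started[@]} job(s)"   # prints: started 2 job(s)
```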

Another option:

PARALLEL_MAX=...
function start_job() {
  while [ $(ps --no-headers -o pid --ppid=$$ | wc -l) -gt $PARALLEL_MAX ]; do
    sleep .1  # Wait for background tasks to complete.
  done
  "$@" &
}
start_job some_big_command1
start_job some_big_command2
start_job some_big_command3
start_job some_big_command4
...

Here is a very good function I have used to control the maximum number of jobs from bash or ksh. NOTE: the - 1 in the pgrep expression subtracts the extra subprocess created by the pgrep | wc -l pipeline itself.

function jobmax
{
    typeset -i MAXJOBS=$1
    sleep .1
    while (( ($(pgrep -P $$ | wc -l) - 1) >= $MAXJOBS ))
    do
        sleep .1
    done
}

nproc=5
for i in {1..100}
do
    sleep 1 &
    jobmax $nproc
done
wait # Wait for the rest
