A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations). A program that computes new digits of π will typically be CPU-bound: it's just crunching numbers.

A program is I/O bound if it would go faster if the I/O subsystem were faster. Which exact I/O subsystem is meant can vary; I typically associate it with disk, but of course networking or communication in general is common too. A program that looks through a huge file for some data might become I/O bound, since the bottleneck is then the reading of the data from disk (actually, this example is perhaps somewhat old-fashioned these days with hundreds of MB/s coming in from SSDs).

CPU bound means the rate at which a process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.

I/O bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example counting the number of lines in a file, is likely to be I/O bound.

Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of that memory access. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be memory bound.

Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that simply processes more data than fits in the cache will be cache bound.

I/O bound would be slower than memory bound, which would be slower than cache bound, which would be slower than CPU bound.

The solution to being I/O bound isn't necessarily to get more memory. In some situations, the access algorithm may be designed around the I/O, memory, or cache limitations. See cache-oblivious algorithms.

Multi-threading is where it tends to matter the most

In this answer, I will investigate one important use case for distinguishing between CPU- and I/O-bound work: writing multi-threaded code.

RAM I/O bound example: Vector Sum

Consider a program that sums all the values of a single vector:

```c
#define SIZE 1000000000
unsigned int is[SIZE];
unsigned int sum = 0;
size_t i = 0;
for (i = 0; i < SIZE; i++)
    sum += is[i];
```

Parallelizing that by splitting the array equally across each of your cores is of limited usefulness on common modern desktops.

For example, on my Ubuntu 19.04, Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ (4 cores / 8 threads) and RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB), I get results like this:


Plot data.

Note however that there is a lot of variance between runs. But I can't increase the array size much further since I'm already at 8GiB, and I'm not in the mood for statistics across multiple runs today. This did seem like a typical run after doing many manual runs.

Benchmark code:

POSIX C pthread source code used in the graph.

And here is a C++ version that produces analogous results.

Plot script.

I don't know enough computer architecture to fully explain the shape of the curve, but one thing is clear: the computation does not become 8x faster as naively expected from using all my 8 threads! For some reason, 2 and 3 threads were the optimum, and adding more just makes things much slower.

Compare this to CPU bound work, which actually does get 8 times faster: What do "real", "user" and "sys" mean in the output of time(1)?

The reason is that all processors share a single memory bus linking to RAM:

```
CPU 1 --\    Bus    +-----+
CPU 2 ---\__________| RAM |
...   ---/          +-----+
CPU N --/
```

so the memory bus quickly becomes the bottleneck, not the CPU.

This happens because adding two numbers takes a single CPU cycle, while memory reads took about 100 CPU cycles on 2016 hardware.

So the CPU work done per byte of input data is too small, and we call this an I/O-bound process.

The only way to speed up that computation further would be to speed up individual memory accesses with new memory hardware, e.g. multi-channel memory.

Upgrading to a faster CPU clock, for example, would not be very useful.

Other examples

Matrix multiplication is CPU-bound on RAM and on GPUs. The input contains 2 * N**2 numbers, but N**3 multiplications are done, and that is enough for parallelization to be worth it for practical large N.

This is why parallel CPU matrix multiplication libraries like the following exist:


Cache usage makes a big difference to the speed of implementations. See for example this didactic GPU comparison example.

See also:

Why can GPU do matrix multiplication faster than CPU?
BLAS equivalent of a LAPACK function for GPUs

A fake C++ CPU bound operation that takes one number and crunches it a lot:


Sorting appears to be CPU bound based on the following experiment: Are C++17 Parallel Algorithms implemented already? That showed a 4x performance improvement for parallel sort, but I would like to have a more theoretical confirmation as well.

The well-known Coremark benchmark from EEMBC explicitly checks how well a suite of problems scales. Sample benchmark result clearly showing that:

How to find out if you are CPU or IO bound

Non-RAM IO bound like disk, network: ps aux, then check if CPU% / 100 < n threads. If yes, you are IO bound; e.g. blocking reads are just waiting for data and the scheduler is skipping that process. Then use further tools like sudo iotop to decide exactly which IO is the problem.

Or, if execution is quick and you parametrize the number of threads, you can see easily from time that performance improves as the number of threads increases for CPU bound work: What do "real", "user" and "sys" mean in the output of time(1)?

RAM-IO bound: harder to tell, as RAM wait time is included in CPU% measurements; see also:

How to check if application is cpu-bound or memory-bound?
https://askubuntu.com/questions/1540/how-can-i-find-out-if-a-process-is-cpu-memory-or-disk-bound

Some options:

Intel Advisor Roofline (non-free): https://software.intel.com/en-us/articles/intel-advisor-roofline (archive) "A Roofline chart is a visual representation of application performance in relation to hardware limitations, including memory bandwidth and computational peaks."


GPUs have an IO bottleneck when you first transfer the input data from the regular CPU-readable RAM to the GPU.

Therefore, GPUs can only be better than CPUs for CPU bound applications.

Once the data is transferred to the GPU, however, it can operate on those bytes faster than the CPU can, because the GPU:

has more data localization than most CPU systems, and so data can be accessed faster for some cores than others

exploits data parallelism and sacrifices latency by just skipping over any data that is not ready to be operated on immediately.

Since the GPU has to operate on large parallel input data, it is better to just skip to the next data that might be readily available, instead of waiting for the current data to become available and blocking all other operations like the CPU mostly does.

Therefore the GPU can be faster than a CPU if your application:

can be highly parallelized: different chunks of data can be treated separately from one another at the same time
requires a large enough number of operations per input byte (unlike e.g. vector addition, which does one addition per byte only)
operates on a large number of input bytes

These design choices originally targeted the application of 3D rendering, whose main steps are as shown at What are shaders in OpenGL and what do we need them for?

vertex shader: multiply a bunch of 1x4 vectors by a 4x4 matrix
fragment shader: calculate the color of each pixel of a triangle based on its relative position within the triangle

and so we conclude that those applications are CPU-bound.

With the advent of programmable GPGPU, we can observe several GPGPU applications that serve as examples of CPU bound operations:

Is it possible to build a heatmap from point data at 60 times per second?

Plotting of heatmap graphs if the plotted function is complex enough.


https://www.youtube.com/watch?v=fE0P6H8eK4I "Real-Time Fluid Dynamics: CPU vs GPU" by Jesús Martín Berlanga

Solving partial differential equations such as the Navier-Stokes equations of fluid dynamics:

highly parallel in nature, because each point only interacts with its neighbours
there tend to be enough operations per byte

See also:

Why are we still using CPUs instead of GPUs?
What are GPUs bad at?
https://www.youtube.com/watch?v=_cyVDoyI6NE "CPU vs GPU (What's the Difference?) - Computerphile"

CPython Global Interpreter Lock (GIL)

As a quick case study, I want to point out the Python Global Interpreter Lock (GIL): What is the global interpreter lock (GIL) in CPython?

This CPython implementation detail prevents multiple Python threads from efficiently using multiple cores for CPU-bound work. The CPython docs say:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
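For instance, a CPU-bound workload can sidestep the GIL by using processes instead of threads, with ProcessPoolExecutor as the docs suggest. A minimal sketch (the workload function is my own illustration):

```python
import concurrent.futures


def count(n):
    """CPU-bound: pure Python bytecode, so the GIL serializes it
    across threads, but separate processes each get their own GIL."""
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == "__main__":
    # threading.Thread workers would take roughly serial wall time here;
    # processes can actually run count() on multiple cores in parallel.
    with concurrent.futures.ProcessPoolExecutor() as pool:
        results = list(pool.map(count, [100_000] * 4))
    assert results == [sum(range(100_000))] * 4
```

Swapping ProcessPoolExecutor for ThreadPoolExecutor in this sketch gives no speedup for this workload, while for blocking I/O tasks threads would work fine.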

Therefore, here we have an example where threading is not suitable for CPU-bound work, but is suitable for I/O-bound work.