__host__ Specifier in CUDA

Last Updated : 17 Feb, 2026

The __host__ specifier is a function type qualifier that explicitly tells the compiler a function is designed to reside and execute on the Host (CPU). It serves as a boundary marker, ensuring that the function's logic is handled by the standard system processor rather than the parallel cores of the GPU.

Key Details of __host__

By default, every function in a .cu file is considered a __host__ function unless specified otherwise. This means standard C++ code behaves as host code automatically.

  • Compiler Assignment: When the NVCC compiler encounters a __host__ function, it passes that code to the system's native compiler (like GCC or MSVC), ensuring it is optimized for the CPU.
  • Orchestration Role: These functions are primarily used to manage the flow of the program, such as preparing data, allocating GPU memory via API calls, and launching kernels.
  • Calling Restrictions: A function marked with this keyword can only be called by other functions running on the CPU. It cannot be called directly by a thread running inside a GPU kernel.
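
The orchestration role described above can be sketched as follows. This is a minimal example, not code from the article: the kernel name `scaleKernel`, the helper `runScale`, and the launch configuration are illustrative assumptions.

```cuda
#include <cstdio>

// A trivial kernel, used only so the host function has something to launch
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// A __host__ function performing the typical orchestration flow:
// allocate GPU memory, copy data in, launch the kernel, copy results back, free.
__host__ void runScale(float *hostData, int n) {
    float *devData = nullptr;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&devData, bytes);                                  // allocate GPU memory
    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice); // push input to the GPU
    scaleKernel<<<(n + 255) / 256, 256>>>(devData, n);            // launch the kernel
    cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost); // pull results back
    cudaFree(devData);                                            // release GPU memory
}

int main() {
    float values[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    runScale(values, 4);
    printf("%.1f %.1f %.1f %.1f\n", values[0], values[1], values[2], values[3]);
    return 0;
}
```

Everything in runScale except the kernel launch itself runs on the CPU; the host function simply manages the GPU through the CUDA runtime API.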

Example: This example demonstrates how a function is explicitly marked to run on the CPU to perform standard setup tasks.

C++
#include <cstdio>

// Explicitly marked host function
__host__ void initializeSystem() {
    printf("The Host (CPU) is initializing the application...\n");
}

int main() {
    // Calling the host function from the main entry point
    initializeSystem();
    return 0;
}

Output

The Host (CPU) is initializing the application...

Explanation:

  • __host__ void initializeSystem(): the __host__ qualifier ensures this function is compiled for the CPU.
  • initializeSystem(): the main function (which is also a host function by default) calls our custom function to execute its logic.
  • Execution: the program runs sequentially on the CPU, printing the message to the console without involving the GPU hardware.

__device__ Specifier

__device__ acts as the counterpart to __host__. It declares a function that executes on the Device (GPU) and is callable only from other GPU-side code, such as a __global__ kernel. Such functions are typically used as helper functions to perform repeated calculations across thousands of parallel threads.

C++
#include <cstdio>

// Helper function running exclusively on the GPU
__device__ int calculateOffset(int x) {
    return x + 5;
}

// Kernel that acts as the bridge to call the device function
__global__ void offsetKernel() {
    int val = calculateOffset(10);
    printf("Result from GPU: %d\n", val);
}

int main() {
    offsetKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Output

Result from GPU: 15

Explanation:

  • __device__ int calculateOffset: the function is compiled for the GPU's architecture. Calling it directly from main() would produce a compile-time error.
  • offsetKernel: the __global__ function runs on the GPU, so it is allowed to call the __device__ helper.
  • Parallel Efficiency: because it is a device function, it can be executed by thousands of GPU threads simultaneously when called from a larger kernel.
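
The parallel-efficiency point can be illustrated by launching several threads that each call the same device helper. This is a sketch; the names `squareValue` and `squareKernel` and the launch configuration are illustrative assumptions.

```cuda
#include <cstdio>

// Device helper invoked independently by every thread
__device__ int squareValue(int x) {
    return x * x;
}

__global__ void squareKernel() {
    int tid = threadIdx.x;                 // each thread gets a unique index
    printf("Thread %d -> %d\n", tid, squareValue(tid));
}

int main() {
    squareKernel<<<1, 4>>>();              // 4 threads call the device helper in parallel
    cudaDeviceSynchronize();
    return 0;
}
```

Each of the four threads executes squareValue on its own index at the same time; scaling the launch configuration up scales the number of concurrent calls.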

Combining __host__ and __device__

In many cases, we need a utility function (such as a math formula) to work on both the CPU and the GPU. Instead of writing the code twice, CUDA allows us to use both specifiers together, which instructs the compiler to generate two versions of the same function.

Example: This code uses a single function definition that works on both the CPU and the GPU.

C++
#include <cstdio>

// Function compiled for BOTH CPU and GPU execution
__host__ __device__ float computeSquare(float x) {
    return x * x;
}

__global__ void gpuKernel() {
    printf("GPU Square: %.2f\n", computeSquare(4.0f));
}

int main() {
    // Calling the function on the CPU
    float cpuVal = computeSquare(4.0f);
    printf("CPU Square: %.2f\n", cpuVal);

    // Calling the function on the GPU
    gpuKernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    return 0;
}

Output

CPU Square: 16.00
GPU Square: 16.00

Explanation:

  • __host__ __device__ float computeSquare: the NVCC compiler creates two separately compiled versions of this function.
  • CPU Path: when main() calls the function, the CPU version is used.
  • GPU Path: when gpuKernel runs on the device, it uses the GPU version.
  • Consistency: the mathematical logic remains identical regardless of which processor performs the calculation.
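
When the two compiled versions genuinely need to differ, the standard __CUDA_ARCH__ macro can be used inside a __host__ __device__ function: it is defined only while NVCC compiles the device pass. A minimal sketch (the function name `whereAmI` is an illustrative assumption):

```cuda
#include <cstdio>

// One definition, two compiled versions; __CUDA_ARCH__ is defined
// only during device-side compilation.
__host__ __device__ const char* whereAmI() {
#ifdef __CUDA_ARCH__
    return "GPU";   // device version
#else
    return "CPU";   // host version
#endif
}

__global__ void reportKernel() {
    printf("Running on the %s\n", whereAmI());
}

int main() {
    printf("Running on the %s\n", whereAmI());  // host path
    reportKernel<<<1, 1>>>();                   // device path
    cudaDeviceSynchronize();
    return 0;
}
```

This pattern keeps a single definition while letting each compilation target take a different branch.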