[ CUDA: nVIDIA Tesla K40m (GK110) ]

    Device Properties:
      Device Name                                       Tesla K40m
      GPU Code Name                                     GK110
      PCI Domain / Bus / Device                         0 / 1 / 0
      Clock Rate                                        745 MHz
      Asynchronous Engines                              2
      Multiprocessors / Cores                           15 / 2880
      L2 Cache                                          1536 KB
      Max Threads Per Multiprocessor                    2048
      Max Threads Per Block                             1024
      Max Registers Per Block                           65536
      Max 32-bit Registers Per Multiprocessor           65536
      Max Instructions Per Kernel                       512 million
      Warp Size                                         32 threads
      Max Block Size                                    1024 x 1024 x 64
      Max Grid Size                                     2147483647 x 65535 x 65535
      Max 1D Texture Width                              65536
      Max 2D Texture Size                               65536 x 65536
      Max 3D Texture Size                               4096 x 4096 x 4096
      Max 1D Linear Texture Width                       134217728
      Max 2D Linear Texture Size                        65000 x 65000
      Max 2D Linear Texture Pitch                       1048544 bytes
      Max 1D Layered Texture Width                      16384
      Max 1D Layered Texture Layers                     2048
      Max Mipmapped 1D Texture Width                    16384
      Max Mipmapped 2D Texture Size                     16384 x 16384
      Max Cubemap Texture Size                          16384 x 16384
      Max Cubemap Layered Texture Size                  16384 x 16384
      Max Cubemap Layered Texture Layers                2046
      Max Texture Array Size                            16384 x 16384
      Max Texture Array Slices                          2048
      Max 1D Surface Width                              65536
      Max 2D Surface Size                               65536 x 32768
      Max 3D Surface Size                               65536 x 32768 x 2048
      Max 1D Layered Surface Width                      65536
      Max 1D Layered Surface Layers                     2048
      Max 2D Layered Surface Size                       65536 x 32768
      Max 2D Layered Surface Layers                     2048
      Compute Mode                                      Default: Multiple contexts allowed per device
      Compute Capability                                3.5
      CUDA DLL                                          nvcuda.dll (27.21.14.5423 - nVIDIA ForceWare 454.23)

    Memory Properties:
      Memory Clock                                      3004 MHz
      Global Memory Bus Width                           384-bit
      Total Memory                                      4095 MB
      Total Constant Memory                             64 KB
      Max Shared Memory Per Block                       48 KB
      Max Shared Memory Per Multiprocessor              48 KB
      Max Memory Pitch                                  2147483647 bytes
      Texture Alignment                                 512 bytes
      Texture Pitch Alignment                           32 bytes
      Surface Alignment                                 512 bytes

    Device Features:
      32-bit Floating-Point Atomic Addition             Supported
      32-bit Integer Atomic Operations                  Supported
      64-bit Integer Atomic Operations                  Supported
      Caching Globals in L1 Cache                       Supported
      Caching Locals in L1 Cache                        Supported
      Concurrent Kernel Execution                       Supported
      Concurrent Memory Copy & Execute                  Supported
      Double-Precision Floating-Point                   Supported
      ECC                                               Enabled
      Funnel Shift                                      Supported
      Half-Precision Floating-Point                     Not Supported
      Host Memory Mapping                               Supported
      Integrated Device                                 No
      Managed Memory                                    Not Supported
      Multi-GPU Board                                   No
      Stream Priorities                                 Supported
      Surface Functions                                 Supported
      TCC Driver                                        Yes
      Warp Vote Functions                               Supported
      __ballot()                                        Supported
      __syncthreads_and()                               Supported
      __syncthreads_count()                             Supported
      __syncthreads_or()                                Supported
      __threadfence_system()                            Supported
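
The Memory Clock and Global Memory Bus Width values in the Memory Properties listing above are enough to estimate the theoretical peak memory bandwidth. The sketch below is a rough calculation, not output from either tool. It assumes the reported 3004 MHz is a double-data-rate clock (two transfers per cycle), which is the usual convention for this kind of estimate. The function and constant names are illustrative only.

```python
# Values copied from the Memory Properties listing above.
MEMORY_CLOCK_MHZ = 3004   # "Memory Clock"
BUS_WIDTH_BITS = 384      # "Global Memory Bus Width"

# Assumption (not stated in the report): the clock is double data rate,
# so two transfers happen per clock cycle.
TRANSFERS_PER_CLOCK = 2


def peak_bandwidth_gb_per_s(clock_mhz, bus_width_bits, transfers_per_clock):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


peak = peak_bandwidth_gb_per_s(
    MEMORY_CLOCK_MHZ, BUS_WIDTH_BITS, TRANSFERS_PER_CLOCK
)
print(f"Theoretical peak memory bandwidth: {peak:.1f} GB/s")
```

Measured copy rates are normally well below this figure, since the estimate ignores ECC overhead, refresh cycles, and access-pattern effects.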

CUDA-Z Report
=============
Version: 0.10.251 64 bit http://cuda-z.sf.net/
OS Version: Windows x86 6.2.9200 
Driver Version: 454.23 (TCC)
Driver Dll Version: 11.0 (27.21.14.5423)
Runtime Dll Version: 6.50

Core Information
----------------
	Name: Tesla K40m
	Compute Capability: 3.5
	Clock Rate: 745 MHz
	PCI Location: 0:1:0
	Multiprocessors: 15 (2880 Cores)
	Threads Per Multiproc.: 2048
	Warp Size: 32
	Regs Per Block: 65536
	Threads Per Block: 1024
	Threads Dimensions: 1024 x 1024 x 64
	Grid Dimensions: 2147483647 x 65535 x 65535
	Watchdog Enabled: No
	Integrated GPU: No
	Concurrent Kernels: Yes
	Compute Mode: Default
	Stream Priorities: Yes

Memory Information
------------------
	Total Global: 11.9291 GiB
	Bus Width: 384 bits
	Clock Rate: 3004 MHz
	Error Correction: No
	L2 Cache Size: 1536 KiB
	Shared Per Block: 48 KiB
	Pitch: 2048 MiB
	Total Constant: 64 KiB
	Texture Alignment: 512 B
	Texture 1D Size: 65536
	Texture 2D Size: 65536 x 65536
	Texture 3D Size: 4096 x 4096 x 4096
	GPU Overlap: Yes
	Map Host Memory: Yes
	Unified Addressing: Yes
	Async Engine: Yes, Bidirectional

Performance Information
-----------------------
Memory Copy
	Host Pinned to Device: 9851.78 MiB/s
	Host Pageable to Device: 9006.76 MiB/s
	Device to Host Pinned: 9940.9 MiB/s
	Device to Host Pageable: 9067.91 MiB/s
	Device to Device: 96.2937 GiB/s
GPU Core Performance
	Single-precision Float: 3363.67 Gflop/s
	Double-precision Float: 1405.32 Gflop/s
	64-bit Integer: 176.151 Giop/s
	32-bit Integer: 708.11 Giop/s
	24-bit Integer: 699.47 Giop/s
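
To put the measured floating-point rates above in context, they can be compared with a theoretical peak derived from the reported core count and clock. The sketch below is an estimate under stated assumptions, not part of either tool's output. It assumes one fused multiply-add (2 floating-point operations) per core per cycle at the reported 745 MHz clock with no boost, and 64 double-precision units per multiprocessor for GK110. The last figure comes from the architecture, not from this report.

```python
# Values copied from the reports above.
MULTIPROCESSORS = 15
CORES = 2880                 # single-precision cores
CLOCK_MHZ = 745
MEASURED_SP_GFLOPS = 3363.67
MEASURED_DP_GFLOPS = 1405.32

# Assumptions not stated in the reports:
FLOPS_PER_UNIT_PER_CYCLE = 2   # one fused multiply-add counts as 2 flops
DP_UNITS_PER_MULTIPROCESSOR = 64  # GK110 architecture figure


def peak_gflops(units, clock_mhz, flops_per_cycle=FLOPS_PER_UNIT_PER_CYCLE):
    """Theoretical peak throughput in Gflop/s for the given unit count."""
    return units * clock_mhz * 1e6 * flops_per_cycle / 1e9


sp_peak = peak_gflops(CORES, CLOCK_MHZ)
dp_peak = peak_gflops(MULTIPROCESSORS * DP_UNITS_PER_MULTIPROCESSOR, CLOCK_MHZ)

sp_efficiency = MEASURED_SP_GFLOPS / sp_peak
dp_efficiency = MEASURED_DP_GFLOPS / dp_peak

print(f"Single precision: {sp_peak:.1f} Gflop/s peak, {sp_efficiency:.1%} achieved")
print(f"Double precision: {dp_peak:.1f} Gflop/s peak, {dp_efficiency:.1%} achieved")
```

If the card was running above 745 MHz during the benchmark, the real peak is higher and the achieved fractions are correspondingly lower.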