Investigate loopy slowness in test_stokes

A run of the test_exterior_stokes test with a warm cache shows that it spends about 44% of its time in cl_kernel_info (154.7 s cumulative out of 350.2 s total):

Tue May 16 00:00:40 2017    stokes.prof

         278440157 function calls (241904286 primitive calls) in 350.224 seconds

   Ordered by: cumulative time
   List reduced from 11125 to 40 due to restriction <40>
   List reduced from 40 to 13 due to restriction <'loopy'>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     8001    0.054    0.000  158.109    0.020 /home/wala1/src/loopy/loopy/kernel/__init__.py:1455(__call__)
     8001    0.069    0.000  156.974    0.020 /home/wala1/src/loopy/loopy/target/pyopencl_execution.py:704(__call__)
      184    0.007    0.000  154.740    0.841 /home/wala1/src/loopy/loopy/target/pyopencl_execution.py:646(cl_kernel_info)
      184    0.294    0.002  109.709    0.596 /home/wala1/src/loopy/loopy/execution.py:140(get_typed_and_scheduled_kernel)
      184    0.003    0.000   39.202    0.213 /home/wala1/src/loopy/loopy/codegen/__init__.py:375(generate_code_v2)
      182    0.577    0.003   38.551    0.212 /home/wala1/src/loopy/loopy/type_inference.py:463(infer_unknown_types)
      184    0.003    0.000   35.959    0.195 /home/wala1/src/loopy/loopy/preprocess.py:1950(preprocess_kernel)
      184    0.003    0.000   34.696    0.189 /home/wala1/src/loopy/loopy/schedule/__init__.py:1967(get_one_scheduled_kernel)
      552    0.010    0.000   27.042    0.049 /home/wala1/src/loopy/loopy/kernel/__init__.py:1532(update_persistent_hash)
    50354    0.602    0.000   25.026    0.000 /home/wala1/src/loopy/loopy/type_inference.py:385(_infer_var_type)
   176662    1.599    0.000   21.812    0.000 /home/wala1/src/loopy/loopy/kernel/instruction.py:816(update_persistent_hash)
      552    0.001    0.000   20.856    0.038 /home/wala1/src/loopy/loopy/kernel/__init__.py:1570(__ne__)
      552    0.213    0.000   20.855    0.038 /home/wala1/src/loopy/loopy/kernel/__init__.py:1548(__eq__)

I don't fully trust these numbers, because the test appears to have failed when running on CUDA, but they point to something worth investigating.
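
For anyone who wants to repeat the measurement: a listing in the format above is what `pstats` prints for a saved profile when sorted by cumulative time and restricted first to 40 entries and then to a pattern. The sketch below shows that workflow on a toy function. It assumes the profile was collected with `cProfile` (the report does not say how `stokes.prof` was produced). `busy_work`, `profile_to_file`, `report`, and `example.prof` are hypothetical names used only for illustration.

```python
import cProfile
import io
import os
import pstats
import tempfile


def busy_work(n):
    # Hypothetical stand-in for the real workload (test_exterior_stokes).
    return sum(i * i for i in range(n))


def profile_to_file(func, path, *args):
    """Run func(*args) under cProfile and dump the raw stats to path."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    profiler.dump_stats(path)
    return result


def report(path, limit=40, pattern="loopy"):
    """Return a cumulative-time listing for the profile stored at path.

    Restrictions are applied in order: keep the top `limit` entries,
    then keep only entries whose "filename:lineno(function)" string
    matches the regular expression `pattern`.
    """
    stream = io.StringIO()
    stats = pstats.Stats(path, stream=stream)
    stats.sort_stats("cumulative").print_stats(limit, pattern)
    return stream.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    prof_path = os.path.join(tmp, "example.prof")
    profile_to_file(busy_work, prof_path, 100000)
    # For the real run this would be report("stokes.prof", 40, "loopy").
    text = report(prof_path, limit=40, pattern="busy_work")
    print(text)
```

Because the restrictions are applied in order, the `loopy` filter only sees the 40 most expensive entries, so cheaper loopy functions outside that top 40 do not appear in the listing.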