It's tough to say how things will work out without seeing the data structures behind that code, but it seems like it ought to be fairly straightforward to decode.

Depending on your compiler and options, the truncation to integer may be one of the slowest operations that you have. Similarly, if you're targeting x87 code rather than something like SSE2, you're leaving a lot of performance on the table. Indexing the 3D array may (or may not) be more performance intensive than indexing a 1D array with precomputed offsets, especially on P4-class processors that are lacking in barrel shifter resources.
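To make those two points concrete, here is a rough sketch in C. The names fast_floor, flat_index and the lattice size N are mine, not from your code, and I'm assuming the inputs fit comfortably in an int. On x87 a plain float-to-int cast typically forces the compiler to switch the FPU rounding mode and back, which is where the cost comes from; with SSE2 the same cast usually becomes a single truncating convert.

```c
#include <stddef.h>

enum { N = 16 };  /* hypothetical lattice size per axis */

/* Floor without calling floor(): truncate, then step down by one
   when truncation rounded a negative non-integer toward zero. */
static int fast_floor(float x)
{
    int i = (int)x;
    return i - (x < (float)i);
}

/* Index into a flat N*N*N array instead of a [z][y][x] array.
   The eight cell corners are then at fixed offsets from the base:
   +0, +1, +N, +N+1, +N*N, +N*N+1, +N*N+N, +N*N+N+1. */
static size_t flat_index(int x, int y, int z)
{
    return ((size_t)z * N + (size_t)y) * N + (size_t)x;
}
```

Whether the flat layout actually wins depends on your compiler and CPU, so it's worth timing both.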

I'm a bit confused by the number and type of interpolations that you have. Classic Perlin noise has 7 linear interpolations (4 in x, 2 in y, and 1 in z). Each fractional index term t is first passed through 3*t*t - 2*t*t*t to get the desired smooth behavior (the improved Perlin noise uses a better quintic function, 6*t^5 - 15*t^4 + 10*t^3).
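For reference, the 4 + 2 + 1 structure looks roughly like this in C. This is only a sketch: the function names and the flat corner layout (index = x + 2*y + 4*z) are my own choices, and c[] stands for whatever eight corner contributions your implementation computes.

```c
/* Classic smoothing weight: 3t^2 - 2t^3. */
static float fade_classic(float t)
{
    return t * t * (3.0f - 2.0f * t);
}

/* Improved-noise quintic: 6t^5 - 15t^4 + 10t^3. */
static float fade_quintic(float t)
{
    return t * t * t * (t * (t * 6.0f - 15.0f) + 10.0f);
}

static float lerp_f(float a, float b, float t)
{
    return a + t * (b - a);
}

/* c[x + 2*y + 4*z] holds the eight corner values of the cell;
   fx, fy, fz are the fractional positions inside the cell. */
static float trilerp(const float c[8], float fx, float fy, float fz)
{
    float u = fade_classic(fx);
    float v = fade_classic(fy);
    float w = fade_classic(fz);

    /* 4 interpolations in x */
    float x00 = lerp_f(c[0], c[1], u);
    float x10 = lerp_f(c[2], c[3], u);
    float x01 = lerp_f(c[4], c[5], u);
    float x11 = lerp_f(c[6], c[7], u);

    /* 2 interpolations in y */
    float y0 = lerp_f(x00, x10, v);
    float y1 = lerp_f(x01, x11, v);

    /* 1 interpolation in z */
    return lerp_f(y0, y1, w);
}
```

Swapping fade_classic for fade_quintic gives the improved-noise weighting without changing the interpolation count.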

Spinning the sphere should definitely be something done outside of the main loop. The code that I normally use for such computations passes in the Cartesian coordinate for evaluation and doesn't know anything about how the world is sampled.
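A minimal sketch of that separation in C, assuming a spin about the z axis. Everything here (vec3, noise3_fn, sample_points, and especially placeholder_noise3, which is just a stand-in so the example compiles) is hypothetical and not taken from your code. The caller works out the sine and cosine of the spin angle once, and the noise function only ever sees x, y, z.

```c
typedef struct { float x, y, z; } vec3;

/* Any noise function that takes a Cartesian point. */
typedef float (*noise3_fn)(float x, float y, float z);

/* Stand-in only: a real Perlin noise evaluation would go here. */
static float placeholder_noise3(float x, float y, float z)
{
    return x + 2.0f * y + 3.0f * z;
}

/* Rotate p about the z axis using a precomputed cosine and sine. */
static vec3 spin_about_z(vec3 p, float c, float s)
{
    vec3 q = { c * p.x - s * p.y, s * p.x + c * p.y, p.z };
    return q;
}

/* cos_spin and sin_spin are computed once by the caller, not per
   sample; the noise function knows nothing about the sphere. */
static void sample_points(noise3_fn noise3, const vec3 *pts, float *out,
                          int n, float cos_spin, float sin_spin)
{
    for (int i = 0; i < n; ++i) {
        vec3 q = spin_about_z(pts[i], cos_spin, sin_spin);
        out[i] = noise3(q.x, q.y, q.z);
    }
}
```

In practice the caller would get cos_spin and sin_spin from cosf(angle) and sinf(angle) in math.h once per frame, and pts would be the precomputed sphere positions.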