|Informative Information for the Uninformed|
The PowerPC processor uses one or more on-chip memory caches to accelerate access to frequently referenced data and instructions. This cache memory is split into separate data and instruction caches. Although the data cache operates in coherent mode on Mac OS X, shellcode developers need to be aware of how the data cache and the instruction cache interact when executing self-modifying code.
As a superscalar architecture, the PowerPC processor contains multiple execution units, each of which has a pipeline. The pipeline can be described as a conveyor belt in a factory; as an instruction moves down the belt, specific steps are performed. To increase the efficiency of the pipeline, multiple instructions can be placed on the belt at the same time, one behind another. The processor will attempt to predict which direction a branch instruction will take and then feed the pipeline with instructions from the predicted path. If the prediction was wrong, the contents of the pipeline are discarded and the correct instructions are loaded into the pipeline instead.
This pipelined execution means that more than one instruction can be processed at the same time in each execution unit. If one instruction requires the output of another, a gap can occur in the pipeline while these dependencies are satisfied. In the case of a store instruction, the contents of the data cache will be updated before the results are flushed back to main memory. If a load instruction is executed directly after the store, it will obtain the newly updated value. This occurs because the load instruction reads the value from the data cache, where it has already been updated.
The instruction cache is a different beast altogether. On the PowerPC platform, the instruction cache is incoherent. If an executable region of memory is modified and that region is already loaded into the instruction cache, the modified instructions will not be executed unless the instruction cache is specifically flushed. The instruction cache is filled from main memory, not from the data cache. If you modify executable code through a store instruction, flush the instruction cache, and then attempt to execute that code, there is still a chance that the original, unmodified code will be executed instead. This can occur because the data cache was not flushed back to main memory before the instruction cache was refilled.
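The failure mode described above can be illustrated with a small Python toy model. This is only a sketch under simplifying assumptions: the class `ToyMachine` and its method names are hypothetical, caches are tracked per word rather than per block, and the stale fetch is made deterministic here even though on real hardware it is only a possibility.

```python
class ToyMachine:
    """Toy model: write-back data cache, instruction cache filled from main memory."""

    def __init__(self, memory):
        self.memory = dict(memory)  # address -> word held in main memory
        self.dcache = {}            # modified words not yet written back
        self.icache = {}            # words already fetched for execution

    def store(self, addr, value):
        # A store lands in the data cache first; main memory is untouched.
        self.dcache[addr] = value

    def load(self, addr):
        # Loads consult the data cache, so they see the new value at once.
        return self.dcache.get(addr, self.memory[addr])

    def fetch(self, addr):
        # Instruction fetch fills from main memory, never from the data cache.
        if addr not in self.icache:
            self.icache[addr] = self.memory[addr]
        return self.icache[addr]

    def flush_data(self, addr):
        # Write the modified word back to main memory and drop it from the data cache.
        if addr in self.dcache:
            self.memory[addr] = self.dcache.pop(addr)

    def invalidate_instructions(self, addr):
        # Drop the cached instruction so the next fetch refills from main memory.
        self.icache.pop(addr, None)


OLD, NEW = 0x60000000, 0x7FE00008  # nop and trap encodings, used only as markers


def modify_code(flush_data_first):
    """Patch one instruction word and return (value loaded, value fetched)."""
    m = ToyMachine({0x1000: OLD})
    m.fetch(0x1000)               # the code is already in the instruction cache
    m.store(0x1000, NEW)          # self-modification through a store
    if flush_data_first:
        m.flush_data(0x1000)      # push the new word out to main memory
    m.invalidate_instructions(0x1000)
    return m.load(0x1000), m.fetch(0x1000)


if __name__ == "__main__":
    print([hex(v) for v in modify_code(flush_data_first=False)])
    print([hex(v) for v in modify_code(flush_data_first=True)])
```

In this model a load always observes the patched word, while an instruction fetch only observes it when the data cache was written back before the instruction cache was refilled.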
The solution is a bit tricky. You must use the "dcbf" instruction to flush each modified block of memory from the data cache to main memory, wait for the flush to complete with the "sync" instruction, and then invalidate the instruction cache for that block with "icbi". Finally, the "isync" instruction needs to be executed before the modified code is actually used. Placing these instructions in any other order may result in stale data being left in the instruction cache. Due to these restrictions, self-modifying shellcode on the PowerPC platform is rare and often unreliable.
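As a rough illustration of the ordering, the following Python sketch lists the steps needed for an arbitrary modified range. The function name `cache_sync_sequence` is hypothetical, and the 32-byte block size is an assumption exposed as a parameter, since the cache block size differs between PowerPC models.

```python
def cache_sync_sequence(start, length, block_size=32):
    """Return (mnemonic, block_address) steps for a modified byte range.

    block_size is an assumed cache block size, not a fixed property of
    every PowerPC processor.
    """
    if length <= 0:
        return []
    first = start - (start % block_size)                     # align down
    last = (start + length - 1) // block_size * block_size   # block of last byte
    steps = []
    for block in range(first, last + 1, block_size):
        steps.append(("dcbf", block))   # push modified data out to main memory
        steps.append(("sync", None))    # wait for the flush to complete
        steps.append(("icbi", block))   # invalidate the stale instruction block
    steps.append(("isync", None))       # discard prefetched instructions before use
    return steps


if __name__ == "__main__":
    for mnemonic, block in cache_sync_sequence(0x1010, 0x20):
        print(mnemonic if block is None else f"{mnemonic} block 0x{block:x}")
```

A range that straddles a block boundary touches both blocks, so each of them gets its own dcbf, sync, icbi triple before the single trailing isync.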
The example below is a working PowerPC shellcode decoder included with the Metasploit Framework (OSXPPCLongXOR).
;;
;; Demonstrate a cache-safe payload decoder
;; Based on Dino Dai Zovi's PPC decoder (20030821)
;;

main:
    xor.    r5, r5, r5          ; Ensure that the cr0 flag is always 'equal'
    bnel    main                ; Never taken (cr0 is 'equal'), but the link register is still set
    mflr    r31                 ; Move the address of this instruction into r31
    addi    r31, r31, 68+1974   ; 68 = distance from branch -> payload
                                ; 1974 is null eliding constant
    subi    r5, r5, 1974        ; We need this for the dcbf and icbi
    lis     r6, 0x9999          ; XOR key = hi16(0x99999999)
    ori     r6, r6, 0x9999      ; XOR key = lo16(0x99999999)
    addi    r4, r5, 1974 + 4    ; Move the number of words to code into r4
    mtctr   r4                  ; Set the count register to the word count

xorlp:
    lwz     r4, -1974(r31)      ; Load the encoded word from memory
    xor     r4, r4, r6          ; XOR this word against our key in r6
    stw     r4, -1974(r31)      ; Store the modified word back to memory
    dcbf    r5, r31             ; Flush the modified word to main memory
    .long   0x7cff04ac          ; Wait for the data block flush (sync)
    icbi    r5, r31             ; Invalidate prefetched block from i-cache

    subi    r30, r5, -1978      ; Move to next word without using a NULL
    add.    r31, r31, r30

    bdnz-   xorlp               ; Branch if --count != 0
    .long   0x4cff012c          ; Wait for i-cache to synchronize (isync)

    ; Insert XORed payload here
    .long   (0x7fe00008 ^ 0x99999999)
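The XOR encoding and the null-avoidance tricks in the listing can be checked offline with a short Python sketch. The helper names (`xor_words`, `has_null`, `addi`) are hypothetical and this is not the Metasploit encoder itself; it only assumes big-endian 32-bit words and the standard addi encoding (primary opcode 14).

```python
import struct

KEY = 0x99999999  # XOR key loaded into r6 by the decoder


def xor_words(data, key=KEY):
    """XOR each big-endian 32-bit word of data with key."""
    if len(data) % 4:
        raise ValueError("payload must be a whole number of 32-bit words")
    words = struct.unpack(f">{len(data) // 4}I", data)
    return struct.pack(f">{len(words)}I", *(w ^ key for w in words))


def has_null(data):
    """Return True if the byte string contains a NULL byte."""
    return b"\x00" in data


def addi(rd, ra, simm):
    """Encode 'addi rD, rA, SIMM' as a 32-bit instruction word."""
    return (14 << 26) | (rd << 21) | (ra << 16) | (simm & 0xFFFF)


if __name__ == "__main__":
    payload = struct.pack(">I", 0x7FE00008)   # the single payload word from the listing
    encoded = xor_words(payload)
    print(encoded.hex(), has_null(payload), has_null(encoded))

    # Without the 1974 bias, the immediate 68 leaves a NULL byte in the opcode.
    plain = struct.pack(">I", addi(31, 31, 68))
    biased = struct.pack(">I", addi(31, 31, 68 + 1974))
    print(plain.hex(), has_null(plain))
    print(biased.hex(), has_null(biased))

    # The usual sync and isync encodings contain NULL bytes; the listing's
    # .long values set otherwise-zero bits to avoid them.
    for word in (0x7C0004AC, 0x7CFF04AC, 0x4C00012C, 0x4CFF012C):
        print(hex(word), has_null(struct.pack(">I", word)))
```

Because XOR is its own inverse, applying `xor_words` to the encoded bytes recovers the original payload, which is the same operation the decoder loop performs one word at a time.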