Many people have already posted different microbenchmarking methods for testing the size limits of various structures (e.g. ROB, PRF). I recommend looking at Henry Wong's blog posts and Maynard Handley's PDFs first; I got my ideas from them and many others. What follows is a slightly different variant based on their methods.
Testing on the Apple platform is always a hassle, and for my method we need a way to access the performance counters directly from our microbenchmark. Dougall made a kext for this purpose: essentially, it provides access to the configuration registers so the PMCs can be made available for userspace access. To make my life easier, I also made a kext to pin the microbenchmark to a specific core. The code will be uploaded once I've cleaned it up.
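For reference, once the kext has enabled userspace (EL0) access, the probes below boil down to reading a PMC system register directly. A minimal sketch in C, assuming PMC0 (the fixed cycle counter) is at the S3_2_C15_C0_0 encoding used in XNU's kpc code (treat the exact register name and setup as assumptions):

#include <stdint.h>

// Read PMC0 from userspace. This traps unless a kext has already granted
// EL0 access to the PMCs; the sysreg encoding below is an assumption.
static inline uint64_t read_pmc0_cycles(void) {
    uint64_t v;
    __asm__ volatile("mrs %0, S3_2_C15_C0_0" : "=r"(v));
    return v;
}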
First we have to build a model of Apple's PMCs. For simplicity I will only use the mrs (read) instruction. First of all, at what stage does this instruction execute? From Maynard's testing, the CPU can be separated into a few stages: Fetch, Decode, Rename, Mapping, Execution. You might guess the PMC is read at the rename stage.
To figure out if the PMC read is executed in the frontend part of the machine, consider the following probe:
mrs x0, PMC0
8 ADDs
Since the machine can run 8 ADDs/cycle and the frontend is 9 wide, if the PMC read were executed right at rename, the above probe should be able to sustain an IPC of 9. The result: 6.31 IPC. Clearly, the PMC read uses up an integer execution port! Also note that it is 6.31 IPC rather than 8, so something else odd is happening here.
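As a hedged illustration, here is roughly how such a throughput probe can be written with inline assembly and the assembler's .rept directive, again assuming PMC0 at S3_2_C15_C0_0 is the cycle counter and that userspace access is already enabled:

#include <stdint.h>

// 1000 copies of (1 PMC read + 8 independent ADDs), bracketed by two cycle
// counter reads; IPC = 9 * 1000 / elapsed cycles. Register choices are
// arbitrary; the sysreg encoding is an assumption.
static double measure_probe_ipc(void) {
    uint64_t start, end;
    __asm__ volatile(
        "mrs %0, S3_2_C15_C0_0      \n"
        ".rept 1000                 \n"
        "  mrs x9, S3_2_C15_C0_0    \n"   // the PMC read under test
        "  add x1, x1, #1           \n"   // 8 independent integer ADDs
        "  add x2, x2, #1           \n"
        "  add x3, x3, #1           \n"
        "  add x4, x4, #1           \n"
        "  add x5, x5, #1           \n"
        "  add x6, x6, #1           \n"
        "  add x7, x7, #1           \n"
        "  add x8, x8, #1           \n"
        ".endr                      \n"
        "mrs %1, S3_2_C15_C0_0"
        : "=&r"(start), "=r"(end)
        :: "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9");
    return 9.0 * 1000.0 / (double)(end - start);
}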
Since the instruction is executed at the execution stage, my guess (indirectly validated below) is that there is a 3-cycle latency from map->dispatch->scheduler->execution, and that the PMC value is captured at the start of the execution stage. Following from that model, let's start with a simple probe.
It's not unreasonable to think the PMC register is read at the rename stage and the value is then passed down the pipeline, but the test results show it is indeed read at the execution stage. First see the probe, then I will explain.
Latency Block
PMC xN<-read
Adds depend on the latency block
PMC xM<-read
sub xM, xM, xN
To understand this, you first have to know Apple's dispatch & scheduling structure:
What the ADDs do is fill up the scheduler (instructions in the dispatch buffer cannot be scheduled). Once it is full, the PMC xM<-read cannot enter the scheduler and thus cannot be scheduled; instead it has to sit inside the dispatch buffer until the scheduler for its EU has space. If the PMC register value were read at the rename stage, its value would be captured early and not updated afterwards, so the latency between the first and second PMC read would equal the time taken to process the filler instructions between them. If instead the value is captured at the execution stage, the latency should spike once the scheduler feeding the PMC read is full. One caveat: if the latency only spikes when the number of ADDs exceeds the size of the dispatch buffer + scheduler, you should NOT interpret it as the latter case, because at that point the whole machine, including the frontend, is stalled and instructions at the rename stage cannot proceed either.
The first spike is at 155 ADDs, which is less than the size of dispatch buffer + scheduler (190).
mrs x0, PMC0
nops
...
nops
mrs x1, PMC0
sub x1, x1, x0
We know that the frontend is 9 wide. Assume the following alignment in the rename & mapping stage, where N is the number of NOPs in between:
1 2 3 4 5 6 7 8 ... N+2
mrs nop nop nop nop nop nop nop ... mrs
Results:
nops latency
1-16 1
17-25 2
26-34 3
35-43 4
Case 1: N < 8. Since the PMC read instruction can be executed at a throughput of 1/cycle, both mrs instructions enter the dispatch buffer in the same cycle and then enter the scheduler back-to-back, so if we subtract the cycle numbers we should get 1. Indeed that's the case.
nop lat
1 1
2 1
3 1
4 1
5 1
6 1
7 1
Case 2: 8 <= N < (8+9) = 17. Now the second mrs instruction enters the dispatch queue one cycle after the first mrs, but they are still executed back-to-back, so we expect the same 1-cycle latency as case 1.
Case 3: 17 <= N. The latency increases by 1 for every 9 additional NOPs, which makes sense because we are just adding padding between the two mrs instructions.
There are two ways I found to probe the structures (again, credit to Henry and Maynard; both are inspired by their methods). The first probe consists of:
Latency Block // Unless noted otherwise, I will use 23x fsqrt below
Filler Instructions
Read xN<-PMC_Cycles
Read xM<-PMC_Cycles
sub res, xM, xN
The idea is that normally the two PMC reads are executed back-to-back and yield a one-cycle latency, as we've just seen. But once a certain structure (like the ROB) has been used up, there is a point where the first PMC read can pass through rename and on to execution, while the second has to wait for the structure to free up; only then can it enter the dispatch pipe.
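A minimal sketch of method 1 as inline assembly (same assumptions as before: PMC0 at S3_2_C15_C0_0 is the cycle counter and a kext has enabled userspace access; the filler count is the value you sweep):

#include <stdint.h>

// Latency block, then fillers, then two back-to-back PMC reads. The returned
// delta stays ~1 cycle until the structure under test runs out of entries.
static inline uint64_t probe_method1_nops(void) {
    uint64_t delta;
    __asm__ volatile(
        "fmov d0, #1.25             \n"
        ".rept 23                   \n"   // latency block: dependent fsqrts
        "  fsqrt d0, d0             \n"
        ".endr                      \n"
        ".rept 2180                 \n"   // filler instructions (sweep this count)
        "  nop                      \n"
        ".endr                      \n"
        "mrs x10, S3_2_C15_C0_0     \n"   // first PMC read
        "mrs x11, S3_2_C15_C0_0     \n"   // second PMC read, back-to-back
        "sub %0, x11, x10"
        : "=r"(delta) :: "x10", "x11", "v0");
    return delta;
}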
The second way looks almost the same, except for where the PMC reads are placed.
Latency Block // Unless noted otherwise, I will use 23x fsqrt below
Read xN<-PMC_Cycles
Filler Instructions
Read xM<-PMC_Cycles
sub res, xM, xN
Let's say the structure you want to test has size N. The latency block keeps the structure from freeing. When the filler count is less than N, the latency between the two PMC reads is just the time needed to process the filler instructions. When the filler count is larger than N, the latency is the processing time plus the time needed to free up space in the structure (so that the 2nd PMC read can proceed down from rename).
ROB
First we use NOPs as the filler.
Method 1:

NOPs  latency (cycles)
2180  62
...
2189  61
...
2198  60
Notice the spikes in the range 2180-2198, which match the size measured by other methods.
Method 2:
NOPs       latency (cycles)
2180-2188  243
2189-2197  304
2198-2206  305
It spikes at 2189. Notice the difference between the two values: 304-243 = 61, which matches the result from method 1. This is the time taken to free up space in the ROB so the 2nd PMC read can proceed past rename. What happens between those numbers? 243 cycles * 9 NOPs/cycle = 2187 NOPs, which matches the numbers above. Notice the latency stays the same from 2189 to 2197. One reasonable guess is that the machine must free at least 9 entries to allow the PMC read to proceed into dispatch; otherwise, adding more filler instructions would increase the latency instead of staying flat for 9 more fillers. Intuition is good but not proof, so let's try to be more quantitative. To be precise, we have to figure out the exact alignment of the instructions when they enter the rename & mapping stage. To make things easier, below is the probe:
FSQRT *23
mrs xN, PMC_Cycles
NOPs *2189
mrs xM, PMC_Cycles
Recall the 243 cycles with 2180-2188 NOPs. This requires our instructions to align as follows: decode is 9 wide, so the 2180th NOP must be the 1st of its decode group; placing it anywhere else would increase the latency (by delaying the mrs instruction to the next decode cycle) before we even reach 2188 NOPs.
F=FSQRT
N=nop
M=Mrs
1 2 3 4 5 6 7 8 9
F F F F F F F F
F F F F F F F F F
F F F F F F M N N
N N N N N N N N N 2180..2189
M
With this in mind, assume also the 3 stages from map->dispatch->scheduler->execution, and take the case with 2189 NOPs.
T=0: the first 8 FSQRTs enqueue into the ROB.
T=1: 8+9 = 17 FSQRTs; the FSQRTs from T=0 are now in the dispatch buffer.
T=2: 17+6 = 23 FSQRTs + 1 MRS + 2 NOPs; the FSQRTs from T=0 are now in the scheduler.
T=3: 1 FSQRT (13 cycles left) + 22 FSQRTs + 1 MRS + 11 NOPs; the 1st FSQRT from T=0 is now in execution.
T=3+242 = 245: 1 FSQRT (5 cycles left) + 4 FSQRTs + 1 mrs + 2189 NOPs.
T=246: the MRS enters the mapping stage; no slot in the ROB, the machine stalls.
T=245+57: all FSQRTs clear, the MRS can proceed.
T=245+57+3: the MRS enters execution.
And guess what: 57+3 = 60, so the machine stalled for 60 cycles, which is exactly what we expected: 304 - 244 (the cycles needed if the ROB weren't full) = 60. This also validates the hypothesis that there are 3 stages between mapping and execution. So you might wonder why Apple does it this way. Perhaps the ROB is organized into "freeing units", where each unit consists of X entries and the minimum freeing granularity is one unit?
Before moving forward, we have to understand how mrs behaves inside the ROB. One question we'd like to answer: how many "entries" are allocated per mrs? We cannot simply use the mrs instruction as the filler, because that would use up other resources (e.g. PRF, scheduler) before we fill the ROB entries.
FSQRT *23(delay block)
mrs xN, PMC_Cycles
NOPs
MRSs
mrs xM, PMC_Cycles
So what we can do is replace some of the NOPs with MRSs and see what happens to the filler/time graph.
These are some interesting results. Look at the figure below: the red-shaded cells are the ones with the same delay (e.g. replacing 2 NOPs with 2 MRSs has the same delay as replacing 3). What we are testing here is the "effective" ROB size, i.e. how many instructions the ROB can hold before it is full (hence the latency spike). The overall pattern is that the "effective" ROB size gets smaller as we replace more NOPs with MRSs, which indicates that each MRS takes up more space than a single NOP. But how can we explain the alternating pattern?

Consider the results from the following probe:
FSQRT *23
FSQRT
mrs xN, PMC_Cycles
FSQRT (depends on previous chain)
mrs xM, PMC_Cycles
At N=2188 the cycle count spikes from 243 to 303. Recall the diagram above: this code only changes the NOP at column 8, line 3 into an FSQRT.
1 2 3 4 5 6 7 8 9
F F F F F F F F
F F F F F F F F F
F F F F F F F M N
N N N N N N N N N
M X X X
If the machine freed the entry containing that newly added FSQRT, the latency would increase by at least 13 cycles. But as the result shows, it does not. This still doesn't tell us what the granularity of freeing is, but notice what we can do now: the slots after that FSQRT do not free up, because that FSQRT is blocking at the head. What we can do is place X more NOPs before the second mrs (as shown below). The results show we can add 8 more NOPs before the mrs, which indicates that:
1 2 3 4 5 6 7 8 9
F F F F F F F F
F F F F F F F F F
F F F F F F M F N
N N N N N N N N N
N N N N ... M (Add X more NOPs before the Mrs)
Using ADDs as the filler:
362  264
Spikes at 362 ADDs, which again matches the public data.
Using FP adds as the filler:

416  115
PRRT (Physical Register Reclaim Table)
This is a structure used by Apple to record updates to the mapping table. Using mov xN, #0 as the filler.

Fillers  latency (cycles)
650      25
659      91
668      117
Three spikes across 650-668.
Method 2:

We have been using only the cycle counter so far, but there are many more counters on Apple Silicon machines (both documented and undocumented); we will try a few of them to get some insight into the architecture. The definitions of the counter names are taken from the Apple Silicon CPU Optimization Guide released by Apple.
From Section 4.3, Processor Pipeline Fundamentals, of the guide, the term μop is defined as:
the processor decodes (translates) ISA instruction bytes into executable units called microoperations (μops).
And also the Retirement Unit (Retire): Collects completed μops and commits the oldest results to registers and memory to ensure the appearance of sequential execution.
ROB dequeueing speed
NOP is a good choice, since it requires ROB entries but doesn't take a destination register, and hence the ROB release rate isn't constrained by the freeing speed of other structures. To find the maximum freeing speed, we should make sure that (the number of NOPs in the ROB at the time the PMC is read) is greater than (the maximum number of NOPs that can be freed per cycle * the latency between the 2 PMC reads). So the idea is: use a delay block to keep the ROB from freeing until it is full; once the delay block clears, the machine can start to drain the NOPs at full speed. We then insert 2 back-to-back mrs instructions to get the retire counts.
FSQRT *23
NOPs
mrs xN, PMC_RETIRE_UOP
mrs xM, PMC_RETIRE_UOP
sub XM, XM, XN
The code is the same as method 1, except we read the retire_uop counter instead of cycles. There are 2 nuances in this experiment:
- You have to use the numbers from after all the fsqrts have been freed; otherwise the freeing rate will be constrained by the rate at which the fsqrts execute.
- Be careful with the "gap" between the two MRS reads: when the machine resumes and only one more ROB slot is available, the 2nd MRS cannot execute until the next ROB slot becomes available.
The 3 spikes show 280 retired uops, and then the value stabilizes at 56 retired uops. Recall the cycle numbers from the previous test: if we divide the retired uops by the cycles between the two mrs instructions, we get the average retired uops/cycle.

NOPs   retired uops/cycle
2180   4.51
2189   4.59
2198   4.67
2199+  56
Before 2180 NOPs we see 0 uops retiring, because at the point when the two MRSs execute the FSQRTs are still blocking, and the ROB is not yet full, so the MRS instructions can enter the machine. Let's leave the 3 spikes for later. Beyond 2199 NOPs, I think what happens at the point of the two MRS reads is:
- All the fsqrts have finished execution.
- The MRSs go through the pipeline while the ROB is draining.
- The two MRS instructions are executed back-to-back, and the number of NOPs in the ROB is larger than the maximum drain rate per cycle (that's why you see the flat line in the graph, i.e. the remaining NOPs in the ROB can sustain a few cycles of maximum draining speed).
Now we know the maximum draining speed of the ROB is 56 NOPs/cycle. Getting back to the 3 spikes, my theory is: the first MRS enters execution because there is space in the ROB, but the second MRS cannot proceed because the ROB has no space for it. Only after all the FSQRTs have finished execution can the ROB start freeing slots; from the previous testing we know this takes ~57 cycles (depending on the exact alignment of the uops in the ROB). After those 57 cycles the first 56 FSQRTs+NOPs can be freed, and during the extra ~3 cycles it takes the MRS to pass down the pipeline more NOPs are freed, adding up to the 280 retired uops we measured. As a side note, if we keep adding NOPs the retired uops/cycle starts to drop, which matches the theory: beyond some point the ROB no longer holds enough NOPs to sustain the maximum freeing speed every cycle.
What if we use other fillers?
Replacing nop with mov xN, #0:
The results aren't as stable as with the ROB, but the maximum throughput is around 56. However, I don't think this means the PRRT can be freed at 56 slots/cycle: the structure itself is decoupled from the ROB, and I think the retire_uops counter counts instructions retired from the ROB. This is really one of the disadvantages of using the more specific counters: we aren't so sure what they actually measure...
We have to find another way to get the freeing speed of the PRRT.
Testing with MRS
Previously we saw some weird results with the MRS instructions; here we try to find out how many MRS instructions can be retired per cycle (from the ROB).
FSQRT *32
MRS *100
NOPs
mrs xN, PMC_RETIRE_UOP
mrs xM, PMC_RETIRE_UOP
sub XM, XM, XN
The results are normalized; the maximum is 8 retired uops/cycle, which is 7 times less than the 56/cycle we got with NOPs. 56/8 = 7: perhaps the ROB is organized into racks, where each rack can hold 7 NOPs?
By filling the ROB with the following pattern we can try to find out how many NOPs each rack can hold. The idea: assume each MRS takes a full rack of the ROB and each rack can hold N NOPs, then pad NOPs in between the MRSs, like this:
NOP *M
MRS
NOP *M
MRS
If we increase M gradually, the retired uops/cycle should increase until M reaches N; beyond that the NOPs in between take two racks and the utilization per rack drops. Since in our theory the machine can retire 8 racks/cycle, the retired uops/cycle should drop and then start increasing again, as sketched below. (TODO: add results)
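To make that prediction concrete, here is a tiny model of the rack theory (the parameters are the hypothesized values from above, 7 NOPs per rack and 8 racks retired per cycle; this models the guess, it is not measured data):

#include <stdio.h>

#define RACK_CAP        7   // hypothesized NOPs per rack
#define RACKS_PER_CYCLE 8   // hypothesized racks retired per cycle

// Retired uops/cycle for the repeating pattern (NOP * m, then one MRS),
// assuming an MRS always occupies a whole rack by itself.
static double retire_rate(int m) {
    int nop_racks = (m + RACK_CAP - 1) / RACK_CAP;
    int racks_per_pattern = nop_racks + 1;          // +1 rack for the MRS
    int uops_per_pattern = m + 1;
    return (double)RACKS_PER_CYCLE * uops_per_pattern / racks_per_pattern;
}

int main(void) {
    // Rate climbs until m hits RACK_CAP, drops when the NOPs spill into a
    // second rack, then climbs again -- the sawtooth the text predicts.
    for (int m = 0; m <= 21; m++)
        printf("M=%2d  uops/cycle=%5.2f\n", m, retire_rate(m));
    return 0;
}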
From previous testing we know there is a dispatch buffer sitting in front of the scheduler. One might ask: is there a bypass path from Map directly to the scheduler, skipping the dispatch buffer and avoiding one cycle of delay (provided the scheduler has enough space)? A related question: if there is no available instruction to choose from the scheduler, can an instruction be issued directly from the dispatch buffer? (Related patent: Hierarchical reservation station; thanks to Maynard Handley for pointing this out.) [If you're interested, this Apple patent describes the overall structure of the dispatch buffer and scheduler.]
For simplicity, let's address the second question first. What we can do is run a sequence of filler instructions that depend on a long-latency chain, followed by a second sequence of independent instructions. If instructions can be executed directly from the dispatch buffer, we should only see the latency spike once the dependent instructions have filled both the dispatch buffer and the scheduler, so that instructions at rename cannot proceed. If instructions cannot be executed from the dispatch buffer, we should see the spike as soon as the schedulers alone are filled. For easier testing I will use FP instructions (since MRS doesn't interfere with FP execution). Step one is to find out the size of the FP dispatch buffer + scheduler.
sdivs(latency chain)
fmov dK<-sdiv_result
mrs xN, PMC_Cycles
fadds(depend on fmov)
mrs xM, PMC_Cycles
sub xM, xM, xN
There is a clear spike at 175 fadds: the DB and scheduler should be able to hold 1 fmov + 174 fadds, and with one more fadd the rename stalls and the mrs cannot proceed.
So now we know the DB + scheduler can hold 175 instructions. As described, we now add independent fadds behind the dependent fadds. Here are the results for different numbers of independent fadds.
Notice that for independent fadds <= 15, the number of dependent fadds you can fill goes down by 5 each time you add 5 more independent fadds behind them. This suggests independent fadds in the dispatch buffer cannot be executed directly. Let's think about the case with 5 independent fadds:
With 170 dependent fadds in the dispatch buffer and scheduler, the 5 independent fadds enter the dispatch buffer; since they cannot be executed directly from there, they each take up a slot in the dispatch buffer. TODO: explain what happens with the red line.
Now we can answer the first question. The code looks almost the same, but changes the type of instruction used for the dependent block.
sdivs(latency chain)
fmov dK<-sdiv_result
mrs xN, PMC_Cycles
fcmp(depend on fmov)
fadds(Independent)
mrs xM, PMC_Cycles
sub xM, xM, xN
Recall that the FP part of the machine consists of one dispatch buffer coupled to 4 separate schedulers. Fcmp can only be executed on 1 of the 4, while fadd can be executed on all 4. We can exploit this fact: use dependent fcmps to fill up that scheduler and the dispatch buffer. Now that both the DB and the scheduler are full, we put some fadds behind them.
- If instructions can enter a scheduler directly (when it has space): the other 3 schedulers are empty, so the fadds can enter them and execute; no matter how many fadds we add, the latency should not be affected.
- If instructions have to enter the dispatch buffer first: there is no space left in the dispatch buffer because the fcmps have used it all, so the fadds cannot proceed and the machine stalls.
Again, we first have to find out how many fcmps the DB + scheduler can hold.
It spikes at 95 fcmps + 1 fmov, so there are 95 entries in the dispatch buffer + scheduler for fcmp instructions. (I think the bump at ~125 happens because of the limit on flag registers.)
Now, what happens if we add in the fadds?
The results show it's case 2: adding more fadds acquires more slots in the dispatch buffer. Note the cases with fadds: just as in the last test, each fadd takes space in the dispatch buffer. But the cases with 15 fadds and 21 fadds spike at the same number of fcmps. I think this is because the fadds in the dispatch buffer cannot pass down to their respective schedulers even though they are independent. When the dependent fcmps fill up their scheduler they start filling the dispatch buffer; since we have 2 schedulers for fcmp, that's ~80, so with 80 fcmps the schedulers are full and the next fcmps have to reside in the dispatch buffer. Now, when you add more fadds than the dispatch buffer has room for, rename gets blocked and the mrs cannot proceed. So the reason 21 fadds spike at the same fcmp count as 15 is that the fadds can proceed as long as no fcmp is stuck in the DB: no matter how many fadds we add, they can enter the schedulers and get executed.
I think this has something to do with maintaining the ordering of instructions. My theory is that the dispatch buffer keeps some sort of time-tag for each instruction when it enters the dispatch buffer. Each cycle, the dispatch buffer checks two things:
- Is the scheduler for the current instruction available?
- Does an older unissued (i.e. not yet issued into the scheduler) instruction exist?
If both requirements are satisfied (scheduler available and no older unissued instruction in the buffer), the instruction can proceed to the scheduler. Notice that the dispatch buffer makes no attempt to check whether the instruction is runnable. This maintains ordering between instructions; from the tests we know instructions first enter the dispatch buffer and then the scheduler. Take the fcmp case: 9 fcmps are dispatched into the DB, and 2 are issued to the scheduler each cycle. Each cycle the DB picks the 2 oldest fcmps to issue into the scheduler, so the schedulers always receive instructions in age order.
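To make the policy concrete, here is a tiny simulation of that model (entirely my own illustration of the theory above; the buffer size, issue width, and scheduler count are made-up parameters):

#include <stdbool.h>
#include <stdio.h>

#define DB_SIZE     16
#define ISSUE_WIDTH 2
#define NUM_SCHED   4

typedef struct {
    int  age;     // time-tag assigned at dispatch (smaller = older)
    int  sched;   // which scheduler this uop must go to
    bool valid;
} DBEntry;

// Issue up to ISSUE_WIDTH entries per cycle, strictly oldest-first. If the
// oldest unissued entry cannot move (its scheduler is full), nothing younger
// moves either -- operand readiness is never checked here.
static int issue_one_cycle(DBEntry db[], int free_slots[]) {
    int issued = 0;
    while (issued < ISSUE_WIDTH) {
        int oldest = -1;
        for (int i = 0; i < DB_SIZE; i++)
            if (db[i].valid && (oldest < 0 || db[i].age < db[oldest].age))
                oldest = i;
        if (oldest < 0) break;                          // buffer empty
        if (free_slots[db[oldest].sched] == 0) break;   // oldest is stuck
        free_slots[db[oldest].sched]--;
        db[oldest].valid = false;
        issued++;
    }
    return issued;
}

int main(void) {
    // The oldest entry targets a full scheduler, so nothing issues this cycle,
    // even though younger entries target schedulers with plenty of room.
    DBEntry db[DB_SIZE] = { {0, 0, true}, {1, 1, true}, {2, 2, true} };
    int free_slots[NUM_SCHED] = { 0, 8, 8, 8 };
    printf("issued this cycle: %d\n", issue_one_cycle(db, free_slots));
    return 0;
}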
For some reason Apple decided to remove the documentation of their schedule_uop counter after A14/M1, so we have to do some tests to figure out the config codes for the counters. This introduces some uncertainty, so take the results with a grain of salt. (If you would like to know the process of finding the config codes, see the #PMC counters section. For continuity I will use the codes obtained from the experiment below.)
The main concept of a modern machine is out-of-order execution of instructions, i.e. executing them whenever their source operands are ready. Applying this to loads and stores: for a load, you want to execute it when its address is ready and the previous store it depends on has produced its value; for a store, you can execute it whenever its address and data are ready. You also have to make sure the data is written into the cache only once the store has been confirmed non-speculative.
You have to save some extra state for each store instruction (basically its address), so that when it retires you know where to write the data. So you create a structure called the store queue. Each time a store enters the rename stage you allocate one entry in the queue, and at the retire stage you complete the store (i.e. write the data into the cache).
The real task is to establish correct dependencies between store and load instructions; this is like register renaming, except you have to figure out the dependencies at runtime. So how are we going to do it? Imagine you have a load that is ready for execution (i.e. it has its address). Because the machine is out-of-order, it's possible the load would execute before the store it depends on has generated its data. So you have to figure out: has the store it depends on executed yet? This is where the store queue comes in handy: you search it for all stores older than the load, and you get three cases:
- You find the youngest store older than the load whose address matches, i.e. the store the load depends on. Note that all the stores between that store and the load must have gotten their addresses; otherwise fall back to 3).
- No store in the queue matches the address of the load. If there are stores with uncomputed addresses, fall back to 3).
- There are older stores that have not yet computed their addresses, so you are not sure whether the load depends on the stores currently in the queue.
For 1), you can just forward the result from that store. For 2), just read the data from the cache. The problem really arises in 3), and it's a common case: think of something like a loop where you load some data, do some processing, then store the result to memory (a linked list, say). Modern machines are very wide, so it's entirely possible that when you start the second iteration the previous stores haven't gotten their addresses yet. The load is usually the first instruction in a dependency chain, so you always want to execute it early. Delaying the load until all older stores have computed their addresses would hurt performance, and most of the time the load does not depend on the previous stores in the OoO window. What you can do instead is be aggressive: always send the load to the cache, assuming there is no dependency. But things can go wrong; to maintain correct execution, you have to make sure that if the load later turns out to depend on an older store in the store queue, the load and all instructions after it get flushed. How do you perform the check? Extra state again! You mark the load as speculative; when a store computes its address, it searches all loads younger than itself, and if a load is speculative and the addresses match, that load has read stale data from the cache and should be re-executed. Hence, every time a load executes in case 3 you allocate an entry in the load queue, so that later, when the stores execute, they can search for loads that read stale data. Even better, you can use a predictor to predict the dependencies between loads and stores: if the predictor says the load depends on a previous store, you only issue it after that store has generated its address.
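As a minimal sketch of the model just described (my own pseudo-implementation of the idea, not Apple's design), the store-queue search done by a load and the younger-load check done by a store might look like this:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;
    uint64_t data;
    bool     addr_ready;   // has this store computed its address yet?
} SQEntry;

typedef struct {
    uint64_t addr;
    bool     speculative;  // executed before all older store addresses were known
} LQEntry;

typedef enum { FORWARD_FROM_STORE, READ_FROM_CACHE, SPECULATE } LoadAction;

// Case analysis for a ready load. sq[0..n-1] holds only the stores *older*
// than the load, oldest first. Scanning youngest-to-oldest:
//   match on an address-ready store   -> case 1: forward its data
//   unknown address encountered first -> case 3: speculate, allocate LDQ entry
//   no match, all addresses known     -> case 2: read from the cache
LoadAction resolve_load(const SQEntry *sq, int n, uint64_t load_addr,
                        uint64_t *fwd_data) {
    for (int i = n - 1; i >= 0; i--) {
        if (!sq[i].addr_ready)
            return SPECULATE;
        if (sq[i].addr == load_addr) {
            *fwd_data = sq[i].data;
            return FORWARD_FROM_STORE;
        }
    }
    return READ_FROM_CACHE;
}

// The other half: once a store computes its address, it checks all *younger*
// speculative loads; any address match means that load read stale data and
// must be flushed (together with everything after it) and re-executed.
bool store_needs_flush(const LQEntry *younger_loads, int n, uint64_t st_addr) {
    for (int i = 0; i < n; i++)
        if (younger_loads[i].speculative && younger_loads[i].addr == st_addr)
            return true;
    return false;
}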
So now our machine has two additional structures, the store queue and the load queue. The ideal design would be: when a store is issued, you allocate one entry in the store queue; when a load is issued and it falls into case 3 (so you speculatively send it to the cache), you allocate one entry in the load queue. The store queue entry is marked completed when the store retires. The load queue entry can be freed once all the stores preceding it have finished their address checks. Now it's time to do some testing!
Fsqrt d0, d0
MRS PMC_CYCLE
FCVTZS x2, d0
Store xK, [x1, x2]; where x1+x2!=x0
Load xN, [x0]; Filler
MUL xM, xN, xN; Chain
MRS PMC_CYCLE
When the loads issue, the previous store will be sitting in the scheduler because the index (x2) it depends on hasn't been produced yet. So according to our model this is case 3: the store hasn't generated its address yet, so the loads have to execute speculatively. The reason the store address must differ from the load address is that we want the load-store dependency predictor to eventually capture this and mark the loads independent of the store (more below). After some number of loads the load queue is full and we cannot issue more loads speculatively, so the MULs that depend on the load results cannot execute. The MULs then fill up their schedulers and dispatch buffers and block rename, so the 2nd MRS cannot execute.
The first spike is at 146 loads, but what's interesting are the sharp turns after it: at 151 loads the 2nd MRS can proceed again, which implies the loads can execute? But how?
When does Apple choose to allocate a slot in the load queue? In the ideal case we assumed previously, you only allocate if you execute the load speculatively, i.e. you made a prediction that the load doesn't depend on older unexecuted stores. However, it's also possible that Apple allocates whenever a load is issued. Consider the following probe:
Fsqrt d0, d0
MRS PMC_CYCLE
FCVTZS x2, d0
Store xK, [x1, x2]; where x1+x2!=x0
Store xK, [x3]; where x3==x0 and is available
Load xN, [x0]; Filler
MUL xM, xN, xN; Chain
MRS PMC_CYCLE
If the load queue entry is allocated only when the load is executed speculatively, then the load fillers should not take up space in the LDQ, because the store right before the load fillers will forward its result directly to the loads (that store's address is available when the load fillers issue, so there is no need to guess; just grab the data from that store). On the other hand, if the LDQ entry is allocated right at the issue stage, then the loads would take up space in the LDQ.
The results show the latter case. I'm not sure why Apple doesn't optimize this: delaying the allocation until after the store queue is scanned (and the LDSP prediction is checked) would amplify the effective size of the LDQ for certain code patterns. Perhaps they will do so in future designs?
M3 (Everest) undocumented PMCs
e9 int issued?
ee fp/simd reg?
ef gpr dest reg related?
278 Total Issued?
280 LDST uop issued? LDR?
354 fp/simd issued?
In a simple in-order, one-wide machine, we fetch one instruction every cycle and at the same time add 4 to the PC. Now, if the instruction fetched in the current cycle is a branch (detected in the pre-decode stage), we lock up the fetch engine and insert bubbles into the next stage of the pipeline (think of something like a NOP). After X cycles the branch has been executed and we know the next fetch address, and then we can resume fetching.
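A toy model of that one-wide fetch engine, just to make the bubble-insertion behavior concrete (the resolve latency X is a made-up parameter here):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BRANCH_RESOLVE_LATENCY 8   // the hypothetical "X cycles"

typedef struct { bool is_branch; uint64_t resolved_target; } Inst;

// Fetch one instruction per cycle; on a branch, stall fetch and insert
// bubbles until the branch resolves, then continue from its target.
static uint64_t fetch_trace(const Inst *imem, int n_insts, int n_cycles) {
    uint64_t pc = 0, fetched = 0, pending_target = 0;
    int stall = 0;   // bubble cycles remaining until the branch resolves
    for (int c = 0; c < n_cycles; c++) {
        if (stall > 0) {                        // bubble: nothing fetched
            if (--stall == 0) pc = pending_target;
            continue;
        }
        if (pc / 4 >= (uint64_t)n_insts) break;
        const Inst *i = &imem[pc / 4];
        fetched++;
        if (i->is_branch) {                     // spotted in pre-decode: lock fetch
            stall = BRANCH_RESOLVE_LATENCY;
            pending_target = i->resolved_target;
        } else {
            pc += 4;                            // sequential fetch, PC += 4
        }
    }
    return fetched;
}

int main(void) {
    Inst prog[4] = { {false, 0}, {true, 8}, {false, 0}, {false, 0} };
    printf("fetched %llu instructions in 20 cycles\n",
           (unsigned long long)fetch_trace(prog, 4, 20));
    return 0;
}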

