Manual: AMD x86 Typewriter

Page 1

AMD Athlon™ Processor x86 Code Optimization Guide

Trademarks

AMD, the AMD logo, AMD Athlon, K6, 3DNow!, and combinations thereof, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation.

Page 3

22007E/0, November 1999
AMD Athlon™ Processor x86 Code Optimization

Contents

Revision History . . . xv
1 Introduction . . . 1
About this Document . . .

Page 4

Switch Statement Usage . . . 21
Optimize Switch Statements . . .

Page 5

Use 8-Bit Sign-Extended Displacements . . . 39
Code Padding Using Neutral Code Fillers . . .

Page 6

7 Scheduling Optimizations . . . 67
Schedule Instructions According to their Latency . . . 67
Unrolling Loops . . .

Page 7

Signed Derivation for Algorithm, Multiplier, and Shift Factor . . . 95
9 Floating-Point Optimizations . . . 97
Ensure All FPU Data is Aligned . . .

Page 8

Fast Conversion of Signed Words to Floating-Point . . . 113
Use MMX PXOR to Negate 3DNow! Data . . . 113
Use MMX PCMP Instead of 3DNow! PFCMP . . .

Page 9

Integer Execution Unit . . . 135
Floating-Point Scheduler . . .

Page 10

PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h) . . . 167
Starting and Stopping the Performance-Monitoring Counters . . .

Page 11

List of Figures

Figure 1. AMD Athlon™ Processor Block Diagram . . . 131
Figure 2. Integer Execution Pipeline . . .

Page 13

List of Tables

Table 1. Latency of Repeated String Instructions . . . 84
Table 2. Integer Pipeline Operation Types . . .

Page 14

Table 29. VectorPath Integer Instructions . . . 231
Table 30. VectorPath MMX Instructions . . .

Page 15

Revision History

Date       Rev  Description
Nov. 1999  E    Added “About this Document” on page 1. Further clarification of “Consider the Sign of Integer Operands” on page 14.

Page 17

1 Introduction

The AMD Athlon™ processor is the newest microprocessor in the AMD K86™ family of microprocessors.

Page 18

previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide contains the following chapters:

Chapter 1: Introduction.

Page 19

Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.

Page 20

AMD Athlon™ Processor Microarchitecture Summary

The AMD Athlon processor brings superscalar performance and high operating frequency to PC systems running industry-standard x86 software.

Page 21

AMD Athlon execution core to achieve and sustain maximum performance.

Page 22

The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors.

Page 23

2 Top Optimizations

This chapter contains concise descriptions of the best optimizations for improving the performance of the AMD Athlon™ processor.

Page 24

■ Avoid Placing Code and Data in the Same 64-Byte Cache Line

Optimization Star

The top optimizations described in this chapter are flagged with a star.

Page 25

anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:

Prefetch Length = 200 (DS/C)

■ Round up to the nearest cache line.
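The formula and the rounding rule can be sketched in C. This is only an illustration: the snippet on this page does not define DS and C, so the sketch assumes DS is the data stride in bytes per loop iteration and C is the number of cycles per loop iteration. The function name and the `CACHE_LINE_BYTES` constant are hypothetical; the 64-byte line size is the one this guide states for the AMD Athlon processor.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* cache-line size this guide gives for the Athlon */

/* Prefetch Length = 200 * (DS / C), rounded up to a whole cache line.
 * ds_bytes: assumed meaning of DS, data stride in bytes per loop iteration
 * cycles:   assumed meaning of C, cycles per loop iteration (must be > 0) */
size_t prefetch_length(size_t ds_bytes, size_t cycles)
{
    /* ceil(200 * DS / C) in integer arithmetic */
    size_t raw = (200u * ds_bytes + cycles - 1u) / cycles;
    /* round up to the next multiple of the cache-line size */
    return (raw + CACHE_LINE_BYTES - 1u) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For instance, a loop that consumes 8 bytes per iteration in 10 cycles gives 200 * 8 / 10 = 160 bytes, which rounds up to 192 bytes, that is, three cache lines ahead.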

Page 26

Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use load-execute floating-point instructions with integer operands.

Page 27

Avoid Placing Code and Data in the Same 64-Byte Cache Line

Consider that the AMD Athlon processor cache line is twice the size of previous processors.

Page 29

3 C Source Level Optimizations

This chapter details C programming practices for optimizing code for the AMD Athlon™ processor.

Page 30

Consider the Sign of Integer Operands

In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate.

Page 31

Example (Avoid):
    int i;        ====>  MOV EAX, i
                         CDQ
    i = i / 4;           AND EDX, 3
                         ...
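The signed/unsigned distinction in this example can be seen from the C side with a minimal sketch. The function names are hypothetical, and what a particular compiler emits for either function is an assumption rather than a guarantee.

```c
/* Signed: the quotient must round toward zero, so for negative values the
 * compiler has to add a correction before shifting (the CDQ / AND EDX, 3
 * sequence in the example above). */
int quarter_signed(int i)
{
    return i / 4;
}

/* Unsigned: no correction is needed, so the division can typically be
 * reduced to a single logical shift right by 2. */
unsigned int quarter_unsigned(unsigned int i)
{
    return i / 4;
}
```

When a variable can never be negative (a count, an index, a size), declaring it unsigned lets the compiler pick the cheaper form.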

Page 32

Note that source code transformations will interact with a compiler’s code generator and that it is difficult to control the generated machine code from the source level.

Page 33

*res++ = dp; /* write transformed z */
dp = vv->x * *m++;
dp += vv->...

Page 34

Completely Unroll Small Loops

Take advantage of the AMD Athlon processor’s large, 64-Kbyte instruction cache and completely unroll small loops.
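As an illustration of complete unrolling, here is a minimal sketch using a three-element dot product. The function names and the choice of a dot product are assumptions made for this example; they are not taken from the guide.

```c
/* Rolled form: the loop counter and branch run three times
 * for only three multiply-adds. */
float dot3_loop(const float a[3], const float b[3])
{
    float sum = 0.0f;
    for (int i = 0; i < 3; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Completely unrolled form: no counter and no branch. */
float dot3_unrolled(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

The unrolled version trades a little code size for the removal of all loop control, which is the trade the large instruction cache makes affordable.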

Page 35

code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency.
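A source-level way to remove such a dependency is to carry the value in a local variable instead of reloading it from memory. The following is a sketch under that assumption; the running-sum loop and the function names are illustrative and not quoted from the guide.

```c
/* Avoid: each iteration stores x[k], and the next iteration
 * immediately loads it back as x[k - 1]. */
void prefix_sum_dependent(double *x, int n)
{
    for (int k = 1; k < n; k++)
        x[k] = x[k - 1] + x[k];
}

/* Preferred: keep the running value in a local that can stay in a
 * register, so no load depends on the preceding store. */
void prefix_sum_local(double *x, int n)
{
    if (n <= 0)
        return;
    double t = x[0];
    for (int k = 1; k < n; k++) {
        t = t + x[k];
        x[k] = t;
    }
}
```

Both functions compute the same result; only the second makes the register reuse explicit, which matters when the compiler cannot prove the transformation is safe on its own.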

Page 36

Consider Expression Order in Compound Branch Conditions

Br...

Page 37

Switch Statement Usage

Optimize Switch Statements

Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees.

Page 38

Use Const Type Qualifier

Use the “const” type qualifier as much as possible.

Page 39

Generalization for Multiple Constant Control Code

To generalize this further for multiple constant control code, some more work may have to be done to create the proper outer loop.

Page 40

case combine( 1, 1 ):
    for( i ... ) {
        DoWork1( i );
        DoWork3( i );
    }
    break;
default:
    ...

Page 41

which might inhibit certain optimizations with some compilers, for example, aggressive inlining.

Page 42

lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.

Page 43

Example 1
Avoid:
    double a,b,c,d,e,f;
    e = b*c/d;
    f = b/d*a;
Preferred:
    d...

Page 44

Pad by Multiple of Largest Base Type Size

Pad the structure to a multiple of the largest base type size of any member.

Page 45

quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared.

Page 46

necessary for the currently selected precision. This means that setting precision control to single precision (versus Win32 default of double precision) lowers the latency of those operations.

Page 47

Avoid Unnecessary Integer Division

Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible.
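One common way to drop a division is to merge chained divisions into a single one. The sketch below is illustrative only; the function names are hypothetical, and the rewrite is valid only under the stated assumption that the product of the divisors does not overflow.

```c
/* Avoid: two dependent integer divisions. */
int div_twice(int i, int j, int k)
{
    return i / j / k;
}

/* Preferred: one multiplication and one division.
 * Assumes j * k does not overflow int and neither divisor is zero. */
int div_once(int i, int j, int k)
{
    return i / (j * k);
}
```

Because C integer division truncates toward zero, both forms yield the same quotient whenever the product `j * k` is representable.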

Page 48

Example 1 (Avoid):
//assumes pointers are diff...

Page 49

4 Instruction Decoding Optimizations

This chapter discusses ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon™ processor.

Page 50

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions.

Page 51

Use Load-Execute Floating-Point Instructions with Floating-Point Operands

When opera...

Page 52

Example 1 (Avoid):
    FLD QWORD PTR [foo]
    FIMUL DWORD PTR [bar]
    FIADD DWORD ...

Page 53

Example 2 (Preferred):
    05 78 56 34 12    add eax, 12345678h ;uses single byte
                      ; ...

Page 54

Replace Certain SHLD Instructions with Alternative Code

Certain instances of the SHLD instruction can be replaced by alternative code using SHR and LEA.

Page 55

Use 8-Bit Sign-Extended Displacements

Use 8-bit sign-extended displacements for conditional branches.

Page 56

Recommendations for the AMD Athlon™ Processor

For code that is optimi...

Page 57

Recommendations for AMD-K6® Family and AMD Athlon™ Processor Blende...

Page 58

NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;lea ecx, [ecx]
NOP3_EDX TEXTEQU <...

Page 59

;lea edi, [edi+00000000]
NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0>
;lea e...

Page 61

5 Cache and Memory Optimizations

This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon™ processor.

Page 62

Align Data Where Possible

In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned.
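Natural alignment means that the address of an object is a multiple of the object's size. A minimal sketch of that check, assuming the size is a power of 2 as the text requires (the function name is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of size, 0 otherwise.
 * Assumes size is a nonzero power of 2, so size - 1 is a mask
 * of the low address bits that must all be zero. */
int is_naturally_aligned(uintptr_t addr, size_t size)
{
    return (addr & (uintptr_t)(size - 1)) == 0;
}
```

For example, an 8-byte quadword at address 0x1004 is not naturally aligned, while a 4-byte doubleword at the same address is.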

Page 63

PREFETCH/W versus PREFETCHNTA/T0/T1/T2

The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent.

Page 64

MOV ECX, (-LARGE_NUM)    ;used biased index
MOV EAX, OFFSET array_a  ;get ...

Page 65

The following optimization rules were applied to this example.

■ Loops should be unrolled to make sure that the data stride per loop iteration is equal to the length of a cache line.

Page 66

Take Advantage of Write Combining

Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor.

Page 67

Store-to-Load Forwarding Restrictions

Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2).

Page 68

Narrow-to-Wide Store-Buffer Data Forwarding Restriction

If the following co...

Page 69

Example 5 (Preferred):
    MOVD [foo], MM1      ;store lower half
    PUNPCKHDQ MM1, MM1   ;get upper half into lower half
    MOVD [foo+4], MM1    ;store lower half
    ...

Page 70

One Supported Store-to-Load Forwarding Case

There is one case of a mismatched store-to-load forwarding that is supported by the AMD Athlon processor.

Page 71

Example (Preferred):
Prolog:
    PUSH EBP
    MOV EBP, ESP
    SUB ESP, SIZE...

Page 72

Example:
    struct {
        char a[5];
        long k;
        double x;
    } baz;

The structure components should be allocated (lowest to highest address) as follows:
    x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ...
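The example above describes the layout a compiler should produce. At the source level the same effect can be approximated by declaring members from largest to smallest base type. The sketch below assumes a typical ABI in which the compiler lays members out in declaration order; the struct tags are hypothetical, and the element order inside the array `a` is not something C source can change.

```c
#include <stddef.h>

/* Declaration order as written in the example. A compiler that keeps
 * this order may need padding between a[] and k, and between k and x. */
struct baz_unsorted {
    char   a[5];
    long   k;
    double x;
};

/* Members sorted from largest to smallest base type. Each member
 * starts on its natural boundary without interior padding; only
 * tail padding after a[] may remain. */
struct baz_sorted {
    double x;
    long   k;
    char   a[5];
};
```

Comparing `offsetof` values for the two declarations on a given compiler shows where the padding lands.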

Page 73

6 Branch Optimizations

While the AMD Athlon™ processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit.

Page 74

AMD Athlon™ Processor Specific Code

Example 1: Signed integer AB...
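A branch whose outcome depends on effectively random data, such as the sign of an arbitrary integer, cannot be predicted well, so it is often better to compute the result directly. The following C sketch shows one well-known branch-free absolute value. It is an illustration of the idea, not a transcription of the truncated assembly example, and the function name is hypothetical.

```c
#include <stdint.h>

/* Branch-free absolute value of a 32-bit signed integer.
 * mask is all ones when x is negative and zero otherwise, so
 * (ux ^ mask) - mask negates negative inputs and leaves the rest alone.
 * Unsigned arithmetic keeps INT32_MIN well defined: it maps to 2147483648. */
uint32_t abs_branchless(int32_t x)
{
    uint32_t ux   = (uint32_t)x;
    uint32_t mask = 0u - (ux >> 31);   /* 0xFFFFFFFF if sign bit set, else 0 */
    return (ux ^ mask) - mask;
}
```

The result is returned as an unsigned value so that the magnitude of the most negative input is representable.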

Page 75

Example 6: Increment Ring Buffer Offset
//C Code
char buf[BUFSIZE];
int a;
if (a < ...

Page 76

Replace Branches with Computation in 3DNow!™ Code

Branches negatively impact the performance of 3DNow! code.

Page 77

Example 2 (Preferred):
; r = (x < y) ? a : b
;
; in: mm0 a
; ...

Page 78

Example 2:
C code:
float x,z;
z = abs(x);
if (z >= 1) {
    z = 1...

Page 79

Example 4:
C code:
#define PI 3...

Page 80

Example 5:
C code:
#define PI 3...

Page 81

Avoid the Loop Instruction

The LOOP instruction in the AMD Athlon processor requires eight cycles to execute.

Page 82

Avoid Recursive Functions

Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code.
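Converting an end-recursive (tail-recursive) function to a loop can be sketched as follows. The factorial is chosen here only because it is short; the function names are hypothetical and the example is not taken from the guide.

```c
/* End-recursive form: the recursive call is the last action,
 * but every step still costs one CALL and one RET. */
unsigned long factorial_recursive(unsigned int n, unsigned long acc)
{
    if (n <= 1)
        return acc;
    return factorial_recursive(n - 1, acc * n);
}

/* Iterative form: the parameters become loop variables,
 * and no calls or returns are executed inside the loop. */
unsigned long factorial_iterative(unsigned int n)
{
    unsigned long acc = 1;
    while (n > 1) {
        acc *= n;
        n--;
    }
    return acc;
}
```

The recursive version is started with an accumulator of 1, for example `factorial_recursive(10, 1)`.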

Page 83

7 Scheduling Optimizations

This chapter describes how to code instructions for efficient scheduling. Guidelines are listed in order of importance.

Page 84

unrolling reduces register pressure by removing the loop counter. To completely unroll a loop, remove the loop control and replicate the loop body N times.

Page 85

Without Loop Unrolling:
    MOV ECX, MAX_LENGTH
    MOV EAX, OFFSET A
    MOV EBX, OFFSET B
$add_loop:
    FLD QWORD PTR [EAX]
    FADD QWORD PTR [EBX]
    FSTP QWORD PTR [EAX]
    ADD EAX, 8
    ADD EBX, 8
    DEC ECX
    JNZ $add_loop

The loop consists of seven instructions.

Page 86

no faster than three iterations in 10 cycles, or 6/10 floating-point adds per cycle, or 1.4 times as fast as the original loop.

Deriving Loop Control For Partially Unrolled Loops

A frequently used loop construct is a counting loop.
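The loop-control problem for a partially unrolled counting loop can be sketched in C: the main loop advances by the unroll factor, and a short tail handles the elements left over. This is an illustrative sketch with an assumed unroll factor of two and hypothetical function names, not the derivation the guide goes on to give.

```c
#include <stddef.h>

/* Rolled form of the vector add: a[i] = a[i] + b[i]. */
void add_rolled(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Unrolled by two: loop control is paid once per two additions.
 * The statement after the loop handles an odd element count. */
void add_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
    if (i < n)
        a[i] += b[i];
}
```

The condition `i + 1 < n` guarantees that both elements touched in the loop body exist, so the function is correct for any count, including zero.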

Page 87

Use Function Inlining

Overview

Make use of the AMD Athlon processor’s large 64-Kbyte instruction cache by inlining small routines to avoid procedure-call overhead.

Page 88

Always Inline Functions if Called from One Site

A function should always be inlined if it can be established that it is called from just one site in the code.

Page 89

Example 1 (Avoid):
    ADD EBX, ECX                  ;inst 1
    MOV EAX, DWORD PTR [10h]      ;inst 2 (fast address calc.)
    MOV ECX, DWORD PTR [EAX+EBX]  ;inst 3 (slow address calc...

Page 90

Example 1 (Avoid):
int a[MAXSIZE], b[MAXSIZE], c[MAXSIZE], i;
for (i=0; i < ...

Page 91

variable that starts with a negative value and reaches zero when the loop expires. Note that if the base addresses are held in registers (e...
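The biased-index idea described here, a single loop variable that starts negative and ends at zero, can be sketched in C as follows. The array length `ARRAY_LEN` and the function names are hypothetical and chosen only for this illustration.

```c
#define ARRAY_LEN 8   /* hypothetical array length for this sketch */

/* Conventional form: the index counts up from 0 and is compared
 * against the length on every iteration. */
void add_arrays_indexed(const int *a, const int *b, int *c)
{
    for (int i = 0; i < ARRAY_LEN; i++)
        c[i] = a[i] + b[i];
}

/* Biased form: each base points one past the end of its array, and a
 * single index runs from -ARRAY_LEN up to zero. Reaching zero ends the
 * loop, so the increment itself provides the loop test. */
void add_arrays_biased(const int *a, const int *b, int *c)
{
    const int *a_end = a + ARRAY_LEN;
    const int *b_end = b + ARRAY_LEN;
    int *c_end = c + ARRAY_LEN;
    for (int i = -ARRAY_LEN; i != 0; i++)
        c_end[i] = a_end[i] + b_end[i];
}
```

Only one variable changes per iteration in the biased form, instead of one pointer per array plus a counter.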

Page 93

8 Integer Optimizations

This chapter describes ways to improve integer performance through optimized programming techniques.

Page 94

Signed Division Utility

In the opt_utilities directory of the AMD documentation CD-ROM, run sdiv.exe in a DOS window to find the fastest code for signed division by a constant.

Page 95

Example 1:
;In: EAX = dividend
;Out: EDX = quotient
    XOR EDX, EDX  ;0
    CMP EAX, d    ;CF = (...

Page 96

;algorithm 1
    MOV EAX, m
    MOV EDX, dividend
    MOV ECX, EDX
    IMUL EDX
    ADD EDX, ECX
    SHR ECX,...

Page 97

Remainder of Signed Integer 2^n or -(2^n)

;IN:EAX = dividend
...

Page 98

by 11:
    LEA REG2, [REG1*8+REG1]  ;3 cycles
    ADD REG1, REG1
    ADD REG1,...

Page 99

by 26: use IMUL
by 27:
    LEA REG2, [REG1*4+REG1]  ;3 cycles
    SHL REG1, 5
    S...

Page 100

In addition, using MMX instructions increases the available parallelism. The AMD Athlon processor can issue three integer OPs and two MMX OPs per cycle.

Page 101

Ensure DF=0 (UP)

Always make sure that DF = 0 (UP) (after execution of CLD) for REP MOVS and REP STOS.

Page 102

Use XOR Instruction to Clear Integer Registers

To clear an integer register to all 0s, use “XOR reg, reg”.

Page 103

Example 4 (Left shift):
;shift operand in EDX:EAX left, shift count in ECX (cou...

Page 104

Example 7 (Division):
;_ulldiv divides two unsigned 64-bit integers, and returns
; the quotient.

Page 105

    MOV ECX, EAX             ;save quotient
    IMUL EDI, EAX            ;quotient * divisor hi-word
                             ; (low only)
    MUL DWORD PTR [ESP+20]   ;quotient * divisor lo-word
    ADD EDX, EDI             ;EDX:EAX = quotient * divisor
    SUB EBX, EAX             ;dividend_lo - (quot...

Page 106

$r_two_divs:
    MOV ECX, EAX  ;save dividend_lo in ECX
    MOV EAX, EDX  ;get dividend_hi ...

Page 107

Efficient Implementation of Population Count Function

Population count is an operation that determines the number of set bits in a bit string.

Page 108

Step 3

For the first time, the value in each k-bit field is small enough that adding two k-bit fields results in a value that still fits in the k-bit field.

Page 109

    ADD EAX, EDX  ;x = (w & 0x33333333) + ((w >> ...
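The field-wise summation that these steps describe is a well-known parallel bit-counting technique, and it can be written compactly in C. The sketch below follows the variable names `w` and `x` visible in the assembly comment; the function name and the name `y` are assumptions, and the code is an illustration rather than a transcription of the guide's listing.

```c
#include <stdint.h>

/* Parallel population count of a 32-bit word. */
uint32_t popcount32(uint32_t v)
{
    /* Each 2-bit field now holds the number of set bits it contained. */
    uint32_t w = v - ((v >> 1) & 0x55555555u);
    /* Add neighbouring 2-bit fields into 4-bit fields. */
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    /* Add neighbouring 4-bit fields into 8-bit fields; a byte count is at
     * most 8, so the sum fits and one mask after the add is enough. */
    uint32_t y = (x + (x >> 4)) & 0x0F0F0F0Fu;
    /* The multiply accumulates all four byte counts into the top byte. */
    return (y * 0x01010101u) >> 24;
}
```

The whole computation is branch free and uses only shifts, masks, adds, and one multiply.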

Page 110

;algorithm 1
    MOV EDX, dividend
    MOV EAX, m
    MUL EDX
    ADD EAX, m
    AD...

Page 111

/* Generate m, s for algorithm 1. Based on: Magenheimer, D.J.; et al: “Integer Multiplication and Division on the HP Precision Architecture”.

Page 112

;algorithm 1
    MOV EAX, m
    MOV EDX, dividend
    MOV ECX, EDX
    IMUL EDX
    ...

Page 113

9 Floating-Point Optimizations

This chapter details the methods used to optimize floating-point code to the pipelined floating-point unit (FPU).

Page 114

Use FFREEP Macro to Pop One Register from the FPU Stack

In FPU intensive code, frequently accessed data is often pre-loaded at the bottom of the FPU stack before processing floating-point data.

Page 115

These instructions are much faster than the classical approach using FSTSW, because FSTSW is essentially a serializing instruction on the AMD Athlon processor.

Page 116

Minimize Floating-Point-to-Integer Conversions

C++, C, and Fortran define floating-point-to-integer conversions as truncating.

Page 117

FPU into truncating mode, and performing all of the conversions before restoring the original control word. The speed of the above code is somewhat dependent on the nature of the code surrounding it.

Page 118

Example 3 (Potentially faster):
    MOV ECX, DWORD PTR [X+4]  ;get upper 32 bits of double
    XOR EDX, EDX              ;i = 0
    MOV EAX, ECX              ;save sign bit
    AND ECX, 07FF00000h       ;isolate exponent field
    CMP ECX, 03FF00000h       ;if abs(x) < 1...

Page 119

Floating-Point Subexpression Elimination

There are cases which do not require an FXCH instruction after every instruction to allow access to two new stack entries.

Page 120

If an “argument out of range” is detected, a range reduction subroutine is invoked which reduces the argument to less than 2^63 before the instruction is attempted again.

Page 121

Since out-of-range arguments are extremely uncommon, the conditional branch will be perfectly predicted, and the other instructions used to guard the trigonometric instruction can execute in parallel to it.

Page 123

10 3DNow!™ and MMX™ Optimizations

This chapter describes 3DNow! and MMX code optimization techniques for the AMD Athlon™ processor.

Page 124

FEMMS instruction is supported for backward compatibility with AMD-K6 family processors, and is aliased to the EMMS instruction.

Pagina 125

Use 3DNow!™ Instructions for Fast Division
Pipelined Pair of 24-Bit Precision Divides
This divide operation.

Pagina 126

Use 3DNow!™ Instructions for Fast Square Root and Recip.

Pagina 127

Use MMX™ PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
Newton-Raphson Reciprocal Square R.

Pagina 128

3DNow!™ and MMX™ Intra-Operand Swapping
Example:
PXOR MM2, MM2 ; 0 | 0
MOVD MM0, [ab] ; 0 0 | b a
MOVD MM1, [.

Pagina 129

Fast Conversion of Signed Words to Floating-Point
In many applications there is a need to quickly convert data consisting of packed 16-bit signed integers into floating-point numbers.
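As a point of reference for the conversion described above, the following C sketch shows the plain scalar operation. It does not use the MMX or 3DNow! sequence from the guide, which is not shown in this excerpt; the function name `words_to_floats` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: converts n 16-bit signed integers to floats.
   Every int16_t value is exactly representable as a float. */
void words_to_floats(const int16_t *src, float *dst, size_t n)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        dst[i] = (float)src[i];
    }
}
```

A reference routine of this kind is mainly useful for checking the results of an optimized version.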

Pagina 130

cycle bypassing penalty, and another one cycle penalty if the result goes to a 3DNow! operation.

Pagina 131

Use MMX™ Instructions for Block Copies and Block Fills
For moving or filling small blocks of data (e.g.

Pagina 132

$xfer:
movq mm0, [eax]
add edx, 64
movq mm1, [eax+8]
add .

Pagina 133

AMD Athlon™ Processor Specific Code
The following .

Pagina 134

/* block fill (destination QWORD aligned) */
__asm {
mov.

Pagina 135

Use MMX™ PCMPEQD to Set All Bits in an MMX™ R.

Pagina 136

Optimized Matrix Multiplication
/* Function XForm performs a fully generalized 3D transform on an array of vertices pointed to by "v" and stores the transformed vertices in the location pointed to by "res".

Pagina 137

$$xform:
ADD EBX, 16 ;res++
MOVQ MM0, QWORD PTR [EDX] ;v->y | v->x
MOVQ MM1, Q.

Pagina 138

Efficient 3D-Clipping Code Computation Using 3DNow!™ Instructions
Clipping is one of the major activities occurring in a 3D graphics pipeline.

Pagina 139

;;
;; DESTROYS MM0,MM1,MM2,MM3,MM4
PXOR MM0, MM0 ; 0 | 0
MOVQ .

Pagina 140

Use 3DNow!™ PAVGUSB for MPEG-2 Motion Compensation
Example 1 (Avoid):
MOV ESI, DWORD PTR Src_MB
MOV EDI, DWORD PTR Dst_M.

Pagina 141

The following code fragment uses the 3DNow! PAVGUSB instruction to perform.
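For orientation, the per-byte operation of PAVGUSB can be sketched in scalar C. The sketch assumes the usual definition of the instruction, a rounded average (a + b + 1) >> 1 of unsigned bytes; the function names `avg_u8` and `avg_stream` are hypothetical and are not taken from the guide.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two unsigned bytes, computed in a wider type so the
   intermediate sum cannot overflow. */
uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Applies the rounded average to two byte streams of length n. */
void avg_stream(const uint8_t *a, const uint8_t *b, uint8_t *dst, size_t n)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        dst[i] = avg_u8(a[i], b[i]);
    }
}
```

The instruction performs this operation on eight packed bytes at once, which is why it suits the averaging of byte streams.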

Pagina 142

Complex Number Arithmetic
Complex numbers have a “real” part and an “imaginary” part. Multiplying complex numbers (ex.
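The multiplication referred to above follows the standard identity (a + bi)(c + di) = (ac - bd) + (ad + bc)i. A minimal scalar sketch in C, with the hypothetical names `complex_f` and `complex_mul`, might look like this; it is not the 3DNow! sequence from the guide.

```c
/* Hypothetical type: a complex number with single-precision parts. */
typedef struct {
    float re;  /* real part */
    float im;  /* imaginary part */
} complex_f;

/* (a + bi)(c + di) = (ac - bd) + (ad + bc)i */
complex_f complex_mul(complex_f x, complex_f y)
{
    complex_f r;

    r.re = x.re * y.re - x.im * y.im;
    r.im = x.re * y.im + x.im * y.re;
    return r;
}
```

Each product needs four multiplies and two additions, which is what makes the operation a candidate for packed arithmetic.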

Pagina 143

11 General x86 Optimization Guidelines
This chapter describes general code optimization techniques.

Pagina 144

Dependencies
Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.
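One common way to spread out true dependencies is to split a single accumulator into several independent ones. The C sketch below illustrates the idea under that assumption; the function names are hypothetical, and whether a compiler keeps the two chains separate depends on the compiler and its settings.

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one. */
long sum_serial(const int *v, size_t n)
{
    long s = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        s += v[i];
    }
    return s;
}

/* Two accumulators: two independent dependency chains that can be
   executed in parallel, combined once at the end. */
long sum_two_chains(const int *v, size_t n)
{
    long s0 = 0;
    long s1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n) {
        s0 += v[i];  /* leftover element when n is odd */
    }
    return s0 + s1;
}
```

Both functions return the same sum; only the length of the dependency chain differs.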

Pagina 145

Appendix A AMD Athlon™ Processor Microarchitecture
Introduction
When discussing processor design, it is important to understand the following terms: architecture, microarchitecture, and design implementation.

Pagina 146

AMD Athlon™ Processor Microarchitecture
The innovative AMD Athlo.

Pagina 147

Figure 1. AMD Athlon™ Processor Block Diagram
Instruction Cache
The out-of-order execute engine of the AMD Athlon processor contains a very large 64-Kbyte L1 instruction cache.

Pagina 148

replacement is based on a least-recently used (LRU) replacement algorithm. The L1 instruction cache has an associated two-level translation look-aside buffer (TLB) structure.

Pagina 149

return stack. Subsequent RETs pop a predicted return address off the top of the stack.
Early Decoding
The DirectPath and VectorPath decoders perform early-decoding of instructions into MacroOPs.

Pagina 150

Instruction Control Unit
The instruction control unit (ICU) is the control center for the AMD Athlon processor.

Pagina 151

Integer Scheduler
The integer scheduler is based on a three-wide queuing system (also known as a reservation station) that feeds three integer execution positions or pipes.

Pagina 152

Each of the three IEUs is general purpose in that each performs logic functions, arithmetic functions, conditional functions, divide step functions, status flag multiplexing, and branch resolutions.

Pagina 153

Floating-Point Execution Unit
The floating-point execution unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path.

Pagina 154

Load-Store Unit (LSU)
The load-store unit (LSU) manages data load and store accesses to the L1 data cache and, if required, to the backside L2 cache or system memory.

Pagina 155

L2 Cache Controller
The AMD Athlon processor contains a very flexible onboard L2 controller. It uses an independent backside bus to access up to 8-Mbytes of industry-standard SRAMs.

Pagina 156


Pagina 157

Appendix B Pipeline and Execution Unit Resources Overview
The AMD Athlon™ processor contains two independent execution pipelines: one for integer operations and one for floating-point operations.

Pagina 158

Fetch and Decode Pipeline Stages
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
The most common x86 instructions flow through the DirectPath pipeline stages and are decoded by hardware.

Pagina 159

Cycle 1 – FETCH
The FETCH pipeline stage calculates the address of the next x86 instruction window to fetch from the processor caches or system memory.

Pagina 160

operands mapped to registers. Both integer and floating-point MacroOPs are placed into the ICU.

Pagina 161

Integer Pipeline Stages
Cycle 7 – SCHED
In the scheduler (SCHED) pipeline stage, the scheduler buffers can contain MacroOPs that are waiting for integer operands from the ICU or the IEU result bus.

Pagina 162

Floating-Point Pipeline Stages
The floating-point unit (FPU) is implemented as a coprocessor that has its own out-of-order control in addition to the data path.

Pagina 163

Cycle 7 – STKREN
The stack rename (STKREN) pipeline stage in cycle 7 receives up to three MacroOPs from IDEC and maps stack-relative register tags to virtual register tags.

Pagina 164

Execution Unit Resources
Terminology
The execution units operate with two types of register values: operands and results.

Pagina 165

Integer Pipeline Operations
Table 2 shows the category or type of operations handled by the integer pipeline. Table 3 shows examples of the decode type.

Pagina 166

Floating-Point Pipeline Operations
Table 4 shows the category or type of operations handled by the floating-point execution units. Table 5 shows examples of the decode types.

Pagina 167

Load/Store Pipeline Operations
The AMD Athlon processor decodes any instruction that references memory into primitive load/store operations.

Pagina 168

Code Sample Analysis
The samples in Table 7 on page 153 and Table 8 on page 154 show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints.

Pagina 169

Table 7. Sample 1 – Integer Register Operations
Instruction Number Decode Pipe Deco.

Pagina 170

Table 8. Sample 2 – Integer Register and Memory Load Operations
Instruc Num Decode Pip.

Pagina 171

Appendix C Implementation of Write Combining
Introduction
This appendix describes the memory write-combining feature as implemented in the AMD Athlon™ processor family.

Pagina 172

Write-Combining Definitions and Abbreviations
This appendix uses .

Pagina 173

signature in register EAX, where EAX[11–8] contains the instruction family code. For the AMD Athlon processor, the instruction family code is six.
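Extracting the family code from the signature can be sketched in C as shown below. The sketch only covers the bit-field extraction named in the text, EAX[11–8]; how the signature is obtained is not shown, and the names `cpu_family_code` and `is_family_six` are hypothetical.

```c
#include <stdint.h>

/* Returns the instruction family code held in bits 11-8 of the
   signature value that was read from EAX. */
unsigned cpu_family_code(uint32_t eax_signature)
{
    return (unsigned)((eax_signature >> 8) & 0xFu);
}

/* Per the text, the AMD Athlon processor reports family code six. */
int is_family_six(uint32_t eax_signature)
{
    return cpu_family_code(eax_signature) == 6u;
}
```

For example, a signature value of 0612h (chosen here purely for illustration) yields a family code of six.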

Pagina 174

Write-Combining Operations
Table 9. Write Combining Completion Events
Event: Non-WB write outside of current buffer
Comment: The first non-WB write to a different cache block address closes combining for previous writes.

Pagina 175

Sending Write-Buffer Data to the System
Once write combining is closed for a 64-byte write buffer, the contents of the write buffer are eligible to be sent to the system as one or more AMD Athlon system bus commands.

Pagina 176


Pagina 177

Appendix D Performance-Monitoring Counters
This chapter describes how to use the AMD Athlon™ processor performance monitoring counters.

Pagina 178

Performance Counter Usage
These registers can be read from and written to using the RDMSR and WRMSR instructions, respectively. The PerfEvtSel[3:0] registers are located at MSR locations C001_0000h to C001_0003h.

Pagina 179

Unit Mask Field (Bits 8–15)
These bits are used to further qualify the event selected in the event select field. For example, for some cache events, the mask is used as a MESI-protocol qualifier of cache states.
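The placement of the unit mask within a PerfEvtSel value can be sketched in C. The unit mask position (bits 8–15) and the MSR locations C001_0000h to C001_0003h come from the text; the assumption that the event select field occupies bits 0–7 is not confirmed by this excerpt, and the function names are hypothetical. All other control bits are left at zero.

```c
#include <stdint.h>

/* MSR location of PerfEvtSel0; PerfEvtSel1..3 follow consecutively. */
#define PERF_EVT_SEL_BASE 0xC0010000u

/* Returns the MSR location of PerfEvtSel[n], n in 0..3. */
uint32_t perf_evt_sel_addr(unsigned n)
{
    return PERF_EVT_SEL_BASE + (uint32_t)(n & 3u);
}

/* Combines an event number (assumed to sit in bits 0-7) with a unit
   mask placed in bits 8-15. */
uint64_t perf_evt_sel_value(uint8_t event, uint8_t unit_mask)
{
    return (uint64_t)event | ((uint64_t)unit_mask << 8);
}
```

Writing the value to the MSR requires the privileged WRMSR instruction, which is outside the scope of this sketch.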

Pagina 180

greater than or equal to the counter mask. Otherwise if this field is zero, then the counter increments by the total number of events.

Pagina 181

65h BU
1xxx_xxxxb = reserved
x1xx_xxxxb = WB
xx1x_xxxxb = WP
xxx1_xxxxb = WT
bits 11–10 .

Pagina 182

7Ah BU Cycles that at least one fill request waited to use the L2
80h PC Instruction c.

Pagina 183

PerfCtr[3:0] MSRs (MSR Addresses C001_0004h–C001_0007h)
The performance-counter MSRs contain the event or duration counts for the selected events being counted.

Pagina 184

allows writing both positive and negative values to the performance counters. The performance counters may be initialized using a 64-bit signed integer in the range -2^47 to +2^47.
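A range check for a counter start value can be sketched in C as follows. The limits are taken from the text; whether the endpoints themselves are allowed is not stated in this excerpt, so treating them as inclusive is an assumption. The names `perf_ctr_init_in_range` and `perf_ctr_start_for` are hypothetical.

```c
#include <stdint.h>

/* 2^47, the magnitude limit given in the text. */
#define PERF_CTR_LIMIT ((int64_t)1 << 47)

/* Returns 1 if v lies in the stated range (endpoints assumed inclusive). */
int perf_ctr_init_in_range(int64_t v)
{
    return v >= -PERF_CTR_LIMIT && v <= PERF_CTR_LIMIT;
}

/* Hypothetical helper: a negative start value of -n, so that the count
   reaches zero after n events have been counted. */
int64_t perf_ctr_start_for(int64_t n)
{
    return -n;
}
```

A caller would pass the result of `perf_ctr_start_for` through `perf_ctr_init_in_range` before writing it to a counter.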

Pagina 185

The initialization and start counters procedure sets the PerfEvtSel0 and/or PerfEvtSel1 MSRs for the events to be counted and the method used to count them and initializes the counter MSRs (PerfCtr[3:0]) to starting counts.

Pagina 186

An event monitor application utility or another application program can read the collected performance information of the profiled application.

Pagina 187

Appendix E Programming the MTRR and PAT
Introduction
The AMD Athlon™ processor includes a set of memory type and range registers (MTRRs) to control cacheability and access to specified memory regions.

Pagina 188

Memory Type Range Register (MTRR) Mechanism
There are two types of address ranges: fixed and variable. (See Figure 12.) For each address range, there is a memory type.

Pagina 189

Figure 12. MTRR Mapping of Physical Memory (physical addresses 0 to FFFFFFFFh)

Pagina 190

Memory Types
Five standard memory types are defined by the AMD Athlon processor: writethrough (WT), writeback (WB), write-protect (WP), write-combining (WC), and uncacheable (UC).

Pagina 191

MTRR Default Type Register Format. The MTRR default type register is defined as follows.
Figure 14. MTRR Default Type Register Format
E: MTRRs are enabled when set.

Pagina 192

Note that if two or more variable memory ranges match then the interactions are defined as follows:
1. If the memory types are identical, then that memory type is used.

Pagina 193

not affected by this issue, only the variable range (and MTRR DefType) registers are affected.

Pagina 194

Page Attribute Table (PAT)
Accessing the PAT
A 3-bit index consisting of the PATi, PCD, and PWT bits of t.
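Forming the 3-bit index can be sketched in C. The sketch assumes that the bits are combined in the order listed, with PATi as the most significant bit and PWT as the least significant; the function name `pat_index` is hypothetical.

```c
/* Builds the 3-bit PAT index from the three single-bit inputs.
   Assumed bit order: PATi (bit 2), PCD (bit 1), PWT (bit 0). */
unsigned pat_index(unsigned pati, unsigned pcd, unsigned pwt)
{
    return ((pati & 1u) << 2) | ((pcd & 1u) << 1) | (pwt & 1u);
}
```

The resulting value, 0 through 7, selects one of eight PAT entries.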

Pagina 195

Table 15. Effective Memory Type Based on PAT and MTRRs
PAT Memory Type MTRR M.

Pagina 196

Table 16. Final Output Memory Types
Input Memory Type Output Memory Type Note RdMem WrMem Effective.

Pagina 197

● ● CD - ●● CD ●● WC - ●● WC ●● WT - ●● WT ●● WP - ●● WP ●● WB - ● ● WT 4 ●● - ●● ● CD 2
Notes: 1 .

Pagina 198

MTRR Fixed-Range Register Format
The memory types defined for memory segments defined in each of the MTRR fixed-range registers are defined in Table 17 (Also See “Standard MTRR Types and Properties” on page 176.

Pagina 199

Variable-Range MTRRs
A variable MTRR can be programmed to start at address 0000_0000h because the fixed MTRRs always override the variable ones.

Pagina 200

Figure 17. MTRRphysMaskn Register Format
Note: A software attempt to write to reserved bits will generate a general protection exception.

Pagina 201

MTRR MSR Format
This table defines the model-specific registers related to the memory type range register implementation. All MTRRs are defined to be 64 bits.

Pagina 202


Pagina 203

Appendix F Instruction Dispatch and Execution Resources
This chapter describes the MacroOPs generated by each decoded instruction, along with the relative static execution latencies of these groups of operations.

Pagina 204

■ disp16/32: 16-bit or 32-bit displacement value
■ disp32/4.

Pagina 205

ADC mreg8, reg8 10h 11-xxx-xxx DirectPath
ADC mem8, reg8 10h mm-xx.

Pagina 206

AND mem8, reg8 20h mm-xxx-xxx DirectPath
AND mreg16/32, reg16/3.

Pagina 207

BT mem16/32, imm8 0Fh BAh mm-100-xxx DirectPath
BTC mreg16/32, reg.

Pagina 208

CMOVE/CMOVZ reg16/32, reg16/32 0Fh 44h 11-xxx-xxx DirectPath
CMOVE.

Pagina 209

CMP EAX, imm16/32 3Dh DirectPath
CMP mreg8, imm8 80h 11-111-xxx.

Pagina 210

DIV EAX, mreg16/32 F7h 11-110-xxx VectorPath
DIV EAX, mem16/32 .

Pagina 211

INC mreg8 FEh 11-000-xxx DirectPath
INC mem8 FEh mm-000-xxx DirectPa.

Pagina 212

JP/JPE near disp16/32 0Fh 8Ah DirectPath
JNP/JPO near disp16/32 .

Pagina 213

LOOPE/LOOPZ disp8 E1h VectorPath
LOOPNE/LOOPNZ disp8 E0h Vect.

Pagina 214

MOV EDX, imm16/32 BAh DirectPath
MOV EBX, imm16/32 BBh DirectPath
MO.

Pagina 215

NOT mem8 F6h mm-010-xx DirectPath
NOT mreg16/32 F7h 11-010-xxx .

Pagina 216

POP EBX 5Bh VectorPath
POP ESP 5Ch VectorPath
POP EBP 5Dh VectorP.

Pagina 217

RCL mreg8, 1 D0h 11-010-xxx DirectPath
RCL mem8, 1 D0h mm-010-x.

Pagina 218

ROL mreg16/32, 1 D1h 11-000-xxx DirectPath
ROL mem16/32, 1 D1h mm-0.

Pagina 219

SBB mreg16/32, reg16/32 19h 11-xxx-xxx DirectPath
SBB mem16/32,.

Pagina 220

SETS mreg8 0Fh 98h 11-xxx-xxx DirectPath
SETS mem8 0Fh 98h mm-xxx-x.

Pagina 221

SHR mem16/32, imm8 C1h mm-101-xxx DirectPath
SHR mreg8, 1 D0h 11-.

Pagina 222

SUB reg8, mreg8 2Ah 11-xxx-xxx DirectPath
SUB reg8, mem8 2Ah mm-x.

Pagina 223

XADD mreg8, reg8 0Fh C0h 11-100-xxx VectorPath
XADD mem8, reg8 0.

Pagina 224

Table 20. MMX™ Instructions
Instruction Mnemonic Prefix Byte(.

Pagina 225

PANDN mmreg1, mmreg2 0Fh DFh 11-xxx-xxx DirectPath FADD/FMUL
P .

Pagina 226

PSRAW mmreg1, mmreg2 0Fh E1h 11-xxx-xxx DirectPath FADD/FMUL
PSR.

Pagina 227

PUNPCKHDQ mmreg1, mmreg2 0Fh 6Ah 11-xxx-xxx DirectPath FADD/FMU.

Pagina 228

PMINSW mmreg, mem64 0Fh EAh mm-xxx-xxx DirectPath FADD/FMUL
PMI .

Pagina 229

FCMOVB ST(0), ST(i) DAh C0-C7h VectorPath
FCMOVE ST(0), ST(i) DAh C .

Pagina 230

FIADD [mem32int] DAh mm-000-xxx VectorPath
FIADD [mem16int] DEh mm-00.

Pagina 231

FLDCW [mem16] D9h mm-101-xxx VectorPath
FLDENV [mem14byte] D9h.

Pagina 232

FSTCW [mem16] D9h mm-111-xxx VectorPath
FSTE .

Pagina 233

Table 23. 3DNow!™ Instructions
Instruction Mnemonic Prefix Byte.

Pagina 234

PFRSQRT mmreg, mem64 0Fh, 0Fh 97h mm-xxx-xxx DirectPath FMUL
PF.

Pagina 235

Appendix G DirectPath versus VectorPath Instructions
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions.

Pagina 236

DirectPath Instructions
Table 25. DirectPath Integer Instructions
Instruction Mnemonic
ADC mreg8, reg8
ADC mem8, .

Pagina 237

CMOVBE/CMOVNA reg16/32, reg16/32
CMOVBE/CMOVNA reg16/32, mem16/32
CMOVE/CMOVZ reg16/.

Pagina 238

JNO short disp8
JB/JNAE short disp8
JNB/JAE short disp8
JZ/JE short disp8
JNZ/JNE s.

Pagina 239

MOV mem16/32, imm16/32
MOVSX reg16/32, mreg8
MOVSX reg16/32, mem8
MOVSX reg32, mreg16
MO.

Pagina 240

ROL mreg8, CL
ROL mem8, CL
ROL mreg16/32, CL
ROL mem16/32, CL
ROR mreg8, imm8
ROR mem8, im.

Pagina 241

SETL/SETNGE mreg8
SETL/SETNGE mem8
SETGE/SETNL mreg8
SETGE/SETNL mem8
SETLE/SETNG .

Pagina 242

XOR reg16/32, mem16/32
XOR AL, imm8
XOR EAX, imm16/32
XOR mreg8, imm8
XOR mem8, imm8
XOR mreg16/32, imm16/32
XOR mem16/32, imm16/32
XOR mreg16/32, imm8 (sign extended)
XOR mem16/32, imm8 (sign extended)

Pagina 243

Table 26. DirectPath MMX™ Instructions
Instruction Mnemonic
EMMS
MOVD mmreg, mem32
MO.

Pagina 244

PSRLD mmreg, imm8
PSRLQ mmreg1, mmreg2
PSRLQ mmreg, mem64
PSRLQ mmreg, imm8
PSRLW.

Pagina 245

Table 28. DirectPath Floating-Point Instructions
Instruction Mnemonic
FABS
FADD ST, ST.

Pagina 246

FSUB ST(i), ST
FSUBP ST, ST(i)
FSUBR [mem32real]
FSUBR [mem64real]
FSUBR ST, ST(i)
FSUBR ST(i), ST
FSUBRP ST(i), ST
FTST
FUCOM
FUCOMP
FUCOMPP
FWAIT
FXCH

Pagina 247

VectorPath Instructions
The following tables contain VectorPath instructions, .

Pagina 248

DIV EAX, mem16/32
ENTER
IDIV mreg8
IDIV mem8
IDIV EAX, mreg16/32
IDIV EAX, mem16/3.

Pagina 249

MUL EAX, mem32
OUT imm8, AL
OUT imm8, AX
OUT imm8, EAX
OUT DX, AL
OUT DX, AX
OUT DX,.

Pagina 250

STI
STOSB mem8, AL
STOSW mem16, AX
STOSD mem32, EAX
STR mreg16
STR mem16
SYSCALL
SY.

Pagina 251

Table 32. VectorPath Floating-Point Instructions
Instruction Mnemonic
F2XM1
FBLD [mem80.

Pagina 252


Pagina 253

Index
Numerics
3DNow!™ Instructions . . . 10, 107
3DNow! and MMX™ Intra-Operand Swapping . . . 112
Clipping .
