Ilkyu Song - SPO600

Sunday 22 April 2018

SPO600 Project - Stage 3

I chose Redis (Remote Dictionary Server) for my project in Stage 1. Redis is open source software developed by Salvatore Sanfilippo. It is a key-value store that can be used as either a volatile or a persistent store, and it stores and manages its data in memory. Let's look at the benefits and data types of Redis.

1. The advantages of Redis

Advantage	Description
Specialized for processing data in lists and arrays.	• The value supports several data types such as string, list, set, sorted set, and hash. • Inserting and deleting list data is about 10 times faster than in MySQL.
Redis transactions are atomic.	Atomic processing prevents data mismatches when several processes request an update to the same key at the same time.
Persistent data preservation while utilizing memory.	• Data is not deleted unless it is explicitly deleted with a command or an expiry is set. • The snapshot function saves the contents of memory as an *.rdb file so that the data can be restored to that point in time.
Multiple server configurations.	Consistent hashing or master-slave configuration

2. Redis provides five data types, and there are many commands for each data type.

Data Type	Description
String	• Not only text strings can be stored; binary data can also be saved (note that Redis has no separate integer or real number type). • The maximum size of a value stored under a key is 512 MB.
List	• We can think of it as an array. • The maximum number of elements in a key is 4,294,967,295. • Depending on the conditions set in the configuration file, a list is encoded as a zip list or, once it grows larger than those conditions, as a linked list.
Set	• An unordered collection with no duplicate data in a key. • The time spent adding, removing, and checking existence is constant regardless of the number of elements in the set. • The maximum number of elements in a key is 4,294,967,295.
Sorted sets	• Sorted sets are called the most advanced Redis data type. • Adding, removing, and updating elements is very fast, taking time proportional to the logarithm of the number of elements. • They can be used in ranking systems. • Each element has a real value called a score, and elements are sorted in ascending order by score. • There is no duplicate data in the key, but score values can be duplicated.
Hashes	• Similar to lists, consisting of a series of "field names" and "field values". • The maximum number of field-value pairs that can be contained in a key is 4,294,967,295.

I compiled the Redis benchmark program with different compile options and ran it on the aarchie and x86 servers with different command counts. The results below show the number of commands executed per second. The aarchie server is a bit faster than the x86 server, although the results do not differ much from the test in Stage 1. However, the specifications of the two servers are so different that a simple comparison is difficult. Some developers and architects tend to judge performance from the code alone, without considering hardware specs. However, the first thing to consider when tuning a database or optimizing code is the hardware specification.

1. aarchie

2. x86

Moreover, I ran the benchmark once again in Stage 2. I chose the Redis client library source in Stage 2 and benchmarked it. Then I used inline assembler to optimize the code. However, assembly does not guarantee better optimization than C. It is better to try optimizing in C first. The two figures below show the results of using the original source and the inline assembler version. The two results are very similar.

While working on Stage 3, I have been thinking about code optimization again. Code optimization is a program transformation technique that improves code so that it consumes fewer resources (i.e., CPU and memory) and results in faster machine code. I think I should keep this meaning in mind. At first I only tried to transform the code using the little I knew about optimization. I thought that changing the code alone would speed up execution, and that changing the compile options would speed up the program. However, in a simple program, the difference is small.

I have to keep a few things in mind for code optimization. First of all, I need to know exactly the OS or platform on which my program will run. (Actually, the library I chose in Stage 2 did not run on x86.) Second, I should know the specs of the machine on which my program will run, so that I can give the user the minimum recommended specification for my program. Finally, I should benchmark repeatedly. To make a good program, I have to test it many times. If I follow these three things, I will be able to develop a program that is close to optimal.

As I proceeded with this course project, I gained not only knowledge of code optimization but also experience. In programming, experience is as important as coding skill, and this experience will be very beneficial to me. This project also taught me how to work with an upstream project. Code optimization and portability are not simply a matter of changing the code. I have to be knowledgeable about operating systems, platforms, and hardware.

Tuesday 10 April 2018

SPO600 Project - Stage 2

In Stage 2, I am going to look more deeply at Redis, which I selected in Stage 1. Redis (Remote Dictionary Server) is an in-memory key-value store. It is faster than disk-based databases because it processes data directly in memory. As for the data types that can be stored, other stores provide only primitive types, while Redis provides data types such as String, Set, Hash, and List, along with basic functions such as searching, adding, and deleting data.
First, I chose a Redis client library to test the Redis database: Hiredis. Hiredis is a C client library that provides the APIs needed to work with Redis. I will send 1,000,000 commands to the server to benchmark it using Hiredis. For this, I picked a C benchmark source from GitHub and modified it. The GitHub URLs are below.

Hiredis Client Library Git: https://github.com/redis/hiredis
Benchmark Git: https://github.com/stefanwille/redis-client-benchmarks

This is the source for the benchmark.

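#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <hiredis/hiredis.h>   // adjust the path when building inside the hiredis source tree

const int N = 1000000;

int main() {
    printf("Connecting...\n");
    redisContext *redis = redisConnect("localhost", 6379);

    clock_t start, end;
    float ftime;

    if (redis == NULL) {
        puts("Could not allocate a Redis context");
        return 1;
    }

    if (redis->err) {
        // Print the error text, then dump it byte by byte in hex
        puts(redis->errstr);
        char *p = redis->errstr;
        while (*p) {
            printf("%x ", (int)(*p++));
        }
        printf("\n");
    }
    else {
        start = clock();

        char *cmd;
        int len;

        // Queue N commands in the output buffer (pipelining)
        for (int i = 0; i < N; i++) {
            // HSET needs a key, a field, and a value; hiredis sends
            // __rand_int__ literally, it is not replaced by a random number
            len = redisFormatCommand(&cmd, "HSET myset:__rand_int__ element:__rand_int__ xxx");

            redisAppendFormattedCommand(redis, cmd, len);
            free(cmd);   // the formatted command is allocated on every call
        }

        // Read exactly one reply per queued command
        for (int i = 0; i < N; i++) {
            redisReply *reply;
            int status = redisGetReply(redis, (void**)&reply);
            assert(status == REDIS_OK);

            freeReplyObject(reply);
        }

        end = clock();

        ftime = (float)(end - start) / CLOCKS_PER_SEC;

        printf("Running time: %f sec.\n", ftime);
    }

    redisFree(redis);
    return 0;
}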

I analyzed the Hiredis source to find a function to optimize. Unfortunately, it was not easy to find a place to optimize, so I decided to apply the optimization techniques I learned in this course. The benchmark source above calls the redisFormatCommand function of the Hiredis library, and redisFormatCommand calls the redisvFormatCommand function. I found a place in that function to optimize and implemented the change. Of course, this may not be the right way to optimize. However, I wanted to check whether performance can be improved by optimizing a small part.

The source below is a modified source of Hiredis.

    while(*c != '\0') {
        if (*c != '%' || c[1] == '\0') {
            if (*c == ' ') {
                if (touched) {


                    //Change Source for optimization
                    int result;
                    __asm__ __volatile__("add %0,%1,%2 \n\t":"=r" (result) : "r"(argc), "r"(1));
                    newargv = realloc(curargv, sizeof(char*)*(result));

                    //newargv = realloc(curargv,sizeof(char*)*(argc+1)); - Original Source
                    if (newargv == NULL) goto memory_err;
                    curargv = newargv;
                    curargv[argc++] = curarg;
                    totlen += bulklen(sdslen(curarg));

                    /* curarg is put in argv so it can be overwritten. */
                    curarg = sdsempty();
                    if (curarg == NULL) goto memory_err;
                    touched = 0;

Now let's benchmark using the Hiredis library. The first picture shows the run before optimization, and the second shows the results after optimization. The results did not change dramatically because I modified a very small part. However, you can see that performance improved a little, since the change sits inside the while statement and 1,000,000 commands were executed. I benchmarked the HSET command, compiled with different compile options, and checked the speed. The -O3 compile option gave the most optimized result. However, there is not much difference between the options.

The result of original

The result of optimization.

I modified the source code to run on the x86 platform.

    while(*c != '\0') {
        if (*c != '%' || c[1] == '\0') {
            if (*c == ' ') {
                if (touched) {

                    //Change Source for optimization
                    int result;
                    __asm__("addl %%ebx, %%eax;" : "=a" (result) : "a" (argc), "b" (1));
                    newargv = realloc(curargv, sizeof(char*)*(result));

                    //newargv = realloc(curargv,sizeof(char*)*(argc+1)); - Original Source
                    if (newargv == NULL) goto memory_err;
                    curargv = newargv;
                    curargv[argc++] = curarg;
                    totlen += bulklen(sdslen(curarg));

                    /* curarg is put in argv so it can be overwritten. */
                    curarg = sdsempty();
                    if (curarg == NULL) goto memory_err;
                    touched = 0;

I tried hard to run the Hiredis library and this code on the x86 platform, but I did not find a way. I then tried the Windows platform, but the Redis libraries were not suitable for running on Windows. The following picture shows the errors from running the Hiredis library on the x86 platform. Even the example sources provided with the library could not be compiled.

I learned a few important things in this stage. First, hash-based data is very fast and useful. In addition, hash data can be used on any platform, but only if the library is configured properly. I was also able to understand the process of working together on GitHub, which is a very useful and essential skill for programmers. Best of all, I found that even a small amount of code optimization can improve the performance of a system.

Sunday 18 March 2018

SPO600 Project - Stage 1

REDIS (Remote Dictionary Server)

I chose the Redis open source package for the SPO600 Project Stage 1.
Redis (Remote Dictionary Server) is a NoSQL database for storing and managing non-relational data in a 'key-value' structure. It was first developed by Salvatore Sanfilippo in 2009, and Redis Labs has been supporting it since 2015. It is a memory-based DBMS that loads all data into memory and processes it there. Redis is released under the BSD license. According to DB-Engines.com's monthly rankings, Redis is the most popular key-value store.

Redis supports various data types.

String: A regular string of up to 512 MB. It can hold not only text but also integers, floating-point numbers, and binary data such as JPEG files.

Sets: An unordered collection of unique strings. Sets support operations such as intersection and difference, and these operations are very fast.

Sorted Set: A Set with an added field named "Score". It is otherwise similar to a Set, and adding, deleting, and updating elements is very fast.

Hashes: A data type that stores pairs of fields and string values under a single key.

List: A simple list of strings, kept in insertion order.

First of all, I had to install Redis and learn how to run it. The installation files and the instructions for running them are on the Redis site. Redis is open source and can also be downloaded from GitHub.

Redis site: https://redis.io/

Github: https://github.com/antirez/redis

$ wget http://download.redis.io/releases/redis-4.0.8.tar.gz
$ tar xzf redis-4.0.8.tar.gz
$ cd redis-4.0.8
$ make

$ src/redis-server
$ src/redis-cli

Let's test a simple key-value.

In the above image, you can see that Redis is installed correctly. Next, let's measure the performance of Redis. Redis itself provides a tool for measuring its performance. The performance of all software depends on the specifications of the server, so let's first look at the specifications of the server.

First, let's check the processing speed according to the data size, in bytes. Up to 10000 bytes the processing speed does not change much, but beyond that it decreases sharply.

Second, I measured the time taken to process 1000000 commands for each command. We can see that the processing times are similar, except for the GET command.

I tested the HSET command, which stores a field in a hash, separately, because HSET will be tested in detail in Stage 2.

Monday 12 March 2018

Lab6 - Inline Assembler

Part1.

An inline assembler lets you embed assembly language in the source code of a high-level language. When the compiler translates the source code, it translates the embedded assembly into machine language along with the surrounding code. In other words, it means using assembler instructions directly in a high-level language. In this lab, we will test the inline assembler. First, run the example source on the aarchie server.

// vol_simd.c :: volume scaling in C using AArch64 SIMD
// Chris Tyler 2017.11.29-2018.02.20

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {

    int16_t*        in;        // input array
    int16_t*        limit;        // end of input array
    int16_t*        out;        // output array

    // these variables will be used in our assembler code, so we're going
    // to hand-allocate which register they are placed in
    // Q: what is an alternate approach?
    register int16_t*    in_cursor     asm("r20");    // input cursor
    register int16_t*    out_cursor    asm("r21");    // output cursor
    register int16_t    vol_int        asm("r22");    // volume as int16_t

    int            x;        // array interator
    int            ttl;        // array total

    in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

    srand(-1);
    printf("Generating sample data.\n");
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand()%65536)-32768;
    }

// --------------------------------------------------------------------

    in_cursor = in;
    out_cursor = out;
    limit = in + SAMPLES ;

    // set vol_int to fixed-point representation of 0.75
    // Q: should we use 32767 or 32768 in next line? why?
    vol_int = (int16_t) (0.75 * 32767.0);

    printf("Scaling samples.\n");

    // Q: what does it mean to "duplicate" values in the next line?
    __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

    while ( in_cursor < limit ) {
        __asm__ (
            "ldr q0, [%[in]],#16        \n\t"
            // load eight samples into q0 (v0.8h)
            // from in_cursor, and post-increment
            // in_cursor by 16 bytes

            "sqdmulh v0.8h, v0.8h, v1.8h    \n\t"
            // multiply each lane in v0 by v1*2
            // saturate results
            // store upper 16 bits of results into v0
            
            "str q0, [%[out]],#16        \n\t"
            // store eight samples to out_cursor
            // post-increment out_cursor by 16 bytes

            // Q: what happens if we remove the following
            // two lines? Why?
            : [in]"+r"(in_cursor)
            : "0"(in_cursor),[out]"r"(out_cursor)
            );
    }

// --------------------------------------------------------------------

    printf("Summing samples.\n");
    for (x = 0; x < SAMPLES; x++) {
        ttl=(ttl+out[x])%1000;
    }

    // Q: are the results usable? are they correct?
    printf("Result: %d\n", ttl);

    return 0;

}

The image below is the result of running the above code.

Now, let's answer the questions in the code.

Q: what is an alternate approach?
An alternative approach is not to pin the variables to specific registers, but to let the compiler choose the registers and pass the variables to the assembler code through the input and output operand constraints of the asm statement.

Q: should we use 32767 or 32768 in next line? why?
The range of int16_t values is -32,768 to 32,767. Therefore, we must use the value 32,767.

Q: what does it mean to "duplicate" values in the next line?
It means copying the value of the vol_int variable into each of the eight 16-bit lanes of the vector register v1 (v1.8h).

Q: what happens if we remove the following two lines? Why?
If these two lines are removed, a segmentation fault will occur.

Q: are the results usable? are they correct?
Yes, the results are correct and usable.

Assembly language maps 1:1 to machine language, whereas a single C statement is compiled into several CPU instructions. In that sense, C can be seen as a more convenient way of expressing what would be tedious to write in assembly.
For this reason, C is often described as a low-level, machine-friendly language. If the compiler optimizes well, the result will be almost the same as what you would have coded in assembly. That makes C a good choice for writing speed-critical programs such as operating systems.

Sunday 4 March 2018

Lab5 - Algorithm Selection

In this lab, we will look at how the choice of algorithm affects program execution time. To understand this, I will use digital sound as an example. First, measure the execution time with the clock() function in the sample code on the aarchie server.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
    return (int16_t) (volume_factor * (float) sample);
}

int main() {

    // Allocate memory for large in and out arrays
    int16_t*    in;
    int16_t*    out;
    in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

    int        x;
    int        ttl = 0;

    clock_t start, end;
    float ftime;

    // Seed the pseudo-random number generator
    srand(-1);

    // Fill the array with random data
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand()%65536)-32768;
    }

    start = clock();

    // ######################################
    // This is the interesting part!
    // Scale the volume of all of the samples
    for (x = 0; x < SAMPLES; x++) {
        out[x] = scale_sample(in[x], 0.75);
    }
    // ######################################

    end = clock();

    // Sum up the data
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl+out[x])%1000;
    }

    ftime = (float)(end - start) / CLOCKS_PER_SEC;

    // Print the sum
    printf("Result: %d\n", ttl);
    printf("Running time: %f sec.\n", ftime);

    return 0;

}

The result obtained by executing the above source is shown in the figure below.

Now let's expand the source and test it. The second test is as follows.
1. Pre-calculate a lookup table (array) of all possible sample values multiplied by the volume factor, and look up each sample to get the scaled values.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
    return (int16_t) (volume_factor * (float) sample);
}

int main() {

    // Allocate memory for large in and out arrays
    int16_t*    in;
    int16_t*    out;
    in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

    int16_t* lookUp = calloc(65536, sizeof(int16_t));

    int        x;
    int        ttl = 0;

    clock_t start, end;
    float ftime;

    // Seed the pseudo-random number generator
    srand(-1);

    // Fill the array with random data
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand()%65536)-32768;
    }

    for (x = 0; x < 65536; x++) {
        lookUp[x] = (x - 32768) * 0.75;
    }

    start = clock();

    // ######################################
    // This is the interesting part!
    // Scale the volume of all of the samples
    for (x = 0; x < SAMPLES; x++) {
        out[x] = (lookUp[in[x] + 32768]);
    }
    // ######################################

    // Sum up the data
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl+out[x])%1000;
    }

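    // Note: the timer is stopped after the summing loop here, so this
    // measurement also includes the time spent summing.
    end = clock();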

    ftime = (float)(end - start) / CLOCKS_PER_SEC;

    // Print the sum
    printf("Result: %d\n", ttl);
    printf("Running time: %f sec.\n", ftime);

    return 0;

}

Finally, the test contents are as follows.
2. Convert the volume factor 0.75 to a fix-point integer by multiplying by a binary number representing a fixed-point value "1". For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00. Shift the result to the right the required number of bits after the multiplication (>>8 if you're using 256 as the multiplier).

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
    return (int16_t)(volume_factor * (float)sample);
}

int main() {

    // Allocate memory for large in and out arrays
    int16_t*    in;
    int16_t*    out;
    in = (int16_t*)calloc(SAMPLES, sizeof(int16_t));
    out = (int16_t*)calloc(SAMPLES, sizeof(int16_t));

    int16_t* lookUp = calloc(65536, sizeof(int16_t));

    int        x;
    int        ttl = 0;

    clock_t start, end;
    float ftime;

    // Seed the pseudo-random number generator
    srand(-1);

    // Fill the array with random data
    for (x = 0; x < SAMPLES; x++) {
        in[x] = (rand() % 65536) - 32768;
    }

    start = clock();

    // ######################################
    // This is the interesting part!
    // Scale the volume of all of the samples
    for (x = 0; x < SAMPLES; x++) {
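        // 0.75 in fixed point is 192 (256 represents 1.00); multiply as
        // integers, shift back, and only then narrow to int16_t
        out[x] = (int16_t)((in[x] * (int32_t)(0.75 * 256)) >> 8);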
    }
    // ######################################

    end = clock();

    // Sum up the data
    for (x = 0; x < SAMPLES; x++) {
        ttl = (ttl + out[x]) % 1000;
    }

    ftime = (float)(end - start) / CLOCKS_PER_SEC;

    // Print the sum
    printf("Result: %d\n", ttl);
    printf("Running time: %f sec.\n", ftime);

    return 0;

}

Let's compare the three tests at the same time.

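Comparing the three tests, the second test (lookup table) was the slowest, although its timer also covers the summing loop, so it is not measured on quite the same basis as the other two. The third test (fixed-point) is not much different from the first, and it is the most optimized of the three sources.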

Wednesday 28 February 2018

Lab4 - Vectorization Lab

In this lab, we use the GCC compiler to examine SIMD (Single Instruction Multiple Data) vectorization and auto-vectorization. First of all, what is SIMD? It is a type of parallel processing that calculates multiple values simultaneously with a single instruction. It is used in vector processors and often in multimedia applications such as video game consoles and graphics cards.

Below is the test source, which I then compiled.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    int arrRand1[1000];
    int arrRand2[1000];
    int arrSum[1000];
    long int total = 0;
    srand(time(NULL));

    for (int i = 0; i<1000; i++) {        
        arrRand1[i] = rand() % 2000 - 1000;
        arrRand2[i] = rand() % 2000 - 1000;

        arrSum[i] = arrRand1[i] + arrRand2[i];
        total += arrSum[i];
    }

    printf("%ld\n", total);

    return 0;
}

Disassemble the simple program above.

0000000000400560 <main>:
  400560:       d13f83ff        sub     sp, sp, #0xfe0
  400564:       d2800000        mov     x0, #0x0                        // #0
  400568:       a9007bfd        stp     x29, x30, [sp]
  40056c:       910003fd        mov     x29, sp
  400570:       a9025bf5        stp     x21, x22, [sp, #32]
  400574:       5289ba75        mov     w21, #0x4dd3                    // #19923
  400578:       a90153f3        stp     x19, x20, [sp, #16]
  40057c:       913f83b6        add     x22, x29, #0xfe0
  400580:       f9001bf7        str     x23, [sp, #48]
  400584:       72a20c55        movk    w21, #0x1062, lsl #16
  400588:       5280fa14        mov     w20, #0x7d0                     // #2000
  40058c:       910103b7        add     x23, x29, #0x40
  400590:       97ffffd8        bl      4004f0 <time@plt>
  400594:       97ffffeb        bl      400540 <srand@plt>
  400598:       97ffffde        bl      400510 <rand@plt>
  40059c:       97ffffdd        bl      400510 <rand@plt>
  4005a0:       9b357c01        smull   x1, w0, w21
  4005a4:       b84046e2        ldr     w2, [x23], #4
  4005a8:       9367fc21        asr     x1, x1, #39
  4005ac:       eb1602ff        cmp     x23, x22
  4005b0:       4b807c21        sub     w1, w1, w0, asr #31
  4005b4:       1b148020        msub    w0, w1, w20, w0
  4005b8:       510fa000        sub     w0, w0, #0x3e8
  4005bc:       0b020000        add     w0, w0, w2
  4005c0:       8b20c273        add     x19, x19, w0, sxtw
  4005c4:       54fffea1        b.ne    400598 <main+0x38>  // b.any
  4005c8:       aa1303e1        mov     x1, x19
  4005cc:       90000000        adrp    x0, 400000 <_init-0x4b8>
  4005d0:       911ee000        add     x0, x0, #0x7b8
  4005d4:       97ffffdf        bl      400550 <printf@plt>
  4005d8:       a9407bfd        ldp     x29, x30, [sp]
  4005dc:       52800000        mov     w0, #0x0                        // #0
  4005e0:       a94153f3        ldp     x19, x20, [sp, #16]
  4005e4:       a9425bf5        ldp     x21, x22, [sp, #32]
  4005e8:       f9401bf7        ldr     x23, [sp, #48]
  4005ec:       913f83ff        add     sp, sp, #0xfe0
  4005f0:       d65f03c0        ret
  4005f4:       00000000        .inst   0x00000000 ; undefined

Vectorization performs the same operation on successive data elements. It relies on SIMD (Single Instruction Multiple Data) instructions, with which the same operation is performed concurrently on several consecutive values. Naturally, vectorization can give higher performance than Single Instruction Single Data (SISD) processing, in which each instruction processes a single value.

Tuesday 27 February 2018

Lab 3 - Loop

In this lab, I implement loops in assembly language on x86_64 and AArch64. The loops extend the "Hello world" program, which was written as below.

 .text  
 .globl     _start  
   
 _start:  
      movq     $len,%rdx        /* message length */  
      movq      $msg,%rsi       /* message location */  
      movq     $1,%rdi          /* file descriptor stdout */  
      movq     $1,%rax          /* syscall sys_write */  
      syscall  
   
      movq     $0,%rdi          /* exit status */  
      movq     $60,%rax         /* syscall sys_exit */  
      syscall  
   
 .section .rodata  
   
 msg:     .ascii   "Hello, world!\n"  
      len = . - msg

The first implementation prints the numbers 0 to 9 on the screen. Below is the code implemented on the x86_64 platform.

 .text  
 .globl  _start  
   
 start = 0                  /* starting value for the loop */  
 max = 10                   /* ending value of loop */  
   
 _start:  
      mov      $start,%r15     /* loop index */  
   
 loop:  
      mov      %r15,%r14    /* copy loop index */  
      add      $48,%r14                 
      mov      %r14b,msg+6                 
   
      movq     $len,%rdx    /* message length */  
      movq     $msg,%rsi    /* message location */  
      movq     $1,%rdi      /* file descriptor stdout */  
      movq     $1,%rax      /* syscall sys_write */  
      syscall  
   
      inc      %r15         /* increment index */  
      cmp      $max,%r15    /* see if we're done */  
      jne      loop         /* loop if we're not */  
   
      movq     $0,%rdi      /* exit status */  
      movq     $60,%rax     /* syscall sys_exit */  
      syscall  
   
 .section .data  
   
 msg:     .ascii   "Loop: !\n"  
      len = . - msg

The second one displays the numbers 0 to 30 on the screen, with 0 to 9 preceded by a zero. Below is the code implemented on the x86_64 platform.

 .text  
 .globl     _start  
   
 start = 0                  /* starting value for the loop */  
 max = 31                   /* end loop number */  
   
 _start:  
      mov   $start,%r15     /* loop index */  
      mov   $0x30, %r12                 
   
 loop:  
    mov  $'0',%r14   
    mov  $10,%r13   
    mov  $0,%rdx   
    mov  %r15,%rax   
    div  %r13   
    cmp  $0,%rax   
   
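    mov  %rax,%r13        /* quotient: tens digit */
    add  %r14,%r13        /* convert to ASCII */
    mov  %r13b,msg+6      /* store one byte only */

    mov  %rdx,%r12        /* remainder: ones digit */
    add  %r14,%r12        /* convert to ASCII */
    mov  %r12b,msg+7      /* store one byte only, keeping the newline */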
   
    movq     $len,%rdx    /* message length */  
    movq     $msg,%rsi    /* message location */  
    movq     $1,%rdi      /* file descriptor stdout */  
    movq     $1,%rax      /* syscall sys_write */  
    syscall  
   
    inc      %r15         /* increment index */  
    cmp      $max,%r15    /* see if we're done */  
    jne      loop         /* loop if we're not */  
   
    movq     $0,%rdi      /* exit status */  
    movq     $60,%rax     /* syscall sys_exit */  
    syscall  
   
 .section .data  
   
 msg: .ascii "Loop:   \n"  
      len = . - msg

Finally, we implement the same loop as above on the AArch64 platform.

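 .text
 .globl  _start
 start = 0                  /* starting value for the loop */
 max = 31                   /* end loop number */
 digit = 10                 /* divisor used to split the digits */

 _start:
     mov   x19, start       /* loop index */
     mov   x22, digit

 loop:
     adr   x1, msg          /* message location */
     udiv  x20, x19, x22    /* x20 = index / 10 (tens digit) */
     msub  x21, x22, x20, x19   /* x21 = index - 10 * x20 (ones digit) */

     add   x20, x20, 0x30   /* convert tens digit to ASCII */
     strb  w20, [x1,6]

     add   x21, x21, 0x30   /* convert ones digit to ASCII */
     strb  w21, [x1,7]

     mov   x0, 1            /* file descriptor stdout */
     mov   x2, len          /* message length */
     mov   x8, 64           /* syscall sys_write */
     svc   0

     add   x19, x19, 1      /* increment index */
     cmp   x19, max         /* see if we're done */
     bne   loop             /* loop if we're not */

     mov   x0, 0            /* exit status */
     mov   x8, 93           /* syscall sys_exit */
     svc   0

 .data
     msg: .ascii "Loop:   \n"
     len = . - msg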

The x86_64 and AArch64 platforms use different instructions, and with each instruction we must specify the address value either directly or indirectly. Learning assembler gives a good understanding of computer systems and architecture, as well as a better understanding of memory.