```c
struct P1 { char a; int b; char c; int d; };
struct P2 { long a; char b; int c; char d; };
struct P3 { char a[2]; short b[5]; };
struct P4 { char *a[4]; short b[3]; };
struct P5 { struct P2 a; struct P3 b[3]; };
```
|  | offset: a | b | c | d | total size | alignment requirement |
|---|---|---|---|---|---|---|
| P1 |  |  |  |  |  |  |
| P2 |  |  |  |  |  |  |
| P3 |  |  |  |  |  |  |
| P4 |  |  |  |  |  |  |
| P5 |  |  |  |  |  |  |
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

int len(char *s) {
    return strlen(s);
}

void lptoa(char *s, long *p) {
    long val = *p;
    sprintf(s, "%ld", val);
}

long longlen(long x) {
    long v;
    char buf[8];
    v = x;
    lptoa(buf, &v);
    return len(buf);
}

int main() {
    longlen(INT_MAX - 1);
}
```

Compile the code with and without the stack protector.
Fill the table with appropriate values. Leave the entries empty if not applicable.
|  |  | no stack protector | stack protector |
|---|---|---|---|
|  | gcc flag |  |  |
| len | assembly for allocating stack |  |  |
|  | stack size in decimal |  |  |
|  | assembly for freeing stack |  |  |
| lptoa | assembly for allocating stack |  |  |
|  | stack size in decimal |  |  |
|  | assembly for freeing stack |  |  |
|  | "char *s" address relative to rsp after entering lptoa |  |  |
|  | "long *p" address relative to rsp after entering lptoa |  |  |
|  | "val" address relative to rsp after entering lptoa |  |  |
| longlen | assembly for allocating stack |  |  |
|  | stack size in decimal |  |  |
|  | assembly for freeing stack |  |  |
|  | "x" address relative to rsp after entering longlen |  |  |
|  | "v" address relative to rsp after entering longlen |  |  |
|  | "buf" address relative to rsp after entering longlen |  |  |
|  | canary register name |  |  |
|  | canary address relative to rsp |  |  |
|  | canary value |  |  |
|  | assembly for erasing canary value |  |  |
|  | assembly for canary cross check |  |  |
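One way to produce the two builds for comparison (the flag names are from the GCC manual; the source and output filenames here are just examples):

```shell
# Build without the stack protector and keep the assembly output
gcc -O0 -fno-stack-protector -S prog.c -o prog_nossp.s

# Build with the stack protector forced on for every function;
# plain -fstack-protector instruments only functions with large-enough
# local buffers, so -all makes the prologue/epilogue changes easy to spot
gcc -O0 -fstack-protector-all -S prog.c -o prog_ssp.s

# Compare the prologues and epilogues of the two versions
diff prog_nossp.s prog_ssp.s
```

In the protected build, look for the canary load in the prologue and the `__stack_chk_fail` call in the epilogue of each function.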
For the Wednesday section, you have to follow the aforementioned pages.
For both sections, read Section 3.11, Floating-Point Code.
Write a C function for each of the following assembly routines. cvtsd2ss converts a scalar double-precision number (a double) to scalar single precision (a float), while cvtss2sd converts a scalar single-precision number (a float) to scalar double precision (a double).
```asm
unknown1:
        subss   %xmm1, %xmm0
        ret

unknown2:
        subsd   %xmm1, %xmm0
        ret

unknown3:
        movaps  %xmm0, %xmm1
        cvtsd2ss (%rdi), %xmm0
        addss   %xmm0, %xmm1
        cvtss2sd %xmm1, %xmm2
        movsd   %xmm2, (%rdi)
        ret

unknown4:
        movsd   (%rdi), %xmm1
        movapd  %xmm1, %xmm2
        subsd   %xmm0, %xmm2
        movsd   %xmm2, (%rdi)
        movapd  %xmm1, %xmm0
        ret

unknown5:
        pxor    %xmm0, %xmm0
        movl    $0, %eax
        jmp     .L13
.L14:
        movss   (%rdx,%rax,4), %xmm1
        mulss   (%rsi,%rax,4), %xmm1
        addss   %xmm1, %xmm0
        addq    $1, %rax
.L13:
        cmpq    %rdi, %rax
        jb      .L14
        rep ret

unknown6:
        pxor    %xmm0, %xmm0
        movl    $0, %eax
        jmp     .L10
.L11:
        movsd   (%rdx,%rax,8), %xmm1
        mulsd   (%rsi,%rax,8), %xmm1
        addsd   %xmm1, %xmm0
        addq    $1, %rax
.L10:
        cmpq    %rdi, %rax
        jb      .L11
        rep ret
```
See the attached figure, which shows four steps that overlap two images (B and E) to produce G. Given the two images B and E, you are asked to produce image G with no blue background. Assume each image is 1K x 1K pixels (1 MB), where each pixel is one byte representing one of 256 colors.

For a 1024 x 1024 (1 MB) image, a simple-minded loop would require 1 million iterations (1K * 1K), where each iteration works on one pixel. For large images, this simple-minded loop is prohibitively expensive. Besides CUDA and OpenCL for GPGPU programming, a solution to this expensive computation is the SSE/AVX/AVX512 instructions, which allow one instruction to operate on 16, 32, or 64 bytes at a time. With SSE/AVX/AVX512, a 1024-byte row of the image can be processed in 64, 32, or 16 iterations, so the entire image requires only 64K, 32K, or 16K iterations instead of 1 million.
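Before vectorizing, it helps to write the per-pixel logic as a scalar loop; the vector instructions below compute exactly this compare/and/andnot/or pattern, 16, 32, or 64 pixels per iteration. A minimal sketch, assuming the blue background is encoded as a single byte value `BLUE` (a made-up color-table index, not specified in the assignment):

```c
#include <stdint.h>
#include <stddef.h>

#define BLUE 0xE0  /* assumed color-table index of the blue background */

/* For each pixel: if B's pixel is blue, take E's pixel; otherwise keep B's.
   mask = (b == BLUE) ? 0xFF : 0x00, then g = (mask & e) | (~mask & b) --
   the branch-free form that PCMPEQ / ANDP / ANDNP / POR compute in parallel. */
void overlay(const uint8_t *b, const uint8_t *e, uint8_t *g, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t mask = (b[i] == BLUE) ? 0xFF : 0x00;
        g[i] = (uint8_t)((mask & e[i]) | (~mask & b[i]));
    }
}
```

The vectorized version replaces the comparison with a packed byte compare (which produces the all-ones/all-zeros mask directly) and the select with packed and/andnot/or.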
The assembly instructions required for this problem are variations of PCMPEQ, ANDP, ANDNP, and POR, which are documented in the Intel manual (https://software.intel.com/en-us/articles/intel-sdm#combined). For example, the AVX512 instructions to overlap the two images are variations of the following:

```asm
VPCMPEQD zmm1, zmm2, zmm3/m512                  ; cmp
VANDPS   zmm1, zmm2, zmm3/m512                  ; and
VANDNPS  zmm1, zmm2, zmm3/m512                  ; !and
VPORQ    zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst  ; or
```

Assuming