```c
struct P1 { char a; int b; char c; int d; };
struct P2 { long a; char b; int c; char d; };
struct P3 { char a[2]; short b[5]; };
struct P4 { char *a[4]; short b[3]; };
struct P5 { struct P2 a; struct P3 b[3]; };
```
|  | offset: a | b | c | d | total size | alignment requirement |
|---|---|---|---|---|---|---|
| P1 |  |  |  |  |  |  |
| P2 |  |  |  |  |  |  |
| P3 |  |  |  |  |  |  |
| P4 |  |  |  |  |  |  |
| P5 |  |  |  |  |  |  |
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

int len(char *s) {
    return strlen(s);
}

void lptoa(char *s, long *p) {
    long val = *p;
    sprintf(s, "%ld", val);
}

long longlen(long x) {
    long v;
    char buf[8];
    v = x;
    lptoa(buf, &v);
    return len(buf);
}

int main() {
    longlen(INT_MAX - 1);
}
```

Compile the code with and without the stack protector.
Fill the table with appropriate values. Leave the entries empty if not applicable.
|  |  | no stack protector | stack protector |
|---|---|---|---|
|  | gcc flag |  |  |
| len | assembly for allocating stack |  |  |
|  | stack size in decimal |  |  |
|  | assembly for freeing stack |  |  |
| lptoa | assembly for allocating stack |  |  |
|  | stack size in decimal |  |  |
|  | assembly for freeing stack |  |  |
|  | "char *s" address relative to rsp after entering lptoa |  |  |
|  | "long *p" address relative to rsp after entering lptoa |  |  |
|  | "val" address relative to rsp after entering lptoa |  |  |
| longlen | assembly for allocating stack |  |  |
|  | stack size in decimal |  |  |
|  | assembly for freeing stack |  |  |
|  | "x" address relative to rsp after entering longlen |  |  |
|  | "v" address relative to rsp after entering longlen |  |  |
|  | "buf" address relative to rsp after entering longlen |  |  |
|  | canary register name |  |  |
|  | canary address relative to rsp |  |  |
|  | canary value |  |  |
|  | assembly for erasing canary value |  |  |
|  | assembly for canary cross check |  |  |
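One way to produce the two builds for comparison (the flag names are from the GCC manual; the source and output filenames here are just examples):

```shell
# Build without the stack protector and keep the assembly output
gcc -O0 -fno-stack-protector -S prog.c -o prog_nossp.s

# Build with the stack protector forced on for every function;
# plain -fstack-protector instruments only functions with large-enough
# local buffers, so -all makes the prologue/epilogue changes easy to spot
gcc -O0 -fstack-protector-all -S prog.c -o prog_ssp.s

# Compare the prologues and epilogues of the two versions
diff prog_nossp.s prog_ssp.s
```

In the protected build, look for the canary load in the prologue and the `__stack_chk_fail` call in the epilogue of each function.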
For the Wednesday section, you have to follow the aforementioned pages.
For both sections, read Section 3.11, Floating-Point Code.
Write a C function for each of the following assembly routines. cvtsd2ss converts a scalar double-precision number (a double) to scalar single precision (a float), while cvtss2sd converts a scalar single-precision number (a float) to scalar double precision (a double).
```asm
unknown1:
        subss   %xmm1, %xmm0
        ret

unknown2:
        subsd   %xmm1, %xmm0
        ret

unknown3:
        movaps  %xmm0, %xmm1
        cvtsd2ss (%rdi), %xmm0
        addss   %xmm0, %xmm1
        cvtss2sd %xmm1, %xmm2
        movsd   %xmm2, (%rdi)
        ret

unknown4:
        movsd   (%rdi), %xmm1
        movapd  %xmm1, %xmm2
        subsd   %xmm0, %xmm2
        movsd   %xmm2, (%rdi)
        movapd  %xmm1, %xmm0
        ret

unknown5:
        pxor    %xmm0, %xmm0
        movl    $0, %eax
        jmp     .L13
.L14:
        movss   (%rdx,%rax,4), %xmm1
        mulss   (%rsi,%rax,4), %xmm1
        addss   %xmm1, %xmm0
        addq    $1, %rax
.L13:
        cmpq    %rdi, %rax
        jb      .L14
        rep ret

unknown6:
        pxor    %xmm0, %xmm0
        movl    $0, %eax
        jmp     .L10
.L11:
        movsd   (%rdx,%rax,8), %xmm1
        mulsd   (%rsi,%rax,8), %xmm1
        addsd   %xmm1, %xmm0
        addq    $1, %rax
.L10:
        cmpq    %rdi, %rax
        jb      .L11
        rep ret
```
See the attached figure, which shows four steps that overlap two images (B and E) to produce G. Given the two images B and E, you are asked to produce image G with no blue background. Assume each image is 1K x 1K pixels (1 MB), where each pixel is one byte representing one of 256 colors.

For a 1024 x 1024 (1 MB) image, a simple-minded loop would require 1 million iterations (1K * 1K), where each iteration works on one pixel. For large images, this simple-minded loop is prohibitively expensive. Besides CUDA and OpenCL for GPGPU programming, a solution to this expensive computation is the SSE/AVX/AVX512 instructions, which allow one instruction to operate on 16, 32, or 64 bytes at a time. With SSE/AVX/AVX512, a 1024-byte row of the image can be processed in 64, 32, or 16 iterations, so the entire image requires only 64K, 32K, or 16K iterations instead of 1 million.
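Before vectorizing, it helps to write the per-pixel logic as a scalar loop; the vector instructions below compute exactly this compare/and/andnot/or pattern, 16, 32, or 64 pixels per iteration. A minimal sketch, assuming the blue background is encoded as a single byte value `BLUE` (a made-up color-table index, not specified in the assignment):

```c
#include <stdint.h>
#include <stddef.h>

#define BLUE 0xE0  /* assumed color-table index of the blue background */

/* For each pixel: if B's pixel is blue, take E's pixel; otherwise keep B's.
   mask = (b == BLUE) ? 0xFF : 0x00, then g = (mask & e) | (~mask & b) --
   the branch-free form that PCMPEQ / ANDP / ANDNP / POR compute in parallel. */
void overlay(const uint8_t *b, const uint8_t *e, uint8_t *g, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t mask = (b[i] == BLUE) ? 0xFF : 0x00;
        g[i] = (uint8_t)((mask & e[i]) | (~mask & b[i]));
    }
}
```

The vectorized version replaces the comparison with a packed byte compare (which produces the all-ones/all-zeros mask directly) and the select with packed and/andnot/or.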
The assembly instructions required for this problem are variations of PCMPEQ, ANDP, ANDNP, and POR, which are documented in the Intel manual (https://software.intel.com/en-us/articles/intel-sdm#combined). For example, the AVX512 instructions to overlap the two images are variations of the following:

```asm
VPCMPEQD zmm1, zmm2, zmm3/m512                  ; cmp
VANDPS   zmm1, zmm2, zmm3/m512                  ; and
VANDNPS  zmm1, zmm2, zmm3/m512                  ; !and
VPORQ    zmm1 {k1}{z}, zmm2, zmm3/m512/m64bcst  ; or
```

Assuming