Learning LLVM IR

IR Structure

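Module: each module contains a list of global variables, a list of functions, a list of the libraries (or other modules) it depends on, a symbol table, and various data about the target's characteristics.
Function: a function as in any programming language, made up of a function signature and a number of basic blocks. The first basic block in a function is called the entry block.
BasicBlock: a sequence of instructions executed in order, with a single entry and a single exit. Control flow can only enter the block at its first instruction, and no instruction other than the last one can jump out of sequence to somewhere else. Every basic block ends with a terminator instruction, usually a branch to another basic block or a ret that returns from the function.
Instruction: the smallest executable unit in LLVM IR. In the textual form, each instruction sits on its own line.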
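
The containment hierarchy above can be pictured with a small toy model. This is only a sketch for illustration, not LLVM's actual API: the class names mirror the concepts, while the fields, the TERMINATORS set (a subset of the real terminator opcodes) and the is_well_formed check are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: a small subset of LLVM's terminator opcodes, enough for the sketch.
TERMINATORS = {"ret", "br", "switch", "unreachable"}


@dataclass
class Instruction:
    opcode: str  # e.g. "call", "br", "ret"
    text: str    # the full line of IR, kept only for display

    @property
    def is_terminator(self) -> bool:
        return self.opcode in TERMINATORS


@dataclass
class BasicBlock:
    label: str
    instructions: List[Instruction] = field(default_factory=list)

    def is_well_formed(self) -> bool:
        """True if the block has exactly one terminator and it comes last."""
        if not self.instructions:
            return False
        *body, last = self.instructions
        return last.is_terminator and not any(i.is_terminator for i in body)


@dataclass
class Function:
    name: str
    blocks: List[BasicBlock] = field(default_factory=list)

    @property
    def entry_block(self) -> BasicBlock:
        # The first basic block of a function is its entry block.
        return self.blocks[0]


@dataclass
class Module:
    name: str
    global_variables: List[str] = field(default_factory=list)
    functions: List[Function] = field(default_factory=list)


# A hypothetical module with one function and one basic block.
main = Function("main", [
    BasicBlock("entry", [
        Instruction("call", "%1 = call i32 @puts(...)"),
        Instruction("ret", "ret i32 0"),
    ]),
])
module = Module("hello.c", ["@str"], [main])
```

A block whose last instruction is not a terminator, or that has a terminator in the middle, fails is_well_formed, which matches the single-entry, single-exit rule described above.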

IR Syntax

Obtaining the IR

$ clang -emit-llvm -c hello.c -o hello.bc   # binary bitcode
$ clang -emit-llvm -S hello.c -o hello.ll   # textual IR

Basic Syntax

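Comments in LLVM assembly start with a semicolon (;) and run to the end of the line.
Global identifiers start with the @ character. All function names and global variables must start with @.
Local identifiers start with a percent sign (%). The regular expression for named identifiers is [%@][-a-zA-Z$._][-a-zA-Z$._0-9]*.
LLVM has a strong type system. Integer types are written iN, where N is the width of the integer in bits, for example i32 and i64.
Array types are written [<number of elements> x <element type>] (vector types use angle brackets instead: <N x T>). The string "Hello World!" can use the type [13 x i8]: 12 characters at 1 byte each, plus 1 extra byte for the terminating NUL character.
The global string constant for the hello-world string is declared as follows: @hello = constant [13 x i8] c"Hello World!\00". The keyword constant declares a constant and is followed by its type and value.
LLVM lets you both declare and define functions. A definition starts with the keyword define, followed by the return type and then the function name. A simple definition of main returning a 32-bit integer looks like: define i32 @main() { ; some LLVM assembly code that returns i32 }.
Function declarations: taking puts as an example, declare i32 @puts(i8*). The declaration starts with the keyword declare, followed by the return type, the function name, and the function's optional parameter list. Declarations must be at global scope.
Every function returns through a return instruction, which has two forms: ret <type> <value> or ret void. For a simple main routine, ret i32 0 is enough.
Functions are called with call <function return type> <function name> <optional function arguments>. Note that every argument must be preceded by its type. Calling a function test that returns a 6-bit integer and takes a 36-bit integer looks like this: call i6 @test( i36 %arg1 ).
If a function's entry block has no explicit label, it is assigned the label %0; the first unnamed temporary in that block is then %1, and so on.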
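
Two of the rules above, the identifier pattern and the sizing of a string constant, can be checked mechanically. The following is a minimal sketch, not part of any LLVM tooling: the helper names is_named_identifier and string_constant are made up for this example, the pattern is the one given in the LLVM language reference (it includes the hyphen), and string_constant assumes plain ASCII text with no double quotes or backslashes that would need escaping.

```python
import re

# Pattern for named identifiers, as given in the LLVM language reference.
# Unnamed values such as %1 use a separate numeric form and do not match it.
NAMED_IDENTIFIER = re.compile(r"[%@][-a-zA-Z$._][-a-zA-Z$._0-9]*")


def is_named_identifier(text: str) -> bool:
    """Return True if the whole string is a named LLVM identifier."""
    return NAMED_IDENTIFIER.fullmatch(text) is not None


def string_constant(name: str, value: str) -> str:
    """Build a global string constant declaration for an ASCII string.

    Assumption: value contains no characters that need escaping in IR.
    """
    data = value.encode("ascii")
    length = len(data) + 1  # one extra byte for the terminating NUL
    return f'@{name} = constant [{length} x i8] c"{value}\\00"'


print(is_named_identifier("@main"))   # True
print(is_named_identifier("%arg1"))   # True
print(is_named_identifier("main"))    # False: missing the @ or % prefix
print(string_constant("hello", "Hello World!"))
# @hello = constant [13 x i8] c"Hello World!\00"
```

The array length comes out as 13 for "Hello World!" because the 12 visible characters are followed by the \00 terminator.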

Example Walkthrough

The explanations are all in the comments.

; ModuleID = 'hello.c'   
source_filename = "hello.c"  ; name of the source file
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"  
; the data layout above and the target triple below describe the target architecture
target triple = "x86_64-pc-linux-gnu"
; Defines the string constant. unnamed_addr marks the address as not significant (only the content matters), so two identical constants can be merged.
@str = private unnamed_addr constant [14 x i8] c"Hello worl1d.\00"

; Function Attrs: nounwind uwtable
; Function definition. local_unnamed_addr marks the address as not significant within this module.
define i32 @main() local_unnamed_addr #0 {
; call the puts function
  %1 = tail call i32 @puts(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @str, i64 0, i64 0))
  ret i32 0
}

; Function Attrs: nounwind
; function declaration
declare i32 @puts(i8* nocapture readonly) local_unnamed_addr #1

attributes #0 = { nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+fxsr,+mmx,+sse,+sse2,+x87" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { nounwind }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 6.0.0-1ubuntu2 (tags/RELEASE_600/final)"}

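getelementptr:

The first operand of getelementptr is a pointer to the global string variable. The first index, i64 0, steps through that pointer: because the base operand of getelementptr must always be a pointer, the first index offsets the pointer itself, in units of the pointed-to type [14 x i8]. A value of 0 means an offset of 0 elements from that pointer. My development machine runs 64-bit Linux, so the pointer is 8 bytes. The second index (i64 0) selects element 0 of the string, and the resulting i8* pointer to that element is what gets passed to puts as its argument.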
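
The address arithmetic behind the two indices can be written out as a small calculation. This is a sketch under simplifying assumptions: it only covers the two-index case on a pointer to an array, as in the example above, and the function name gep_byte_offset and its parameters are made up for illustration.

```python
def gep_byte_offset(array_len: int, elem_size: int, first: int, second: int) -> int:
    """Byte offset for a two-index getelementptr on a pointer to an array.

    Models: getelementptr [array_len x T], [array_len x T]* %p, i64 first, i64 second
    where elem_size is the size of T in bytes.
    """
    array_size = array_len * elem_size  # size of one whole [array_len x T]
    # The first index steps over whole arrays, the second over elements inside one.
    return first * array_size + second * elem_size


# The example from the IR above: [14 x i8], indices (0, 0).
print(gep_byte_offset(14, 1, 0, 0))  # 0: the address of the first character
print(gep_byte_offset(14, 1, 0, 5))  # 5: the sixth character of the string
print(gep_byte_offset(14, 1, 1, 0))  # 14: one whole array past the start
```

With both indices at 0 the result is the address of @str itself, only with the type changed from a pointer to the array into a pointer to its first i8 element.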

References:

https://llvm.zcopy.site/docs/langref/

http://www.nagain.com/activity/article/7/

https://www.ibm.com/developerworks/cn/opensource/os-createcompilerllvm1/index.html