Inside RubyVM
What is a Virtual Machine
A virtual machine is a simulation of a computer system. When we talk about VMs in the context of programming languages, we mostly mean process VMs, which are designed to let us run a program in a platform-independent environment.
One of the most popular process VMs is the JVM. It allows developers using the JDK to write programs without worrying about the platform. A Java program is first compiled into bytecode, a set of special instructions that the JVM interprets and executes. The JVM also uses a JIT compiler to speed up execution, but that’s a different topic.
What is RubyVM
The Ruby VM, also known as YARV, was introduced by Koichi Sasada[1] in 2004. It was originally developed as a Ruby C extension and reuses most of Ruby's existing features, such as the GC, inline caching, and the Ruby script parser. It was designed as an attempt to make Ruby faster by applying lessons from other languages like Java. Unlike the JVM, however, YARV does not convert code into machine code directly (as JVM HotSpot/JIT does); instead, the Abstract Syntax Tree is compiled into YARV instructions, which can be interpreted faster. The goal was to introduce new features such as JIT and AOT compilation one step at a time.
JIT is already available in the 2.6.0 preview. If you are interested in JIT, there is an article about it.[2]
Welcome YARV
Prior to Ruby 1.9, we used MRI, also known as Matz’s Ruby Interpreter. As the name suggests, it was a pure interpreter: no compilation step was needed for Ruby code, which made it slow most of the time. Execution came down to the following stages.
Ruby Code → Tokenize → AST → Execute
This means that when we write any Ruby code, Ruby's C code tokenizes it, parses it into an Abstract Syntax Tree, and executes those nodes. So if we were to look at the actual machine code being run by the CPU, it would trace back to the Ruby C code that walks the AST, not to our Ruby code.
When YARV was introduced, one more step was added after parsing the tokens into an AST: instead of executing the nodes directly, Ruby compiles them into YARV instructions.
So the execution stages now look like this:
Ruby Code → Tokenize → AST → YARV Inst → Execute
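To make these stages a little more concrete, here is a minimal sketch that runs one expression through each of them from Ruby itself. It assumes a released Ruby (2.6 or later), where the AST API is exposed as RubyVM::AbstractSyntaxTree rather than the preview's RubyVM::AST, and it uses the standard-library Ripper lexer purely as a stand-in for the tokenizing stage; the interpreter does not literally call Ripper.

```ruby
require 'ripper' # standard-library lexer, used here only to illustrate the token stage

source = "1 + 1"

# Stage 1: tokenize. Ripper.tokenize returns the raw token strings.
tokens = Ripper.tokenize(source)
puts "tokens: #{tokens.inspect}"

# Stage 2: parse into an AST. Released Rubies (2.6+) call this
# RubyVM::AbstractSyntaxTree; the 2.6.0 preview exposed it as RubyVM::AST.
ast_api = defined?(RubyVM::AbstractSyntaxTree) ? RubyVM::AbstractSyntaxTree : RubyVM::AST
ast = ast_api.parse(source)
puts "ast root type: #{ast.type}"

# Stage 3: compile the source down to YARV instructions.
iseq = RubyVM::InstructionSequence.compile(source)
puts iseq.disasm

# Stage 4: execute the compiled instruction sequence.
result = iseq.eval
puts "result: #{result}"
```

Each stage is a separate call here only for illustration; when you run a script normally, Ruby performs all of them for you.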
YARV is a stack-based virtual machine: it maintains a stack while executing the instructions. Along with that, it also maintains a few registers and local tables. You can find the implementation details in Koichi’s RubyConf presentation[3].
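As a rough illustration of what "stack-based" means, the toy interpreter below executes a handful of instructions whose names are borrowed from YARV's disassembly of 2 + 2. It is a hypothetical, heavily simplified sketch: run_toy_vm and its instruction format are made up for this example, and real YARV also deals with registers, frames, local tables, and much more.

```ruby
# Toy interpreter for three YARV-like instructions. Hypothetical and heavily
# simplified: real YARV also tracks registers, control frames and local tables.
def run_toy_vm(instructions)
  stack = []
  instructions.each do |name, operand|
    case name
    when :putobject
      # Push a literal operand onto the stack.
      stack.push(operand)
    when :opt_plus
      # Pop two operands, add them, and push the result back.
      right = stack.pop
      left = stack.pop
      stack.push(left + right)
    when :leave
      # Finish execution; the value on top of the stack is the result.
      return stack.pop
    else
      raise ArgumentError, "unknown instruction: #{name}"
    end
  end
  nil
end

program = [[:putobject, 2], [:putobject, 2], [:opt_plus], [:leave]]
toy_result = run_toy_vm(program)
puts toy_result # prints 4
```

The point is only that operands are pushed first and operators consume them from the stack, so no instruction needs to name where its inputs live.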
RubyVM Modules
The RubyVM module provides some access to Ruby internals. It is meant for very limited purposes, such as debugging, prototyping, and research. Normal users must not use it.
> RubyVM.stat
=> {:global_method_state=>145, :global_constant_state=>1240, :class_serial=>8716}
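Despite the names, these are not counts of methods or constants. global_method_state and global_constant_state are counters that Ruby bumps whenever it has to invalidate its method or constant caches, for example when a method or constant is defined, and class_serial is the counter used to hand out serial numbers to classes. Comparing the values before and after an operation gives you a rough idea of how much it changed global state, say after you require a gem.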
> require('awesome_print')
=> true
> RubyVM.stat
=> {:global_method_state=>150, :global_constant_state=>1251, :class_serial=>9757}
Some of this information is also used for inline caching, the technique Ruby uses to speed up method lookups.
Aaron Patterson wrote a nice post about this.[4]
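The before-and-after comparison can be wrapped in a small helper. This is a sketch under a few assumptions: stat_delta and YarvDemo are hypothetical names invented here, and because the keys returned by RubyVM.stat differ between Ruby versions (newer releases no longer report global_method_state, for instance), the helper simply diffs whatever counters are present.

```ruby
# Snapshot the VM counters, run a block, and report which counters moved.
# The keys returned by RubyVM.stat vary between Ruby versions, so this
# sketch compares whatever keys happen to be present.
def stat_delta
  before = RubyVM.stat
  yield
  after = RubyVM.stat
  after.each_with_object({}) do |(key, value), delta|
    change = value - before.fetch(key, 0)
    delta[key] = change unless change.zero?
  end
end

delta = stat_delta do
  # Hypothetical change: define a new constant holding a class with one method.
  Object.const_set(:YarvDemo, Class.new { def greet; :ok; end })
end

puts delta.inspect
```

Which counters move, and by how much, depends on the Ruby version, so treat the output as something to explore rather than a fixed result.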
Ruby provides RubyVM::AST to learn more about parsing.
> root = RubyVM::AST.parse("1 + 1")
=> #<RubyVM::AST::Node(NODE_SCOPE(0) 1:0, 1:5): >
> root.methods - Object.methods
=> [:type, :first_column, :last_lineno, :last_column, :children, :first_lineno]
> root.children
=> [nil, #<RubyVM::AST::Node(NODE_OPCALL(36) 1:0, 1:5): >]
We can use these methods to find out which lines and columns a node spans and what type it has.
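For example, those accessors are enough to print an indented outline of the tree with each node's position. The sketch below assumes a released Ruby (2.6 or later), where the module is named RubyVM::AbstractSyntaxTree and children can contain plain values such as symbols or arrays alongside nodes; describe_nodes is a hypothetical helper written for this post.

```ruby
# Released Rubies (2.6+) name the module RubyVM::AbstractSyntaxTree;
# the 2.6.0 preview used RubyVM::AST.
ast_api = defined?(RubyVM::AbstractSyntaxTree) ? RubyVM::AbstractSyntaxTree : RubyVM::AST

# Collect [depth, type, "line:col-line:col"] for every node, depth first.
def describe_nodes(node, node_class, depth = 0, rows = [])
  # children may hold nil, symbols, arrays, etc., so skip anything that is not a node.
  return rows unless node.is_a?(node_class)

  span = "#{node.first_lineno}:#{node.first_column}-#{node.last_lineno}:#{node.last_column}"
  rows << [depth, node.type, span]
  node.children.each { |child| describe_nodes(child, node_class, depth + 1, rows) }
  rows
end

rows = describe_nodes(ast_api.parse("1 + 1"), ast_api::Node)
rows.each { |depth, type, span| puts "#{'  ' * depth}#{type} #{span}" }
```

The exact node types printed for the literals vary between Ruby versions, so the outline may not match the preview output shown above line for line.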
We can keep walking this tree by hand, but Ruby also provides an easier way to print it: the --dump parsetree option.
$> ruby --dump parsetree -e '1+1'
###########################################################
## Do NOT use this node dump for any purpose other than  ##
## debug and research.  Compatibility is not guaranteed. ##
###########################################################

# @ NODE_SCOPE (line: 1)
# +- nd_tbl: (empty)
# +- nd_args:
# |   (null node)
# +- nd_body:
#     @ NODE_PRELUDE (line: 1)
#     +- nd_head:
#     |   (null node)
#     +- nd_body:
#     |   @ NODE_CALL (line: 1)
#     |   +- nd_mid: :+
#     |   +- nd_recv:
#     |   |   @ NODE_LIT (line: 1)
#     |   |   +- nd_lit: 1
#     |   +- nd_args:
#     |       @ NODE_ARRAY (line: 1)
#     |       +- nd_alen: 1
#     |       +- nd_head:
#     |       |   @ NODE_LIT (line: 1)
#     |       |   +- nd_lit: 1
#     |       +- nd_next:
#     |           (null node)
#     +- nd_compile_option: false
We can see AST Nodes and their children.
Now, what is the AST useful for? The most common application of an AST is static analysis of code. As a simple example, suppose we wish to count the number of method definitions in our program. We can write a simple script for this.
code = <<CODE
def foo
  puts "bar"
end

def bar
  puts "foo"
end
CODE

# Recursively count NODE_DEFN (method definition) nodes in the tree.
def count_method_definitions(node)
  # Children can be nil (or other non-node values), so skip anything that is not a node.
  return 0 unless node.is_a?(RubyVM::AST::Node)

  count = node.children.inject(0) do |total, child|
    total + count_method_definitions(child)
  end
  count += 1 if node.type.to_s == "NODE_DEFN"
  count
end

root = RubyVM::AST.parse(code)
count = count_method_definitions(root)
puts "Found #{count} method definitions"
And now we just call the script.
$> ruby count_definitions.rb
Found 2 method definitions
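The same walk generalizes to other questions about the code. As a sketch, the version below tallies every node type in one pass and then reads off both method definitions and method calls. It assumes the released API (RubyVM::AbstractSyntaxTree, Ruby 2.6 or later), where node types are symbols such as :DEFN and :FCALL instead of the preview's "NODE_DEFN" strings; tally_node_types is a hypothetical helper name.

```ruby
# Assumes the released API (Ruby 2.6+), where node types are symbols like :DEFN.
ast_api = RubyVM::AbstractSyntaxTree

# Count how many times each node type occurs in the tree.
def tally_node_types(node, node_class, counts = Hash.new(0))
  # children mixes nodes with plain values (symbols, nil, arrays); skip the latter.
  return counts unless node.is_a?(node_class)

  counts[node.type] += 1
  node.children.each { |child| tally_node_types(child, node_class, counts) }
  counts
end

sample = <<~RUBY
  def foo
    puts "bar"
  end

  def bar
    puts "foo"
  end
RUBY

counts = tally_node_types(ast_api.parse(sample), ast_api::Node)

# Node types that represent the different flavors of method call.
call_types = %i[CALL FCALL VCALL OPCALL QCALL]
definitions = counts[:DEFN]
calls = call_types.sum { |type| counts[type] }

puts "Found #{definitions} method definitions and #{calls} method calls"
```

Keeping the full tally around makes it cheap to ask follow-up questions, such as how many string literals or conditionals a file contains, without walking the tree again.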
Next, Ruby also provides RubyVM::InstructionSequence, which lets us see what instructions our AST gets compiled to.
> inst = RubyVM::InstructionSequence.compile("2 + 2")
=> <RubyVM::InstructionSequence:<compiled>@<compiled>:1>
> puts inst.disasm
== disasm: #<ISeq:<compiled>@<compiled>:1 (1,0)-(1,5)> (catch: FALSE)
0000 putobject 2 ( 1)[Li]
0002 putobject 2
0004 opt_plus <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0007 leave
=> nil
The compile method returns an InstructionSequence object, while the disasm method returns the instruction sequence as a human-readable string.
Give it a try with more complex code and you can see how blocks are handled, and even the local tables that are maintained while these instructions are executed. You can also use this to look at finer details such as variable scoping: you will see instructions such as setlocal for local variables and setglobal for global variables.
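Here is a small sketch of such an experiment. The snippet being compiled is hypothetical ($yarv_demo_global and the other names are made up for illustration), and since instruction names can differ slightly between Ruby versions (newer releases print specialized forms such as setlocal_WC_0), the check matches on the name prefix rather than the exact instruction.

```ruby
# A slightly richer snippet: a local variable, a global variable and a block.
source = <<~RUBY
  total = 1
  $yarv_demo_global = total + 1
  [total].each { |item| item }
RUBY

# compile only builds the instruction sequence; nothing is executed here.
iseq = RubyVM::InstructionSequence.compile(source)
listing = iseq.disasm
puts listing

# Instruction names can carry suffixes such as setlocal_WC_0 on newer Rubies,
# so match on the prefix rather than the exact name.
uses_setlocal = listing.include?("setlocal")
uses_setglobal = listing.include?("setglobal")
puts "setlocal: #{uses_setlocal}, setglobal: #{uses_setglobal}"
```

In the printed listing, the block shows up as its own instruction sequence, and each sequence that has local variables lists them in a local table.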
I am interested in finding out what more we can learn by looking at instruction sequences.
Again, as the documentation mentions, all of this is a work in progress, so expect things to break but eventually get better.
This is all I have for you.
I am still trying to learn more about this from a fantastic book by Pat Shaughnessy.[5]
If you have a deeper understanding of this, or if you find any incorrect information, please feel free to correct me in the comments.
References:
[1] http://www.atdot.net/~ko1/
[2] https://www.johnhawthorn.com/2018/02/playing-with-ruby-jit-mjit/
[3] http://www.atdot.net/yarv/RubyConf2004_YARV_pub.pdf
[4] https://tenderlovemaking.com/2015/12/23/inline-caching-in-mri.html