Wednesday, September 9, 2009

Refactoring Parrot Startup

jrtayloriv earned his commit-bit this week, no doubt on the back of his wonderful work with the patch that led to the kill_parrot_cont branch (which merged a few days ago). The first thing he did was dive into the new gc-refactor branch, which is doing something that I've wanted for a while: making the GC pluggable. He's doing a lot of great work already, and I look forward to following along as he makes more progress. I'll talk more about this branch and some other GC ideas that have been floating around in another post.

As part of his work, jrtayloriv wants to add the ability to specify the GC core on the command line, in the same way that you can currently specify the runcore. However, while looking into this, he noticed a problem: all command-line argument processing happens inside IMCC, which is called after the interpreter is initialized. Since the GC is initialized when the interpreter is, we've already created the GC core before we've read the command-line option that is supposed to select it. This is bad.

I've mentioned before that the current startup situation is really a mess. We create an interpreter, call into IMCC, and all program loading and execution happens from there. This is backwards and lousy. We shouldn't be delegating all operation to IMCC, especially when it's on the chopping block in the long term. At the very least we are going to eventually replace IMCC with PIRC. A far better solution would involve a pluggable parser situation, but that's far beyond the scope of this blog post. Let's instead talk about things that need to change in the short term.

cotto posted a patch a few days ago to add a function to the embedding interface that would parse through the program's argv array and convert it into a ResizableStringArray. I can't find the link right now, but essentially he iterated over the argv array and pushed all the strings onto a ResizableStringArray. This would be ideal, because Parrot can work very easily with these PMC objects. It would also open up a lot of cool opportunities to access these options from PIR. However, this raises the immediate problem that we can't allocate a PMC from the GC that contains options for the GC, until the GC is initialized. We could maybe create the PMC on the stack, but then we would need to create all the STRING objects that go into that array there too, and that's a huge problem. We don't want to be allocating all these things using malloc, for certain.

Let's ignore this particular chicken-and-egg problem for a little while. Instead let's take a look at what Parrot should be doing. On program startup we parse through the incoming arguments, separating out the args that need to go to IMCC, the args that go to the interpreter initialization, and the args that need to go to the executing PIR/PBC program. This needs to happen first, before we create anything. The first thing we create after the args are sorted out is the interpreter.
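To make that sorting step concrete, here is a minimal C sketch of what it might look like. Everything in it is an assumption for illustration: the names `SortedArgs`, `sort_args`, and `MAX_ARGS` are made up, the `--gc` option is the hypothetical one jrtayloriv wants to add, and the way options are assigned to buckets is not Parrot's real option table. The one property worth noticing is that it only stores pointers into argv in fixed-size arrays, so nothing is allocated from malloc or the GC before the interpreter exists.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_ARGS 64  /* hypothetical limit for this sketch */

/* Three destinations for command-line arguments (names are hypothetical). */
typedef struct {
    const char *interp[MAX_ARGS];    /* interpreter initialization */
    const char *compiler[MAX_ARGS];  /* IMCC, or whichever compiler runs */
    const char *program[MAX_ARGS];   /* the PIR/PBC program itself */
    size_t n_interp, n_compiler, n_program;
    const char *gc_name;             /* value of the hypothetical --gc option */
} SortedArgs;

/* Sort argv into the three buckets without allocating anything.
 * Returns 0 on success, -1 on a missing option value or too many args. */
int sort_args(int argc, char **argv, SortedArgs *out)
{
    int i;

    assert(argc >= 1 && argv != NULL && out != NULL);
    *out = (SortedArgs){0};
    if (argc > MAX_ARGS)
        return -1;

    for (i = 1; i < argc; i++) {  /* argv[0] is the executable name */
        const char *arg = argv[i];
        int is_gc = strcmp(arg, "--gc") == 0;

        if (is_gc || strcmp(arg, "--runcore") == 0) {
            /* Assumed to take a value and to belong to interpreter init. */
            if (i + 1 >= argc)
                return -1;
            if (is_gc)
                out->gc_name = argv[i + 1];
            out->interp[out->n_interp++] = arg;
            out->interp[out->n_interp++] = argv[++i];
        } else if (strcmp(arg, "-o") == 0) {
            /* Assumed to take a value and to belong to the compiler. */
            if (i + 1 >= argc)
                return -1;
            out->compiler[out->n_compiler++] = arg;
            out->compiler[out->n_compiler++] = argv[++i];
        } else if (arg[0] == '-') {
            /* Simplification: any other option goes to the compiler. */
            out->compiler[out->n_compiler++] = arg;
        } else {
            /* First non-option is the program file; it and everything
             * after it belong to the running program. */
            for (; i < argc; i++)
                out->program[out->n_program++] = argv[i];
        }
    }
    return 0;
}
```

With something like this run first, the interpreter could be created with `gc_name` already known, and the `program` bucket could be turned into a ResizableStringArray later, once the GC is up.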

The next steps are more open to interpretation. What I would like, ideally, would be to not call IMCC immediately, but instead enter the runloop immediately and call the PIR/PASM compreg from there. Notice from some of my descriptions below that we will want to expand the capabilities of the PIR/PASM compreg to handle what we want to add. IMCC, instead of directly executing the subs it compiles, would return a reference to the starting sub to the runloop, which invokes it and continues execution from there. No recursion, no nonsense. We do run into a problem when dealing with special subs such as :postcomp, :immediate, and :load, but it isn't insurmountable. What we can do to get past it is integrate the scheduler into the process. The IMCC compreg compiles all its subroutines, adding any :immediate, :load or :init subs, depending on the command-line options, to the scheduler. We're obviously going to run into problems trying to do this with :immediate subs, but I don't think those problems are insurmountable either.

Since we enter into the runloop immediately, we need a stub program to run that will get the ball rolling. Here is what I think it could be:

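.sub _internal_main
    .param string type
    .param string filename
    .param pmc args :slurpy
    # Look up the compiler registered for this source type
    $P0 = compreg type
    # Compile the file; special subs are queued, :main is returned
    $P1 = $P0.'compile_main'(filename)
    # Hypothetical new opcode: run the queued :immediate, :init, and :load subs
    execute_immediate_subs $P0
    # Start the program
    $P1()
.end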

This little bootstrapped entry point function does four things: it finds the appropriate compreg to use to execute the program; it compiles the file, which adds some subs to the scheduler and returns :main; it executes all the :immediate, :init, and/or :load subs (using a hypothetical new opcode); and then it invokes the :main function to get the program started. Initially we would be able to support PASM, PIR, and PBC compregs (the last of which would need to be written for this purpose), but eventually we could include other languages as well and call them directly. It's worthwhile at this juncture, although not something I want to talk about in depth yet, to consider that Parrot may eventually treat a higher-level language than PIR or PASM as its default native language. Another post for another time.
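For readers who think more easily in C than in PIR, the same control flow can be modeled as a toy. This is only a sketch under loud assumptions: `Scheduler`, `Compiler`, `schedule`, `run_pending`, and `internal_main` are invented names, not Parrot's embedding API, and the `demo_*` functions are a fake "compiler" that exists purely so the flow can be exercised. The point it illustrates is the ordering: the compile step queues special subs and hands back the starting sub, the queue is drained, and only then does the starting sub run, with no recursion back into the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16  /* hypothetical queue size for this sketch */

typedef void (*Sub)(void);  /* stand-in for a compiled sub */

/* Stand-in for the scheduler: a FIFO queue of subs waiting to run. */
typedef struct {
    Sub pending[MAX_PENDING];
    size_t count;
} Scheduler;

/* Stand-in for a compreg: compiles a file, queues any special subs,
 * and returns the starting sub (NULL on failure). */
typedef Sub (*Compiler)(const char *filename, Scheduler *sched);

int schedule(Scheduler *s, Sub sub)
{
    if (s->count >= MAX_PENDING)
        return -1;
    s->pending[s->count++] = sub;
    return 0;
}

/* Plays the role of the hypothetical execute_immediate_subs opcode. */
void run_pending(Scheduler *s)
{
    size_t i;
    for (i = 0; i < s->count; i++)
        s->pending[i]();
    s->count = 0;
}

/* Mirrors _internal_main: compile, run queued subs, then run main. */
int internal_main(Compiler compile, const char *filename)
{
    Scheduler sched = { {0}, 0 };
    Sub main_sub = compile(filename, &sched);

    if (main_sub == NULL)
        return -1;
    run_pending(&sched);
    main_sub();
    return 0;
}

/* Demo only: a fake compiler whose subs record the order they ran in. */
char trace[8];
size_t trace_len;

static void note(char c)
{
    if (trace_len < sizeof trace - 1) {
        trace[trace_len++] = c;
        trace[trace_len] = '\0';
    }
}

void demo_load(void) { note('L'); }
void demo_init(void) { note('I'); }
void demo_main(void) { note('M'); }

Sub demo_compile(const char *filename, Scheduler *sched)
{
    (void)filename;  /* a real compiler would read this file */
    if (schedule(sched, demo_load) != 0 || schedule(sched, demo_init) != 0)
        return NULL;
    return demo_main;
}
```

Calling `internal_main(demo_compile, "hello.pir")` runs the two queued subs in the order they were scheduled and then the starting sub. Passing the compiler in as a function pointer stands in for looking it up by `type` with compreg.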

A second option, which I like slightly less but which is probably easier to do, is to call into IMCC before entering the runloop and have IMCC return the starting sub, which is then passed to the runloop and invoked. I don't have a strong aversion to this idea, but there are a few reasons why I would prefer to get away from it (and I'll be happy to share those with anybody who is interested).

So these are some of my long-term ideas about how Parrot startup should happen. There are obviously some issues to work out here, and I would love to hear some feedback about these things. I don't know when I would have time to work on this, but I would love to try and squeeze it in before or shortly after 2.0.
