
Sunday, February 28, 2010

Proposal to Change find_method

Austin Hastings, as part of his Kakapo project (which I now have a commit bit to!), has started creating a mock object framework. We were talking about how to implement expected method calls, so I took a look at the find_method VTABLE of the Object PMC for some inspiration. What I saw was absolutely horrible, so I promptly created a branch to fix it. However, the more I looked and edited, the bigger I found the problems to be. I'll talk more about Kakapo in another post.

When I do code like this:

$P0 = new ['Foo']
$P0.'Bar'()

What is really happening is something similar to this:

$P0 = new ['Foo']
$P1 = find_method $P0, 'Bar'
callmethodcc $P0, $P1

Internally, the find_method opcode calls the VTABLE_find_method function on the given object. The object itself is then expected to walk the method resolution order (MRO) of its inheritance hierarchy to find a suitable method and return it. Along the way, the Object PMC needs to completely violate the encapsulation of the Class PMC to gather information about the MRO and then to search the list of methods in the Class for an entry with the given name. In short, the C code from Object.find_method looks like this:

int num_classes = VTABLE_elements(interp, class->all_parents);
int i;
for (i = 0; i < num_classes; i++) {
    /* grab the next class in the MRO and dig directly into its methods
       hash -- this is the encapsulation violation */
    PMC *cur_class = VTABLE_get_pmc_keyed_int(interp, class->all_parents, i);
    Parrot_Class_attributes *cur_attrs = PARROT_CLASS(cur_class);
    if (VTABLE_exists_keyed_str(interp, cur_attrs->methods, name))
        return VTABLE_get_pmc_keyed_str(interp, cur_attrs->methods, name);
}

So Object reads the attributes of its Class PMC directly, and manually traverses the MRO looking for the proper method. This causes a few problems. First, as a mostly stylistic point, this completely breaks encapsulation. We can't make a change to the MRO or the method storage and lookup mechanism in Class without likewise changing the behavior in Object.

Second, since Object needs to know how to traverse the MRO and look up methods, and requires intimate internal knowledge of the classes in the MRO, we are extremely limited in the types of objects that can be in the inheritance hierarchy. That is, we can't define our own metaobject types; we must use Class or PMCProxy, or a subclass thereof (and a careful reading of the code suggests that even subclasses will not work). This seems to be a remarkable limitation when you consider some of the diverse high-level languages that Parrot aims to support.

One thing I tried to do was create a find_method VTABLE in the Class PMC, and then delegate traversal of the MRO to Class instead of Object. This helped improve encapsulation greatly, but created another problem: Now I couldn't call methods on Class itself. Here's example code that broke:

$P0 = getclass 'Foo'
$P0.'add_vtable_override'("bar")

What we want to do is call a method on the class object itself, but what we end up doing is finding a method on objects of that type, and then trying to call that method on the class object. Problems.

Let's recap some issues:
  1. find_method searches for a method to use on a given invocant
  2. The Class type has methods that need to be accessible through find_method
  3. Object has to break encapsulation and monkey around in Class's internals, which means we can only use Class objects, and objects strictly isomorphic to Class (like PMCProxy) in an MRO
  4. We cannot delegate the method lookup operation to the Class object, where it arguably belongs.
With these things in mind, I had an idea that I sent to the list which aims to fix all this: Create a new VTABLE function that searches for a method in a metaobject, instead of searching for the method on the invocant (like find_method does now). In terms of PIR, I'm thinking of enabling this kind of sequence:

$P0 = new ['Foo']
$P1 = getclass 'Foo'
$P2 = find_class_method $P1, 'Bar'
callmethodcc $P0, $P2

I don't want to remove find_method or change it in any way. But what I want to have is a way to delegate method lookup to the Class object as well. I think we will find that when we have a way to delegate lookup to the Class object that we will use it much more frequently and to greater effect than we use find_method now. I also think we will find that find_method can eventually be deprecated entirely, but that's another issue for another time.

One other problem that I failed to mention above is that every class has its own completely linearized resolution order. So if Foo is a Bar, and Bar is a Baz, the Foo class has the MRO ("Foo", "Bar", "Baz"), Bar would have the MRO ("Bar", "Baz"), and Baz would have the MRO ("Baz"). Asking the Foo Class object for a method "Frobulate" would look in Bar, which would ask Baz. Then, Foo would move to the next item in its MRO, Baz, and ask it. The net result is that Baz would be queried twice, since the Foo Class item doesn't necessarily know that Baz is in Bar's MRO, and Bar doesn't know that it is being queried from Foo (maybe Bar was being queried directly). So what we need is some kind of way to keep track of the MRO up front, and avoid re-defining the search MRO for each new delegation.

I think we could solve this issue if we defined a new VTABLE like this:

VTABLE PMC * find_class_method(STRING *name, PMC *mro_iterator)

In this conception, SELF would be the metaobject currently being searched, name would be the string name of the method to find, and mro_iterator would be an iterator object for the MRO list. When we do the PIR code:

$P0 = getclass "Foo"
$P1 = find_class_method $P0, "Frobulate"

The first call to the Foo class object would be VTABLE_find_class_method("Frobulate", NULL). Foo would then create an iterator over its MRO (removing itself from the front of the list to avoid direct recursion) and pass that MRO iterator to Bar, which in turn calls the next item on the list (Baz).

This has a few major advantages which are not necessarily obvious up front. First, any object that defines find_class_method can be inserted into the MRO. This includes things that aren't really classes, like Roles, Mixins, extension methods, and even autoloaders. Second, we gain more flexibility to modify the MRO of a class, because that class (and its super-classes) can add additional search parents to the iterator as needed. We would also gain the ability to have more manual control over the MRO, because we could add a find_class_method_p_p_s_p op variant that also takes an existing MRO iterator. This would enable us to better implement something like a super() call, where we take the MRO iterator, manually pop the top item off it, and then call find_class_method with it. I've got several bonus points available to whoever can explain how to call a method in a super class when it's overridden in the subclass, without having to hard-code in the name of the parent class. With the new VTABLE and a new op, this becomes trivial.
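
To make the delegation chain concrete, here's a rough C sketch of how a metaobject might implement the new VTABLE. This is only a sketch of the proposal: mro_iter_new(), mro_iter_next(), and local_method_lookup() are helpers I'm inventing for illustration, and VTABLE_find_class_method doesn't exist yet.

/* sketch only: the proposed VTABLE written as a plain C function, where
   self is the metaobject currently being searched */
PMC *
find_class_method(PARROT_INTERP, PMC *self, STRING *name, PMC *mro_iterator)
{
    PMC *method;
    PMC *next;

    /* first call in the chain: build an iterator over our own MRO, with
       ourselves removed from the front to avoid direct recursion */
    if (PMC_IS_NULL(mro_iterator))
        mro_iterator = mro_iter_new(interp, self);

    /* do we provide the method ourselves? */
    method = local_method_lookup(interp, self, name);
    if (!PMC_IS_NULL(method))
        return method;

    /* no: delegate to the next metaobject in the MRO, handing the shared
       iterator along so no class is ever queried twice */
    next = mro_iter_next(interp, mro_iterator);
    if (PMC_IS_NULL(next))
        return PMCNULL;

    return VTABLE_find_class_method(interp, next, name, mro_iterator);
}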

So that's my idea for method lookups. I've sent a mail to the list with the idea, and I'm going to raise the idea at #ps if I can make it to the meeting. I think it has a lot of merit, enables a few cool new abilities and doesn't take away any existing functionality. I would like to hear any other ideas, but I'm becoming convinced that this one is a winner.

Wednesday, February 24, 2010

PDD23 Exceptions Critique

Following my post a few days ago, I would like to take a more in-depth look at PDD23, which lays out the specification for the exceptions subsystem. I hadn't intended to go through line-by-line, but in a lot of places I have to.

[Update: I wrote this post at the same time as I wrote the last one on the topic, but I delayed in posting this one until now. In the interim time, Austin created a page on the wiki to plan out a major refactor of the system and Tene started a branch to do some work. I'll post updates on both those things as they happen.]

exceptions are indications by running code that something unusual -- an "exception" to the normal processing -- has occurred. When code detects an exceptional condition, it throws an exception object. Before this occurs, code can register exception handlers, which are functions (or closures) which may (but are not obligated to) handle the exception. Some exceptions permit continued execution immediately after the throw; some don't.

Exceptions transfer control to a piece of code outside the normal flow of control. They are mainly used for error reporting or cleanup tasks.

This is, essentially, the preamble to the rest of the document and already shows some disconnect with reality. High-level languages are already using exceptions to handle normal control flow in some cases. In this case they are less "exception" and more "expection". I could go on and talk about how bad an idea it is to use exceptions for normal control flow for a variety of reasons, but I won't. I know that Parrot's control flow model still isn't mature enough to tackle all the cases that HLLs have been digging up, so exceptions are the only available mechanism to implement some structures. Also, as far as I am aware, no exception prevents resuming after the point of the throw. I believe that determination is left up to the handler.

When an exception is thrown, Parrot walks up the stack of active exception handlers, invoking each one in turn, but still in the dynamic context of the exception (i.e. the call stack is not unwound first).

I need to carefully read through some of the code again, but I'm pretty certain that this is patently false. ExceptionHandlers are implemented as Continuations which do rewind the call stack and are executed in the dynamic context of the function that contains the handler. Again, I need to look at all the code and the semantics in greater detail, but at the very least this is highly suspect.

Exception handlers can resume execution after handling the exception by
invoking the continuation stored in the 'resume' slot of the exception object. That continuation must be invoked with no parameters; in other words, throw never returns a value.

Not a problem here so much as a little nit. Why can't exception resumes return values? If you think about common exception uses in some popular programming languages this is never used. But when you consider that exceptions in Parrot are currently used, as I mention above, to implement complex control flow, you start to see that there is maybe some utility to it. Slightly more to the point, what if the resume object wasn't just a continuation pointing to the opcode after the throw instruction, but was instead a Sub object representing a lexically-scoped finally{} block that needed to be invoked? I can come up with a few ideas of places where the functionality to pass parameters to the resume continuation might be nice to have. It's interesting to consider that maybe we resume to a multi-sub, which dispatches to a post-handler routine based on signature? I have several ideas like this, and while they are all a little bit off the radar of current programming languages they are by no means unthinkable or undesirable in the long run. If it's possible to provide this, and Parrot's internal mechanics should certainly make it so, I don't see why we would artificially limit it.

The die opcode throws an exception of type exception;death and severity
except_error with a payload of message. The exception payload is a string PMC containing message.

I have been accused of being anti-Perl, and I maintain that I am not. Maybe I'll devote another blog post to the topic later. But I don't think Parrot needs a "die" op that does what it does here. I can understand and appreciate that Perl is very motivated by linguistic factors, and that Parrot has been traditionally very influenced by Perl. But Parrot's opcodes represent an assembly language, and using these kinds of linguistic features seems a little bit out of place. Why have "die", when we have "exit"?

The routines to search the op library are not linear. I think it uses a skip list, but I haven't studied the implementation enough to be able to say so definitively. What I do know is that the time it takes to search the oplist for a valid op name is proportional in some measure to the number of op "short" names. I think it's O(log n). As an example, die_s, die_p, and die_i_i all have the short name "die". IMCC, during lexical analysis, looks to determine whether an opcode exists in the library using its short name. Later in the process, IMCC hunts down the exact long name of the op, which again uses the same algorithm (skip list?) but looks at long names instead of short names. I'll spare more details on this point, but the lesson is clear: having fewer ops is better for IMCC's code generation performance. Having fewer short names (even if the number of ops remains the same) improves parsing performance in IMCC. For a PIR-based benchmark, we would see some improvement (though admittedly it would be very small) if we did nothing besides rename all "die" opcodes to "exit" instead.

When I see the word "die", it seems to me like it should do what it says: kill the program. Do not pass go. Do not collect 200 dollars. Die. I can't imagine having any other preconception about it. Why would the "die" opcode not make the program...die? At least, why not without an explicitly-defined mechanism to prevent it, such as how Perl5 uses eval()? So you can imagine my surprise that die seems to throw just another exception that can be caught. You can imagine how perplexed I was when calling

die 'Program is closing'

or

exit 0

didn't exit my program! Instead, I had to use

die 5, 0

to tell the system that yes, I actually wanted the program to shut down. Of course, now I can't supply a helpful message about why we need to die. It's also surprising to me that, for some reason, the exit opcode seems to have the same general behavior. It doesn't actually exit if you have a handler active, and doesn't have an overload that lets you manually specify a severity that forces an exit. So that seems pointless to me. Again, what else could the word "exit" mean besides "get out of my damn program"?

All exceptions will have at least message, severity, resume, and payload attributes.

There are three forms of the die opcode: die_s, die_p, and die_i_i. The first two basically throw a normal, catchable exception with the given argument treated as the string message to display to the user. The third form throws a normal, catchable exception with a user-definable severity and error code. The exit opcode has form exit_i, which throws a normal, catchable exception with only the given error code. The throw opcode has flavors throw_p and throw_p_p, which let you throw a given pre-constructed exception, optionally with a given resume continuation. This all seems like a hugely redundant waste of opcodes which all essentially do the same thing, but each of which only lets you specify a subset of the parameters that every exception object is supposed to provide. None of the opcodes allow you to specify a payload, even though the spec suggests (as I will discuss below) that the payload should be used for type filtering by HLLs, and the current implementation prevents proper type subclassing!

"die" lets you specify a message or a severity and error code, but doesn't actually make the program die. "exit" lets you specify an error code only, and doesn't necessarily make the program exit. "throw" lets you specify a pre-built exception and optionally a custom resume continuation only. Considering that every exception must have a message, severity, resume, and payload, this assortment of opcodes really doesn't make any sense at all.

I won't harp on opcodes any further in this post, but I think I've made my point: The ops we do have are a stupid mish-mash of the kinds of ops we need to work with exceptions. If every Exception must have resume, severity, type, and payload, why do our ops not support that? Why do we have die, when we have throw, rethrow, and exit? I highly suggest we slim down these opcodes. I think an exit_i opcode is fine, if it forces an exit in lieu of a specifically-defined exit-handler. That is, most handlers would not handle exit events by default, allowing the exit op to do what we expect. To catch and handle these types, which would be necessary in some places involving embedding or nesting, we could specifically define an exit-handler type that is capable of catching them.

I think a throw_p opcode is all we really need to throw other types of exceptions. Maybe, if we were worried about writing out all the PIR for constructing elaborate Exception objects, we could have a throw_p_s_i_i, which would set all four required attributes at once, and throw it.

Anyway, that's enough on this particular subtopic. But, on a tangent, I would like to suggest again that we try to find a good way to specify aggregate literals in PIR code. In this way we could specify exception constants (or proto-exception initializer objects) to reduce the runtime cost of constructing exceptions where things like the severity, type, and message are the same. The ability to specify ExceptionHandler constants in the code likewise would create a huge performance savings, especially when you consider that in a normally-operating program more ExceptionHandlers are created and registered than Exceptions.

count_eh Return the quantity of currently active exception handlers.

I'm not certain that we need an opcode for this, especially since I think it's used pretty infrequently. A method call on the current context object could provide the same info. A series of methods would allow fine-grained manipulation of the handler stack, which would be even better.

If no handler is found, and the exception is non-fatal (such as a warning), and there is a continuation in the exception record (because the throwing opcode was throw), invoke the continuation (resume execution). Whether to resume or die when an exception isn't handled is determined by the severity of the exception.


I'm not sure if the implementation follows the letter of the spec in regards to the "exception record". As far as I am aware, an unhandled exception doesn't automatically cause the program to resume normal control flow no matter what type it is. I need to check on this, but I have never witnessed this behavior. If it does exist, I apologize for not knowing about it, of course.

typedef enum {
    EXCEPT_normal  = 0,
    EXCEPT_warning = 1,
    EXCEPT_error   = 2,
    EXCEPT_severe  = 3,
    EXCEPT_fatal   = 4,
    EXCEPT_doomed  = 5,
    EXCEPT_exit    = 6
} exception_severity;

As Austin mentioned, there are way too many of these. Also, as I've found out experimentally, only EXCEPT_doomed actually causes Parrot to exit despite other severities having harmful-sounding names like "fatal", and "exit". In my mind we need only four severities, at most: Trivial, Normal, Fatal and Control. Anything else is superfluous, not just in theory but also in the code as it currently exists. Trivial exceptions can automatically resume if unhandled. Normal exceptions are ones that represent an error. They can be handled by any default handler, but cause a program exit when unhandled. Fatal exceptions mark an error that is typically unrecoverable unless a special exit handler has been specifically configured to catch such events. Control exceptions bypass the error-reporting system and are used to implement non-error control flow. I'm hard-pressed to come up with any other designations we would ever need for this mechanism.
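
In code, the slimmed-down set might look something like this; this is just my sketch of the proposal, not anything that exists in Parrot today:

typedef enum {
    EXCEPT_trivial,    /* resumes automatically if unhandled */
    EXCEPT_normal,     /* ordinary error: catchable, exits the program if unhandled */
    EXCEPT_fatal,      /* unrecoverable unless a dedicated exit handler is configured */
    EXCEPT_control     /* non-error control flow; bypasses error reporting entirely */
} exception_severity;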

typedef enum {
    EXCEPTION_BAD_BUFFER_SIZE,
    EXCEPTION_MISSING_ENCODING_NAME,
    EXCEPTION_INVALID_STRING_REPRESENTATION,
    EXCEPTION_ICU_ERROR,
    EXCEPTION_UNIMPLEMENTED,
    EXCEPTION_NULL_REG_ACCESS,
    EXCEPTION_NO_REG_FRAMES,
    EXCEPTION_SUBSTR_OUT_OF_STRING,
    EXCEPTION_ORD_OUT_OF_STRING,
    ...
} exception_type_enum;

There are a huge number of exception types, and they really seem superfluous when you consider that every exception must contain a message field with a human-readable message that describes it and a payload field that can contain any arbitrary object with additional data. I know that the intention with this huge list is to implement exception types without using subclasses. The reason for this is that subclasses can be largely expensive because each subclass needs to have its own VTABLE and other information, which can become prohibitive if we want to have more than a few types. I've recently put forward an idea for allowing extremely inexpensive subclasses which was inspired by exactly this problem. My idea was not without its caveats, of course, but it's not the only possible route to take to make the subclassing operation less expensive. That said...

The payload more specifically identifies the detailed cause/nature of
the exception. Each exception class will have its own specific payload type(s). See the table of standard exception classes for examples.


So every Exception has a payload, which can be a user-defined object type with information about the exception type, and it needs to have one of these dozens of enum values that indicates its type? This is all highly redundant, and there are at least two paths we could follow to make this system sane:
  1. Only have one type of Exception PMC with no subclasses. Get rid of the type enums. The Exception "type" can be determined from the user-specified payload, if any. Add opcodes or methods that better facilitate throwing an exception with a custom payload. We're likely going to need to define several "Payload" PMC types to handle those exceptions thrown by core. This would require implementing cheap subclasses, but has the benefit that built-in types can be overridden by HLL types if needed.
  2. Have many subclasses of Exception. Get rid of type enums. We only need a throw_p opcode and can construct "new ['ICUError']" objects or whatever we need. This is going to require implementation of cheap subclasses, and will allow HLL type overrides if needed.
Either way, a major improvement over what we have now.
Exceptions have been incorporated into built-in opcodes in a limited way. For the most part, they're used when the return value is either impractical to
check (perhaps because we don't want to add that many error checks in line), or where the output type is unable to represent an error state (e.g. the output I register of the ord opcode).


Color me stupid, but isn't consistency of interface a good thing? How do we know, without having to memorize the behavior of all 1302 ops, which throw exceptions to signal errors and which do not?

Other opcodes respond to an errorson setting to decide whether to throw an exception or return an error value.


I think this should be the default behavior. All ops should throw exceptions on error if "ops throw exceptions" is turned on. Otherwise, no ops do. This setting is cheap enough to toggle.
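
Concretely, every error-capable op body would then follow a single pattern of roughly this shape. This is a paraphrase, not real Parrot code: Parrot_ex_throw_from_c_args is the real C-level throw function and PARROT_ERRORS_test is the existing errorson check macro, but the PARROT_ERRORS_THROW_FLAG name is invented here, and `result` stands in for the op's output register.

/* inside some hypothetical op body that has just detected an error */
if (PARROT_ERRORS_test(interp, PARROT_ERRORS_THROW_FLAG))  /* illustrative flag */
    Parrot_ex_throw_from_c_args(interp, NULL, EXCEPTION_ORD_OUT_OF_STRING,
        "Cannot get character from empty string");
else
    result = -1;    /* "defanged": return an error value instead */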

{{ TODO: "errorson" as specified is dynamically rather than lexically
scoped; is this good? Probably not good. Let's revisit it when we get the basic exceptions functionality implemented. }}


Good point! Maybe an opcode for this isn't a great idea. Methods on the ParrotInterpreter object (to set global settings) and methods on the CallContext PMC (to set local settings) would be a good alternative. When is the basic implementation expected?

{{ NOTE: There are a couple of different factors here. One is the ability to globally define the severity of certain exceptions or categories of exceptions without needing to define a handler for each one. (e.g. Perl 6 may have pragmas to set how severe type-checking errors are. A simple "incompatible type" error may be fatal under one pragma, a resumable warning under another pragma, and completely silent under a third pragma.) Another is the ability to "defang" opcodes so they return error codes instead of throwing exceptions. We might provide a very simple interface to catch an exception and capture its payload without the full complexity of manually defining exception handlers (though it would still be implemented as an exception handler internally) }}


Another warning in the same vein as the previous note. The point here is that we may want to say that some opcodes throw exceptions, but that we may want those exceptions to have different effects under different "pragmas". This kind of system can be hugely expensive if every error-capable opcode needs to check not only whether to return an error code or throw an exception, but also what the severity of that exception is depending on a series of pragmata that, most likely, would need to be lexically-scoped anyway. Way too complicated. Far better is to enable cheap subclasses of Exception, and have the HLL hot-swap type-maps at runtime with different behaviors such as different severities. Or better yet, forget hot-swapping and instead introspect on the Exception subclasses' Class object to change the default severity values and behaviors. That way when the new Exception object is created, the initialization routine sets a different default severity, the op throws it no matter what, and the exceptions system handles things like it is supposed to do.

So that's my in-depth critique of the Exceptions PDD. I may make it a regular feature to go through other PDDs as well, and I'm sure I'll post other ideas, proposals, and insights for this system in the future as well.

Tuesday, February 23, 2010

Cheap Subclasses

I had an idea the other night when reading over PDD23. That PDD talks about the intention to have an entire hierarchy of exception types, but then mentions a caveat that having too many types is expensive. That got me to thinking, does it really have to be so expensive to make subtypes?

In Parrot, when we create a subtype we first create a new VTABLE struct. This struct contains function pointers to all the VTABLE interface functions, plus a small amount of metadata about the class: a string containing the class name, and a pointer to the Class or PMCProxy PMC that defines the type. On a very quick count tonight it looks like there are about 184 function pointers, and before the vtable_massacre branch merged there were significantly more. Together with the other fields, there are over 200 pointers (or fields of equivalent size) in that structure. It's a huge amount of memory to hold for every type, especially if HLLs are expecting to be able to create large numbers of their own types.

Now, consider a case like what is described in PDD23, where we have several exception subtypes which appear to differ from each other only by name. It's a huge waste to give each of these subtypes its own 184-pointer VTABLE structure, when they are all going to be mostly identical. It's absurd to do it that way, and this is probably a big reason why we don't support the subtypes as described in PDD23.

Consider now the case of user-defined classes and subclasses. This is, I suspect, the largest set of types for most applications. Every PIR-defined object type is an Object PMC, which means the VTABLE structure in C for every user-defined type is 99% identical to the VTABLE structure of Object. All the function pointers, all 184 of them, are identical. The associated NameSpace PMC (after chromatic's refactor, the Class PMC instead) contains a list of all the :vtable and :method Sub PMCs. The VTABLEs in Object all search the NameSpace for an override and then launch that override if provided. So for types defined in PIR, we don't need the whole VTABLE struct: just the pointer to the Class PMC that contains the info. We can point the VTABLE pointer to Object's VTABLE and use it without needing an expensive copy.

Instead of creating a Class PMC and a VTABLE structure with over 200 pointers, we only define the Class and the handful of defined overrides that we already define anyway. This is significant memory savings for applications that define many types.

There are two options to implement this kind of idea:
  1. Add a PMC* pointer to every PMC that points to the Class or PMCProxy object that controls it. This could create a mess in GC if Class and PMCProxies weren't marked constant.
  2. Define a new "PMCType" structure. PMCType would contain pointers like a string name, a Class PMC pointer, and maybe a VTABLE pointer. If we add this structure, PMCs get larger by one pointer. If we replace the VTABLE struct and include a pointer to a VTABLE in the PMCType, we have to suffer an additional pointer dereference per VTABLE call (with opportunities to cache).
So this system is not without its tradeoffs, but with it in place we gain the ability to define large numbers of cheap subclasses of built-in types, like what is specified in PDD23, and we also significantly simplify the process of creating new classes in PIR and reduce the amount of memory required for each type.
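
For illustration, option 2 might look something like this; the structure and field names are invented:

/* Hypothetical PMCType structure: one per type, shared by every PMC of
   that type. Each PMC would carry a single pointer to its PMCType instead
   of a direct VTABLE pointer. */
typedef struct PMCType {
    STRING *name;      /* the class name */
    PMC    *_class;    /* the Class or PMCProxy metaobject */
    VTABLE *vtable;    /* shared vtable: Object's for every PIR-defined type */
} PMCType;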

Parrot's Exceptions System

I've been vaguely unhappy with the exceptions system for a while now. Everybody knows that the implementation really hasn't caught up with the spec, and until now I've been pretty happy to write off all my problems as being an artifact of an incomplete implementation. Plus, I've seen some of the great work that some of our developers have done fixing various bugs and implementing various changes, and I'm always willing to let problems slide under the rug if I know good minds are working on them. Today, however, I was talking to Austin and he expressed some criticisms on IRC that really capture the thoughts I (and others) have had, and show that maybe it's the spec that's the problem, not the implementation:

I was going to embark on a rant about this, but then I read the PDD, and i realized the entire exception subsystem is a farce.

That which is documented is inadequate and poorly thought out. And that which is implemented doesn't do even remotely what is documented.

The pdd makes the assumption that exception filtering will be done based on 'type', but provides no mechanism for extending the 'types'. The logical (and widely popular) alternative is to filter based on subclass. The pdd's answer to that is that you can throw anything, if you just stuff it in the payload. So naturally, the parameters to the exception handler objects are the...

...exception and it's *message*.

The throw/rethrow ops differ in that rethrow marks the exception unhandled. IMO, rethrow should be transparent - particularly, the exception backtrace should still point at the original location where the exception occured. The pdd makes nothing of this, and naturally parrot gets it wrong.

There are too many categories of severity, too many attributes (backtrace versus resume versus thrower; severity versus exit code versus type versus class).


So there you have it, a pretty succinct criticism of Parrot's exception system. I'll be elaborating on some of these ideas in the next few days.

Monday, February 22, 2010

Haskell with LLVM

[Update 23 Feb 2010: I've been informed that this was not a JIT, but instead a native-code generation backend for LLVM demonstrating LLVM's aggressive optimization potential. These numbers are not representative of JIT performance.]

Several people sent me a link to a very interesting blog post yesterday about using LLVM to provide native code generation for Haskell in GHC. I recommend it as an interesting read.

One thing I will point out is that the blog post doesn't really explain the whole situation. He shows plenty of examples where LLVM improved performance, but only mentions briefly that this isn't typical of larger programs and that most programs won't experience as much speed up, if any. So to anybody who reads this remember the caveat that the results aren't typical. JIT speeds some things up and slows some things down. It's not a magic bullet that makes everything better.

ParrotProjects: February 2010 Edition

I haven't posted a ParrotProjects update in a while, but that doesn't mean development of new projects has slowed down at all. Quite the contrary, there are plenty of new projects popping up left and right.

Fun

I can't speak towards how enjoyable it might be, but Fun is an implementation of the Joy language on Parrot. It's still early in development, but it is exciting to have more functional languages targeting Parrot like this.

Digest-Dynpmcs

The Parrot repo currently contains a few dynamic PMCs (dynpmcs) for calculating digests such as MD2, MD4, MD5, and various SHA sums. It has been decided that these kinds of things should find a new home, so our own enterprising developer darbelo has forked them to a new home on Gitorious. Copies of these PMCs are still in the repo pending a deprecation cycle, but after the 2.3 release they will only live on Gitorious.

ParrotSDL

If you need to write any SDL applications, you might be excited to hear that bindings for the multimedia library for Parrot are in active development. ParrotSDL is still a new project and is navigating through some difficulties with Parrot's NCI system. The lead developer, Parrot newcomer kthakore, is also working on SDL bindings for Perl5, so he's very familiar with the whole system.

kthakore is looking for PIR coders to help with the project. Chat about ParrotSDL and the Perl port happens at irc://irc.perl.org/#sdl

NQ-NQP

You might think of NQP as being a language that only runs on Parrot, but you'd be wrong. Developer ash_ has been working on a variant of NQP that runs on top of LLVM instead. This compiler, which is not quite NQP, is a very interesting project and may help to inform our future use of LLVM for Parrot's JIT system.

Friday, February 19, 2010

Opcode and OpLib PMCs

A few days ago, after some discussion with NotFound and others on #parrot, I started a small branch to experiment with some new PMC types. The results of that work were the two new experimental PMC types Opcode and OpLib. The branch merged into trunk shortly after the 2.1.0 release, so now they are available--experimentally--for people to test and use.

OpLib provides an introspective accessor layer over the interpreter's op table. The OpLib allows us to get a current count of the number of opcodes currently loaded in the system. It can also be used to return the index number of an opcode specified by name, or the name of an opcode given by its number. On one hand it's important to hide these kinds of details from the average PIR user for reasons of backwards-compatibility and encapsulation. However, for the people writing PIR assemblers and disassemblers in PIR, the information is vital.

These PMC types are read-only types. You can use them to read information about the opcodes in the system, but you can't manipulate that information. However, I'm not against that capability entirely. Imagine the ability to remap an op number to a new custom opcode at runtime. This would allow us to write tools that can attach to live PIR code such as memory usage analyzers, profilers, watchdog monitors, etc. Of course, in most cases this capability would horribly crash the program if used incorrectly, but in the right hands it has much potential. This, if it happens at all, is a long way off.

These two PMC types are still immature but they, along with the ever-improving Packfile PMCs, are already starting to enable some cool new applications. We don't quite expose all the information yet that we need to do complete compilation or decompilation, and some improvements are needed in Parrot itself to fill in some of the remaining gaps, but we are getting closer.

Before the 3.0 release I think we will have a PIR/PASM compiler that runs on top of Parrot natively. This could be written in PIR, of course, or one of the other cool developing languages such as NQP, Winxed, or something else. With this, we could cut IMCC out of the loop almost entirely if we wanted. We could also easily come up with new assembly languages or language dialects for interacting with Parrot. My dislike for PIR is not a secret, so the ability to come up with another, better, assembly language for working with Parrot is an idea that makes me very happy.

Argument Passing Refactors

On Tuesday it was decided that the next round of PCC refactors should start this sprint. Allison created a branch for the task, after having created a detailed tasklist for it in the previous weeks. To understand what the point of the refactor is, I first need to describe the system as it is now.

When we make a function or method call in Parrot, we use fancy-schmancy PIR that looks like this:

($P0, $I0) = foo(1, 2.0, $S0)

This looks all well and good, and certainly makes the programmers happy to see familiar syntax. Internally, this call is anything but pretty. In PIR, we can construct a call using a more verbose syntax with some compiler directives:

.const 'Sub' foo = 'foo'
.begin_call
.set_arg 1
.set_arg 2.0
.set_arg $S0
.call foo
.get_result $P0
.get_result $I0
.end_call

This is much worse in terms of syntax and verbosity, but at least it makes good explicit sense: We find the sub object, we get the arguments, we call the function, then we get the result values. This seems all well and good, but this isn't the bottom layer of the cake. These things above are IMCC compiler directives, not actual bytecode. The actual bytecode of the file looks much more like this:

$P97 = find_name "foo"
$P98 = new ['String']
$P98 = "0x0010,0x0013,0x0001"
set_args $P98, 1, 2.0, $S0
$P99 = new ['FixedIntegerArray']
$P99[0] = 0x02
$P99[1] = 0x00
get_results $P99, $P0, $I0
invokecc $P97

There are a few things we can immediately see about this code listing that are a little bit obnoxious. I'll list them out in no particular order:
  1. get_results is called before invokecc. This means we are preparing to retrieve results before we've even called the function. The actual process of copying returns from the callee into the caller happens inside the callee. This creates a fundamental disconnect in a system that is supposed to be continuation-based.
  2. set_args takes a string PMC containing a string of hex values with flags corresponding to each argument. Inside set_args, that string needs to be painstakingly parsed to get a proper array of flags.
  3. The set_args and get_results opcodes both take variadic argument lists. It's impossible for something like a bytecode disassembler to figure out how much memory the opcode takes up without reading the first argument and determining how many flags are specified.
Allison's current branch is intending to address #1. She's going to reverse the logic so that results are collected after the returns are passed. This will allow us to unify the code paths that handle function calls and returns into a single function. Hopefully this will lead to a few optimizations.

#2 and #3 above are a little disconcerting for a variety of reasons. First, we have all the necessary information about the call at compile time. We have the number and types of the arguments, and all the associated flags that govern what they are and how they are used. All this information is passed directly to set_args, which uses it to build a CallContext PMC.

To recap, we have all the information we need to build the CallContext PMC at compile time.

So let's ignore for a second how stupid it is to iterate character-by-character over a String PMC to get the flags, when it's obvious that the results mechanism uses a much better suited integer array for the same purpose. The question isn't how we store the flags in the bytecode; it's why we're bothering to store them separately at all. Why don't we create a CallContext PMC constant, or maybe some new kind of "CallArguments" PMC constant at compile time, cache it in the bytecode in exactly the form we need the data to be in, and use that when performing calls?
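
To get a feel for the per-call cost we're talking about, here is the character-by-character decoding in miniature. This is a standalone toy, not Parrot's actual parsing code:

#include <stdio.h>
#include <stdlib.h>

/* Decode a set_args-style signature string like "0x0010,0x0013,0x0001".
   Work of roughly this shape has to happen on every call while the flags
   live in a string; with a FixedIntegerArray, or a precomputed
   call-arguments PMC cached in the bytecode, the parse simply disappears. */
int main(void) {
    const char *sig = "0x0010,0x0013,0x0001";
    char *p = (char *)sig;

    while (*p) {
        long flag = strtol(p, &p, 16);    /* parse one hex flag value */
        printf("flag: 0x%04lx\n", flag);
        if (*p == ',')
            p++;                          /* skip the separator */
    }
    return 0;
}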

The question is a rhetorical one, and I've opened a ticket to suggest we bring a little bit of sanity to this code and maybe see some serious performance wins as well. Since Allison is already working on this code, it should be pretty easy to build on that momentum and fix the last major wart that the calling code has.

Special Release: Parrot 2.1.1

Earlier today we got a bug report from the #perl6 folks that Parrot was leaking memory. chromatic put a fix together and it was decided to cut an emergency bug-fix release. Since Rakudo bases its releases on the previous Parrot release, and they can't really put out a release that is known to leak memory, they now have the special Parrot 2.1.1.

So if you were using 2.1.0 for your application and would like to plug a memory leak in long-running programs, please update to 2.1.1 instead.

Tuesday, February 16, 2010

Parrot 2.1.0 Released

Just a few moments ago Daniel Arbelo (darbelo) released Parrot 2.1.0 "As Scheduled". It's always nice to see a software project released on time and under budget!

Parrot 2.2.0 is scheduled for 16 March, and will be released by Christoph Otto (cotto).

Monday, February 15, 2010

Software Anti-Pattern: The Exception Repeater

Since I've seen this pattern several times in a software project I'm trying to maintain, I've decided to give it a name: The Exception Repeater anti-pattern. Here's a particularly succinct implementation of it, direct from the repo:

public static void ValidateData() {
    try {
        ValidateDataInner();
    } catch (Exception e) {
        throw e;
    }
}

Since I've seen this exact thing several times in the codebase today, I'm wondering if maybe it has some kind of magical benefit or side-effect that I'm not aware of. It doesn't; if anything it's actively harmful, since rethrowing a caught exception with "throw e;" in C# resets the exception's stack trace, where a bare "throw;" would at least preserve it. The only benefit that I can think of is that an unintelligent programmer can claim to be programming "defensively", or something like that.

Thursday, February 11, 2010

GSoC Idea: Immutable Strings

Immutable strings are an interesting idea. Fundamentally, in a system with immutable strings the actual string storage is considered immutable at the lowest level. This means that when we modify strings, we create copies instead of modifying the strings in place. Immutable strings carry some benefits. First and foremost, we never need to resize strings in place. This means that memory can be allocated more linearly, and can be kept track of more simply. Instead of a string header needing to contain both the size of the currently allocated buffer and the size of the string that buffer contains, we only need to store one size value.

When we look at a code example like this:

$S0 = foo($S0)

It will look to the programmer as if the string register $S0 has been modified to become whatever the output of the foo function is. However, it is only the header that has been modified: internally, the header will be pointing to an entirely new memory buffer.
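
As a toy model of that discipline (plain C, nothing Parrot-specific), every "modifying" operation allocates and returns a fresh string:

#include <stdlib.h>
#include <string.h>

/* A toy immutable string: one size field is enough, because the buffer is
   exactly as large as the string it holds and never grows in place. */
typedef struct Str {
    size_t len;
    char  *buf;
} Str;

Str *str_new(const char *src, size_t len) {
    Str *s = malloc(sizeof *s);
    s->len = len;
    s->buf = malloc(len);
    memcpy(s->buf, src, len);
    return s;
}

/* "appending" never touches a or b; the caller re-points its register at
   the result, exactly like $S0 = foo($S0) re-points $S0 */
Str *str_concat(const Str *a, const Str *b) {
    Str *r = malloc(sizeof *r);
    r->len = a->len + b->len;
    r->buf = malloc(r->len);
    memcpy(r->buf, a->buf, a->len);
    memcpy(r->buf + a->len, b->buf, b->len);
    return r;
}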

Parrot currently uses a system called "Copy on Write", or COW. In a COW system, making copies is cheap because we make incomplete copies. When we do this in Parrot:

$S1 = $S0

We aren't copying the string, we are only copying the header. Both headers will point to the same underlying memory buffer but now a COW flag will be set. If an attempt is made to modify either string, a complete copy is made first and then the modification happens. In theory this is a great optimization: We create copies lazily, and don't have to actually copy memory buffers (which can be expensive) until a modification is made somewhere.
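
Here's the same toy from above reworked as COW, again purely as an illustration: copying duplicates only the header, and the buffer is cloned lazily at the first write:

#include <stdlib.h>
#include <string.h>

typedef struct CowStr {
    size_t len;
    char  *buf;
    int    cow;    /* set when buf is shared with another header */
} CowStr;

/* $S1 = $S0 : duplicate only the header and mark both as shared */
CowStr cow_copy(CowStr *src) {
    CowStr dst = *src;
    src->cow = dst.cow = 1;
    return dst;
}

/* any write must un-share the buffer first */
void cow_set_char(CowStr *s, size_t i, char c) {
    if (s->cow) {
        char *fresh = malloc(s->len);
        memcpy(fresh, s->buf, s->len);
        s->buf = fresh;    /* other headers keep the old buffer */
        s->cow = 0;
    }
    s->buf[i] = c;
}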

The problem with the COW system is that Parrot has to make a lot of copies of strings, especially for strings which are constant or are being passed as parameters to subroutines. Consider this code:

$P0 = new ["foo"]
$S0 = typeof $P0
$S0 = $S0 . "bar"

Here $S0 is originally the typename of the $P0 object, "foo". Obviously the name of the class should be constant; if we were allowed to just rename the class like this, we could cause all sorts of problems in the code. So the typeof operator returns a COW copy of the string, so that any modifications happen to a lazy copy.

The problem is that most accesses of class type names are not like above, but are instead like shown below:

$S0 = typeof $P0
if $S0 == "foo" goto ...

or

$S0 = typeof $P0
$P1 = new [$S0]

In cases like these, $S0 is treated as a constant string in the code to key changes in program flow. Most of the time the output of the typeof operator is treated as a constant, so a copy doesn't need to be made, lazily or otherwise. In fact, having to make COW headers every time can become extremely wasteful. A similar situation happens in the PCC system, where strings passed as parameters to subroutines are passed as COW copies, not as straight references. This likewise can become extremely expensive when a COW header isn't needed.

I've rambled enough, I can talk about the theory behind COW and Immutable strings another time. For now, let's get to the GSoC project.

For GSoC, myself and several other developers (chromatic especially) would like to see Parrot converted to use immutable strings instead of COW. This could be done in several steps:
  1. Conversion of all string functions to take string arguments and return a string pointer result, instead of modifying argument pointers directly.
  2. Modification of the string allocator to use immutable strings instead of COW (involves changes to a particularly messy portion of the GC)
  3. Deprecating and eventually deleting all the functions that deal specifically with COW
  4. Creation of a new PMC type, which would be a piece-wise string builder type.
  5. Optimizations and improvements throughout the codebase to rely on the behavior of immutable strings to prevent unwanted changes to constant strings.
  6. Benchmarks of the new system over the old system to compare performance changes
So that's the GSoC idea involving immutable strings. It's slightly ambitious, but it is pretty easy to break it up into smaller, easier steps. Interested, motivated, and talented students are definitely encouraged to apply.

Wednesday, February 10, 2010

So Bad It's Funny

At work I've been taking a program written by another engineer and trying to merge the functionality into a program that I've been writing. During this process, I have found some code so bad that it's both funny and depressing at the same time. Here is a great example:
[Update 10 Feb 2010: This code is C#]

public bool CheckMsgValid(byte[] _myArray, int _myIdx) {
    bool t;
    t = ((_myArray[_myIdx] & 0x80) == 0x80) ? true : false;
    return t;
}

...which is basically the long way to write:

public bool CheckMsgValid(byte[] b, int x) {
    return (b[x] & 0x80) != 0;
}

I guess the original coder wanted to make double-sure that the two values were equal.

Tuesday, February 9, 2010

GSoC Idea: Bytecode improvements

Bytecode.

It's a part of the VM that really "just works" and so nobody spends much time playing with it. This is not to say that nobody has touched it, but in my tenure as a Parrot developer I have not seen nearly so much development on the entire subsystem as I have seen in other places. This has been changing recently with some cleanups to the freeze/thaw system, which is related to the bytecode system. There are several packfile-related PMC types waiting in the wings, but Parrot only uses them as a thin wrapper layer to provide PBC access from PIR.

There are several issues that need attention in the realm of Parrot's bytecode. Some of these issues, or combinations thereof, could make for very interesting--and very rewarding--GSoC projects.

The first project that I would like to see is in making Parrot's bytecode more portable. Currently bytecode isn't really portable between different systems like it should be, nor is it portable between different versions. Every time somebody increases the bytecode version number in the PBC_COMPAT file, all previously-generated bytecode files become invalid.

One way to tackle this problem would be to include metadata in the bytecode file. This metadata could include a list of PMC types and their corresponding type numbers, and opcodes and their corresponding index numbers. A verification step at Parrot load could check versions and, if different, could loop over this metadata and attempt to remap old numbers to the new numbers.
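
Here's a sketch of what that load-time verification and remapping might look like. The metadata layout and the oplib_lookup() helper are invented for illustration; nothing like this exists in Parrot today:

/* hypothetical: the PBC file carries (old number, name) pairs for every
   op it uses, and we translate them into this interpreter's numbering */
typedef struct OpMapEntry {
    opcode_t    old_num;    /* the number baked into the bytecode */
    const char *name;       /* the full op name, e.g. "set_args_pc" */
} OpMapEntry;

static int
remap_ops(PARROT_INTERP, const OpMapEntry *map, size_t n, opcode_t *remap)
{
    size_t i;
    for (i = 0; i < n; i++) {
        /* look the op up by name in the running interpreter's op table */
        const INTVAL new_num = oplib_lookup(interp, map[i].name);
        if (new_num < 0)
            return 0;    /* the op no longer exists: refuse to load */
        remap[map[i].old_num] = (opcode_t)new_num;
    }
    return 1;
}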

...And while we're talking about bytecode portability, Parrot needs a "robust testing strategy" for testing and verifying PBC files. I mention this because I see the warning several times per test run, every test run, every day. Seriously, somebody fix this warning!

The packfile system code is terrible and very complicated. We do have those PMC types that act as thin wrappers over the packfile structures and APIs. One thing that I would like to see is a complete refactor of the system that replaces all of the packfile structures with the equivalent, properly-encapsulated, PMCs.

Parrot's bytecode also stores long lists of the constants used in the code. These long lists, unfortunately, often contain duplicates. A post-processing step on the generated bytecode could perform some basic constant folding on the code to decrease the final size of the generated PBC file.

And speaking of constant folding, there are plenty of other optimizations that could be performed on bytecode. Hooks to add optional optimizations would be a most welcome addition, especially if we could notably decrease file size or even improve performance (though performance-improving optimizations would probably be better suited at the PIR or PAST levels). One such optimization that could be useful is a translator routine that converts bytecode files to a "native" format where data is in the correct byte order.

The pbc_dump program, which attempts a basic disassembly of a bytecode file, is woefully inadequate for most needs. Major improvements to this would be nice. A full-featured PBC disassembler that could produce round-trip capable code would be even better. The pbc_merge program similarly is in need of some love. Major improvements to this could be made as part of a larger set of refactors. Tangentially related but still important is a proper PIR->PASM translation utility. IMCC has an option to do this, but produces output code that is known to be incorrect and incomplete.

The freeze/thaw system converts individual PMCs to serialized strings suitable for saving to a file and then loading again at a later time. Better integration of the freeze/thaw code would enable storing all Parrot data, be it bytecode or data, to a similarly-structured file. Eventually we could see the freeze/thaw types be inheritable, so users could define how their data is serialized. Information in the file could direct Parrot which serialization types are required to read the data later. Some ideas of subtypes of the PMC serializer type are serializers that checksum data for later verification, or types which encrypt or obfuscate the data for security purposes.

So these are a few random ideas about projects that have to do with the PBC system. I think this is an area that is ripe for some GSoC-level projects. Also, I think projects in this area would be majorly beneficial to a lot of people and would help to make bytecode a compelling format for shipping portable executable files to end-users. If you like one of these ideas, or if you have ideas of your own to fix these systems, please do let me know!

Monday, February 8, 2010

GSoC Idea: Parrot + RTEMS

By background I'm an embedded systems guy. In college I spent years focusing on microcontroller systems and FPGA hardware. I'm not going to spend the whole post reminiscing about the "good-ole' days" when I still worked down on the metal, but I do want to express my particular fondness for this next GSoC project idea: The Parrot-on-RTEMS port.

Jonathan Leto has been spearheading the project so far, but it's become clear that there are some major deficiencies in Parrot's cross-compilation facilities to enable Parrot to build and run properly on a real-time OS like RTEMS. Also, there are other problems with Parrot's algorithms that prevent it from meeting strict real-time deadlines.

In a message to the mailing list tonight Jonathan put out an official-looking call for interested participants, and also advertised an extremely informational page on the RTEMS wiki.

I will let the RTEMS page discuss most of the good details about what is needed, so I will cut off this blog post here. But if you're an interested student and want more information about this project, head on over to the RTEMS wiki and get in touch with Jonathan Leto.

Sunday, February 7, 2010

ParrotTheory: Threading

Threading is one of those technologies, one of those buzzwords, that is supposed to be the future of computer programming. It's one of those things that is supposed to be all things to all people: improved scalability and performance, better utilization of hardware, and it can even cure the blind. The idea is that threading will allow applications to scale and take advantage of new multicore processors that companies like Intel and AMD are creating. Like any tool, threading can be a huge help to certain people for certain classes of applications. However, threading can also bring with it a large number of problems that actually hamper program performance. To understand this dichotomy it's important to understand what threads are and how they operate.

In a very narrow sense, a thread is an executing context. On a hardware platform like x86 we can define an executing context by knowing the contents of the processor registers, the contents of the stack, and the sequence of machine code instructions that the program is following. The instruction pointer (IP) register in the processor points to the current machine code instruction, and after a command is executed the pointer is updated to point to the next instruction. By taking a snapshot of a context and saving it, we can essentially "freeze" the state. We could save the context to disk, and re-load it later to continue where we left off.

A single-processor CPU core is a very linear device. Machine code instructions are executed in order, starting at a start address and following linearly through memory. The one exception to this, branches, allows control to jump to arbitrary addresses where linear execution continues. When you look at a modern single-core desktop machine it appears that things are happening concurrently and multiple programs are executing at the same time. This is only a facade. The OS breaks execution into time slices where programs execute one at a time. Moving control from one process to another is called a context switch. Each process is allocated individual slices of time, and if the slices are made small enough and context switches happen often enough, it appears to the user that things are happening together in parallel. It's a play on the limitations of human perception.

Threads can be preemptive or cooperative. A cooperative threading system, by far the least common type, passes control to a thread until that thread explicitly relinquishes it. A cooperative system has a significant benefit: the program knows when its execution will be paused and can therefore avoid much non-deterministic behavior. Alternatively, and considerably more common, is preemptive multithreading, where the OS controls thread context switches. The thread executes without any knowledge of other threads and without any notion of cooperation. At a seemingly-random time the OS halts the thread, saves its current execution context, puts the thread into a queue, and loads the next thread to execute. This system brings the benefit that programs can be written without any knowledge of threads and still be run on multiprocessing systems. Also, we have the benefit that no one program can "hog" system resources: the OS makes certain that all programs get a fair opportunity to execute. On the other hand, preemptive multithreading brings a certain amount of non-deterministic behavior and creates whole classes of problems, like memory corruption and deadlocking, that do not exist otherwise.

In a preemptive multithreading system (which is the most common and the only type I will consider here), each thread is a structure in the OS. Threads are managed by a process called the thread scheduler. The thread scheduler is typically implemented as an interrupt handler attached to a hardware timer device. When the hardware timer device sends a trigger signal to the CPU, the scheduler is executed. Here is an example of what a scheduler looks like:

interrupt void ThreadScheduler() {
    ExecutionContext lastctx = SaveCurrentContext();
    ExecutionContext nextctx = GetNextContext();
    Schedule(lastctx);
    LoadContext(nextctx);
}

When the scheduler activates, the current execution context, which consists of the contents of the processor registers and the IDs of the memory pages assigned to that thread, is enqueued, and the next executing context is popped off the queue. We load the necessary pages back into memory and continue execution from the point where it was last interrupted, as if nothing had happened. Multiple threads can be executing in a single process, in which case they share the memory pages with executable code in them. However, each thread is going to have its own stack, and likely its own heap pages as well. This is how multiple threads in a single process are differentiated.

Because of the necessary operations of saving the current context, enqueueing it, dequeueing the next context, and loading that one, threading is inherently slower for linear operations than a non-threaded system is. Threads always carry a performance cost in terms of context switching and a memory cost in terms of maintaining separate stacks and context structures. In addition, the more threads we have, the less frequently each individual thread runs. To understand why, assume we have a system that switches threads 10 times per second. With only one thread, it runs 100% of the time. With 10 threads, each gets one tenth of every second to operate. With 100 threads, each thread gets a one-tenth-of-a-second opportunity to execute only every 10 seconds. These costs are not necessarily large (in many systems they are negligible), but they do exist. In exchange we gain the ability to simplify and encapsulate separate tasks, create the illusion of concurrency, and (most importantly for graphical systems) limit the pauses the user experiences while the system is processing another task.

On multiprocessor or multicore systems, we also gain the benefit that threads can run on separate processors, truly in parallel. In this way, as the number of processor cores increases, so too can the performance of an application, if it uses enough threads to fill those processors. In these situations a program maximizes its throughput when it has as many executing threads as there are available processor cores to run them on. Too many threads and we pay for costly context switches; too few and processor cores lie unused.

Context switches can only happen at instruction boundaries. This means that an individual machine code instruction is atomic: a context switch can happen between machine code instructions but cannot happen in the middle of one. Beyond this guarantee, however, there is no way for the program to determine ahead of time where these switches will happen. This nondeterminism is what causes the bugs.

So what kinds of bugs can be created? Let's consider a structure with two integer values, defined as a psychotic redundancy measure: the two are supposed to always be copies of each other, and if they ever differ the program will freak out and crash. Here's an access routine to change both values at once:

void modify_data(my_struct* s, int newvalue) {
    s->data1 = newvalue;
    s->data2 = newvalue;
}

This seems straightforward. Now consider the (admittedly contrived) case where two threads call this function with the same structure pointer, and a context switch happens between the two statements. Thread 1 updates data1 to 1234 and a context switch happens. Thread 2 updates data1 and data2 to 4567, followed by another switch. Now thread 1 updates data2 to 1234, and the two values are no longer equal. The structure is left in an inconsistent state, the program freaks out, the plane crashes, and all the orphan children die in a fire.
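To watch this happen for real, here is a minimal pthreads harness of my own devising, not from any real codebase: the struct and routine above, plus two competing threads that hammer the same structure. Since the race is timing-dependent, the inconsistency message may take several runs to appear:

#include <pthread.h>
#include <stdio.h>

typedef struct { int data1; int data2; } my_struct;

static my_struct shared = { 0, 0 };

void modify_data(my_struct* s, int newvalue) {
    s->data1 = newvalue;
    s->data2 = newvalue;
}

void* worker(void* arg) {
    int value = *(int*)arg;
    int i;
    for (i = 0; i < 1000000; i++) {
        modify_data(&shared, value);
        /* the redundancy invariant: both copies must always match */
        if (shared.data1 != shared.data2)
            printf("inconsistent: %d != %d\n", shared.data1, shared.data2);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int v1 = 1234, v2 = 4567;
    pthread_create(&t1, NULL, worker, &v1);
    pthread_create(&t2, NULL, worker, &v2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}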

To avoid inconsistencies like this we can introduce any number of concurrency lock primitives, such as mutexes, semaphores, spinlocks, critical sections, or whatever else people create for the purpose. It's beyond the scope of this post to discuss them all individually, so I might save that for another post. Regardless of the exact mechanism, here is what the code looks like with a pthread mutex standing in for whichever primitive we choose:

static pthread_mutex_t lock_object = PTHREAD_MUTEX_INITIALIZER;

void modify_data(my_struct* s, int newvalue) {
    pthread_mutex_lock(&lock_object);    /* wait until the lock is free */
    s->data1 = newvalue;
    s->data2 = newvalue;
    pthread_mutex_unlock(&lock_object);  /* let the next thread in */
}

Now the two threads are prevented from both entering the critical section at the same time. If one thread tries to take the lock before the other thread has released it, we force an immediate context switch so the first thread can finish and exit before the second thread continues.

So here's another good rule about threads: In addition to the cost overheads of switching contexts and managing threads, there are also costs involved in managing access to shared resources. All these costs combined can really make threading a performance drain instead of a performance boon. Plus, the need to properly restrict access to shared resources puts a strain on programmers and can increase the number of bugs in programs. Threads are but one tool in the toolbox of a skilled programmer, and should be used with care.

Saturday, February 6, 2010

Parrot-Data-Structures Benchmarks

When I first conceived of the Parrot-Data-Structures project, I envisioned a place where we could develop performance-optimized PMC types. Part of proving that we've improved performance is providing benchmarks. So, this morning I went through and wrote up some naive benchmarks to compare several of my new PMC types against the venerable ResizablePMCArray. I didn't compare against the FixedPMCArray because it doesn't support push/pop/shift operations, so I wouldn't be able to make direct algorithmic comparisons.

I've only put together one benchmark so far for stacks: we push 1,000,000 items onto the stack and then pop them all back off again in tight loops. This forces the stack to resize up to 1,000,000 items worth of storage. The times were small and I could have upped it to ten million items or more, but then we would start to see more effects from paging to disk and lose insight into the operations we are testing.


(FixedPMCStack) Time to push 1000000: 0.0456011295318604
(FixedPMCStack) Time to pop 1000000: 0.0357608795166016
(FixedPMCStack) Time to total 1000000: 0.0813620090484619

(ResizablePMCStack) Time to push 1000000: 0.0498058795928955
(ResizablePMCStack) Time to pop 1000000: 0.0467569828033447
(ResizablePMCStack) Time to total 1000000: 0.0965628623962402

(ResizablePMCStack2) Time to push 1000000: 0.0470800399780273
(ResizablePMCStack2) Time to pop 1000000: 0.0364069938659668
(ResizablePMCStack2) Time to total 1000000: 0.0834870338439941

(ResizablePMCArray) Time to push 1000000: 0.0531971454620361
(ResizablePMCArray) Time to pop 1000000: 0.0347628593444824
(ResizablePMCArray) Time to total 1000000: 0.0879600048065186


I've shown three of my types as they compare to Parrot's ResizablePMCArray type. FixedPMCStack is fixed-size, so its storage is allocated up-front and never needs to be resized on a push. ResizablePMCStack is a linked list of mini-array chunks; each chunk holds 16 pointers, so we can push 16 objects before a new allocation is needed. ResizablePMCStack2 is an alternate stack implementation that I put together today. It uses a flat memory buffer but does geometric reallocations: starting at 16 objects, every time we run out of space we allocate twice as much (16, 32, 64, etc). The start size and the growth factor can be tuned, though I haven't made any effort to do that. Finally there is the ResizablePMCArray, which resizes its buffer on each push.
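For the curious, the growth strategy in ResizablePMCStack2 amounts to something like the following sketch. This is my own illustrative C, not the actual PMC source; the names are invented:

#include <stdlib.h>

/* Illustrative sketch of geometric growth: double the buffer when full.
   Invented names; not the actual parrot-data-structures code. */
typedef struct {
    void** items;
    size_t used;
    size_t allocated;
} geo_stack;

void geo_push(geo_stack* s, void* item) {
    if (s->used == s->allocated) {
        size_t newsize = s->allocated ? s->allocated * 2 : 16;
        void** grown = realloc(s->items, newsize * sizeof(void*));
        if (grown == NULL)
            abort();  /* out of memory; a real PMC would throw instead */
        s->items = grown;
        s->allocated = newsize;
    }
    s->items[s->used++] = item;
}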

FixedPMCStack obviously wins the competition since it only needs to malloc once and never reallocates or frees memory during the run. At this sample size its advantage over the other types is not huge. ResizablePMCStack2 ekes out a win over ResizablePMCArray for this sample size. However, at smaller samples such as 10,000 objects to push, ResizablePMCArray wins instead. I suspect we could tune the size of allocated chunks in ResizablePMCStack, or the starting allocation and growth factor of ResizablePMCStack2, to alter these results and the threshold where one type becomes less efficient than the other. As the data set gets larger, the geometric growth of RPS2 takes over and severely decreases the number of allocations that need to be made, while for the basic RPS type the size and frequency of allocations is constant.

ResizablePMCArray performs well enough in these benchmarks. It does more allocations than any of my types, but it uses a relatively efficient flat memory buffer to hold its data, so it's not blown out of the running entirely.

Now for the FIFO queue types. For these types, I used 100,000 items in a similar loop (push 100,000 items, then shift them all back off). I didn't use 1,000,000 like I did in the stack tests above, for a reason I will discuss below.

(FixedPMCQueue) Time to push 100000: 0.00739598274230957
(FixedPMCQueue) Time to shift 100000: 0.00774002075195312
(FixedPMCQueue) Time to total 100000: 0.0151360034942627

(ResizablePMCQueue) Time to push 100000: 0.0121519565582275
(ResizablePMCQueue) Time to shift 100000: 0.00611591339111328
(ResizablePMCQueue) Time to total 100000: 0.0182678699493408

(ResizablePMCArray) Time to push 100000: 0.00558805465698242
(ResizablePMCArray) Time to shift 100000: 5.47745704650879
(ResizablePMCArray) Time to total 100000: 5.48304510116577


FixedPMCQueue uses a pre-allocated memory buffer, which is a big saver and makes it the obvious overall winner in this category. However, since it's set up as a ring buffer, each push/shift operation requires a few extra pointer checks that the other types don't need to go through. This is why the ResizablePMCArray wins among all types in raw push performance.

Shift performance is a different issue entirely. FixedPMCQueue isn't the winner here, but it is close. ResizablePMCQueue does just a little bit better, thanks to its efficient linked-list implementation. Even though each linked-list node needs to be free()'d, the implementation of free() in libc is pretty lightweight compared to the cost of malloc(). In fact, from the numbers above it looks like free() is roughly twice as fast as malloc(), all other things being equal.

Shift is where the ResizablePMCArray type loses its composure completely. FixedPMCQueue and ResizablePMCQueue both took about 1.5 hundredths of a second to complete the benchmark; ResizablePMCArray took almost 5.5 seconds to do the same 100,000 shift operations, roughly 360 times as long. And that's not even the worst of it: RPA.shift() makes the whole loop O(n²). When I tried to bump this benchmark up to one million items, it ran for over 10 minutes before I had to kill it with no end in sight. Both my queue types are linear in time, and bumping up to one million items took only 10 times longer for them. Why is ResizablePMCArray quadratic? Because each shift operation requires a memmove that slides every remaining item in the array down a slot: each shift is O(n), so n shifts cost O(n²). This is terrible, and one of the reasons I started the parrot-data-structures project in the first place.
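To make the difference concrete, here is an illustrative C sketch (mine, not the actual PMC code) of the two shift strategies:

#include <string.h>

/* Flat-array shift: removing the head slides every remaining
   element down one slot, so each shift costs O(n). */
void* array_shift(void** items, size_t* count) {
    void* first = items[0];
    memmove(items, items + 1, (*count - 1) * sizeof(void*));
    (*count)--;
    return first;
}

/* Ring-buffer shift: just advance the head index, O(1). */
typedef struct {
    void** items;
    size_t head, count, capacity;
} ring_queue;

void* ring_shift(ring_queue* q) {
    void* first = q->items[q->head];
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    return first;
}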

I plan on adding a few more benchmarks. For instance, rapid-fire push/pop benchmarks that don't cause resizes might be interesting, to isolate the per-primitive operation cost without memory allocation costs mixed in. Your average program doesn't need to push one million items onto a stack or queue, after all. With these benchmarks in hand I'll be able to focus some optimization effort on these types to make them better.

The overall lesson to be learned from this post is this: For stack-like operations the ResizablePMCArray is a decent—though not perfect—tool. For queue-like operations on the other hand, ResizablePMCArray is lousy and should be avoided. At least, it should be avoided until we do some optimization effort to make it better.

Thursday, February 4, 2010

Introducing Parrot-Data-Structures

This week at #parrotsketch the question was raised as to why we had differentiated array-like PMC types at all. Why do we have separate ResizablePMCArrays from ResizableStringArrays and the like? For symmetry, why don't we also have separate PMCHash and IntegerHash types?

After the Array PMC was removed last week, plobsing mentioned that RPA had a complexity of O(n²) for shift/unshift operations while the Array PMC had a much better O(n) complexity for those same operations. I had a few ideas about it, but looking through the code I didn't see any easy ways to really fix the performance problems in RPA. At least, I couldn't think of a way to fix the performance of shift/unshift that wouldn't hurt some other metric. There's a certain trade-off in terms of performance we make when a data type becomes more flexible. Maybe it's good enough to just advertise the fact that certain operations, though available, are known to be sub-optimal.

These two points combined to inspire me to create a new project: Parrot-Data-Structures. Parrot-Data-Structures (PDS for short) will contain a collection of specialized datatypes that are less flexible than the standard ResizablePMCArray is, but which are designed to have particular beneficial properties in return. These beneficial properties may be optimized hot-paths for specific operations or high memory-efficiency for certain types of data sets.

I asked whether these types of PMCs belonged in core or whether I should create a new project. The consensus was that I should start a new project and if the structures were great enough we could consider merging them into trunk.

At the time of this writing I have prototyped three new PMC types:
  1. ResizablePMCStack: This type is a dynamically-sizable stack optimized for push/pop performance. I've implemented the prototype as a linked list of mini-arrays, so we don't need to allocate storage on every push or deallocate it on every pop (see the sketch after this list). Because of its architecture, RPS doesn't offer indexed element access, but it does provide a method to convert it to a normal ResizablePMCArray if indexed access is needed.
  2. FixedPMCStack: This is a stack PMC type that uses fixed-size preallocated storage. Push and pop are going to be faster than even RPS, and because it uses a flat memory buffer indexed element access should be possible at good speeds (though this is not currently implemented).
  3. FixedPMCQueue: This is a standard fixed-size first-in-first-out queue structure, optimized for push/shift performance. I've implemented the prototype as a ring buffer, so we avoid costly memcpy and memmove operations while allowing the storage to slowly snake its way around in memory. Because the contents are always moving around in memory, indexed element access is not possible at a reasonable speed, so I probably won't implement it. I have provided a to_array() method to convert it to a FixedPMCArray if random access is necessary.
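Here is the sketch promised above: an illustrative C rendering of the mini-array-chunk idea behind ResizablePMCStack. The names are mine, and the real PMC has to deal with GC and error cases this toy ignores:

#include <stdlib.h>

#define CHUNK_SIZE 16

/* A stack built from a linked list of 16-slot chunks: pushes only
   malloc once per 16 items, and pops only free once per 16 items.
   Error checking omitted for brevity. */
typedef struct chunk {
    void* items[CHUNK_SIZE];
    struct chunk* prev;
} chunk;

typedef struct {
    chunk* top;   /* chunk currently being filled */
    int used;     /* slots used in the top chunk */
} chunk_stack;

void chunk_push(chunk_stack* s, void* item) {
    if (s->top == NULL || s->used == CHUNK_SIZE) {
        chunk* c = malloc(sizeof(chunk));
        c->prev = s->top;
        s->top = c;
        s->used = 0;
    }
    s->top->items[s->used++] = item;
}

/* Caller must ensure the stack is not empty. */
void* chunk_pop(chunk_stack* s) {
    void* item = s->top->items[--s->used];
    if (s->used == 0 && s->top->prev != NULL) {
        chunk* drained = s->top;   /* free the emptied chunk */
        s->top = drained->prev;
        s->used = CHUNK_SIZE;      /* the chunk below is full */
        free(drained);
    }
    return item;
}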
Next on my immediate TODO list are:
  1. ResizablePMCQueue: I'm not sure how I will implement this one, it certainly can't be a ring buffer like the FPQ type is. It will be optimized for push/shift access like FPQ is, but might take a small hit when reallocations are necessary.
  2. SparsePMCArray: The sparse array works best on data sets where most elements are some default value. Instead of allocating storage for all items, we only allocate storage for the interesting ones (see the sketch after this list). This type is optimized for low memory use on sparse data sets. It's worth mentioning that the Array PMC, which has recently been deleted from trunk, was a sparse array type. However, it was a sufficiently bad implementation that I think it's best to just start over and try again.
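As promised, here is a toy illustration of the sparse idea, again in C of my own invention rather than anything from the project; a real implementation would want something smarter than a linear scan:

#include <stddef.h>

/* Store only the "interesting" (index, value) pairs; every other
   index reads as the default value. Linear scan for simplicity. */
typedef struct {
    int* indexes;
    void** values;
    size_t count;
    void* default_value;
} sparse_array;

void* sparse_get(const sparse_array* a, int index) {
    size_t i;
    for (i = 0; i < a->count; i++)
        if (a->indexes[i] == index)
            return a->values[i];
    return a->default_value;   /* unset slots read as the default */
}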
Less concretely, I've also considered the benefits of providing some additional types, though I'm not sure how far I want to stretch the scope of this project:
  1. PMCHeap: This would be a binary-tree type of structure internally, keeping PMCs sorted according to a provided sort routine. I'm not sure whether I would want this to be more like a "max heap" or a binary search tree.
  2. ThreadSafePMCQueue: This would be a message queue variant that could be used safely across threads.
  3. PriorityPMCQueue: This is a queue type where each element has an assigned priority. Pulling an item off the queue grabs the highest-priority items first. Items of the same priority are retrieved in FIFO order.
  4. PMCGapBuffer: This type is optimized for fast insertions/deletions around a point of focus. Think about a text editor where most text changes happen right near the cursor. In the gap buffer, we would first set a "cursor point", and then we would be able to push/pop from the left side of that cursor, and shift/unshift from the right side of that cursor.
In addition to these potential types I mentioned, there are any number of searchable/sortable data structures we could implement to facilitate fast searching in a collection of elements that aren't keyed. Things like skip lists, jump lists, or Splay trees each have interesting properties that would allow searching for items in a collection in O(log n) time. If people express an interest to me about these kinds of types I might look into writing out prototypes.

These prototypes are all designed for PMC storage currently, though I plan to add similar types for INTVAL, STRING, and FLOATVAL too. That's a lot of PMC types, to be sure, but the specialization could be a huge benefit for applications that need a particular type to perform better on some metric than the Parrot core types can manage. I haven't looked into building or testing these types quite yet, but I expect the build will produce several library files with different combinations of types in them. That way projects can load only the few types they need.

An ultimate goal of this project is to help out Parrot's core. There are two ways this could happen: First, the Parrot team could decide that we don't want a ton of performance-specialized or type-specialized array types in Core, in which case add-on projects like this will be necessary to provide that functionality with good performance. Second, the Parrot team might decide that these types have value and they could be added to Core later.

I actually like the second development path in general: It's a way we can prototype cool new features. Also in the second case, I don't need to worry about the scope of this project expanding too far, because it could always be considered an incubator for interesting, new PMC implementations.

If anybody reading this blog is interested in participating, I've got commit bits aplenty to hand out. If you want to participate in the fun, please let me know.

Tuesday, February 2, 2010

Parrotsketch meeting has moved

The weekly Parrotsketch planning meeting has moved to 20:30 UTC from its traditional timeslot of 18:30 UTC. For a quick reference, here is the new time for people in this random smattering of timezones:
  • 15:30 USA Eastern (EST)
  • 14:30 USA Central (CST)
  • 12:30 USA Pacific (PST)
  • 23:30 Moscow, Russia (MSK)
  • 07:30 Sydney, Australia (EDT)
  • 05:30 Tokyo, Japan
  • 04:30 Beijing, China