Vale Java? Scala Vala palava

and Go too

By Rick Jelliffe
August 27, 2010 | Comments: 13

Dave Megginson (who drove the development of the SAX API that will be familiar to many XML developers who use Java) recently wrote Java is dead.

Java stood out as a programming language (though not as a platform) in that Sun refused to standardize it through an independent and reputable standards organization: a lot of the hard work was done in one attempt to put it through ECMA and another to put it through ISO, and both times Sun pulled out, eventually producing its highly unsatisfactory JCP (Java Community Process) system. Without the ability to alter Java significantly in ways that might go against Sun's druthers, Java suffered two major forks (Microsoft's J++ and then its C#, and IBM's SWT) where significant players disagreed with a major component (the graphics library). Java succeeded in middleware, but failed to take advantage of the rise of browsers on the desktop: its HTML parser was great for the mid-1990s but was neglected to the point of being unusable, and it is hard not to see this as a deliberate attempt by Sun to leave the browser market to its friends and enemies. I really liked Java, and bet my company on it (in a sense): I would not do that today.

What are the alternatives, for the same kind of desktop space (no flames, I know everyone knows this area is dead dead dead in the age of the WWW and the curated ipadroid) and the XML processing I spend my time in? I have been looking at Scala quite a bit: it integrates into the JVM, has familiar C-based syntax, and allows you to progress from Java-like programs to more functional and DSL-style programs. However, I have three qualms: first, I can already do many kinds of functional programs using XSLT, which is more optimized for XML processing than Scala is; second, its use of the JVM may be a liability given Oracle's lawsuit (or is this just FUD?); third, much of the Scala material is written by academics or proponents of niche languages, and is fairly difficult to approach from the world of conventional programming. (Whenever anyone writes "monad" in discussing the advantages of a technology, its potential popularity plummets in my mind.)

Another contender, which does not suffer these three problems, is Vala. It is a real throwback to the 1980s in a way: an object language sitting on top of C, but informed by the excesses of C++ and the successes of Java. Here is the elevator pitch for Vala from its website:

Vala is a new programming language that aims to bring modern programming language features to GNOME developers without imposing any additional runtime requirements and without using a different ABI compared to applications and libraries written in C.

Vala is based on GNOME's GObject system, and provides the kind of class features you would expect. Like Scala, it has a language-level equivalent to Java Beans' get*() and set*() properties, but it also has language-level support for property change notifications and listeners. It does not have a strong XML story, just a mini parser, but it does seem to have a good Unicode story (strings are UTF-8, and string.get_char() returns a 32-bit unichar character). Probably its main difference from Java at a feature level is that it has a reference-counting system for object de-allocation, which gives more deterministic real-time behaviour than Java's GC but opens the door to some programming errors. The other oddity, in this age, is that it does not do bounds checking on array accesses: so a little more discipline is required than in Java (but presumably a lot less than in C).

I would expect Vala to interest anyone considering writing a new C or C++ application, as well as C, C++, C# or disgruntled Java programmers wanting to avoid the clutches of the large corporations, especially as it has a pretty strong and proven (OS-neutral?) platform behind it (i.e. the might of GNOME). As with Scala, you would want to check whether the IDE support was adequate before launching into a big project, of course.

On the server side, a promising language is Google's Go: it is less object-oriented, but still in that efficient C space. Their elevator pitch:

fast...concurrent...safe...fun...open source

Go features the keyword go for spawning goroutines, which encapsulate the various fibre/thread/process/multicore kinds of parallelization.
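To make that concrete, here is a minimal sketch of fanning work out to goroutines and collecting the results over a channel; the worker, its name, and the values are my own invention for illustration, not anything from Go's documentation:

```go
package main

import (
	"fmt"
	"sync"
)

// sumOfSquares fans n tasks out to n goroutines and collects
// their results over a buffered channel.
func sumOfSquares(n int) int {
	var wg sync.WaitGroup
	out := make(chan int, n)
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(v int) { // "go" spawns a goroutine; the runtime schedules them
			defer wg.Done()
			out <- v * v
		}(i)
	}
	wg.Wait()
	close(out)
	sum := 0
	for v := range out {
		sum += v
	}
	return sum
}

func main() {
	fmt.Println(sumOfSquares(4)) // 1 + 4 + 9 + 16 = 30
}
```

The point of the pitch is how little ceremony this takes: no thread objects, no explicit pool, just the keyword and a channel to gather results.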

I'd love to see the lovechild of Go and Vala: I don't see that Vala and Scala have a very good story for taking advantage of multicore CPUs.



13 Comments

For anyone interested in cross-platform development I would suggest the Qt framework [1], which is written in C++ but has several bindings.

The framework is distributed under three licenses, so you ought to choose the one that best fits your needs and interests:
http://qt.nokia.com/products/licensing/

[1] http://qt.nokia.com/

Another contender in the Vala / Go line of approaches may be the Rust programming language:

http://blog.mozilla.com/graydon/2010/10/02/rust-progress/

Well, I definitely recommend against Rust.

It has a policy of ASCII-only lexemes. It is just plain ignorant to say that non-English programmers always write in ASCII. (Just as it would be ignorant to say that they never do.) It is that kind of rather blithe dismissal of the need to support foreign cultures and languages that creates extra, unnecessary barriers. That argument ran out of legs in the early 1990s: all platforms have well-established Unicode libraries with serviceable character properties for this.

Scala is very good in this regard, by the way.

On the issue of whether foreigners use ASCII lexemes, why is the question even asked? Why is it up to the language designer in their armchair to decide? Let the users decide what is appropriate: it should be a decision determined by the business situation of the development effort (what programmers they have, how proficient those programmers are in different languages, whether variable names should be aligned with the standard terminology of the business).

I think Rust's use of ASCII-only lexemes comes down to the fact that it's an experimental language and they're currently trying to avoid complexity in the compiler. (There's no standard library yet, and they only added the 'const' keyword a few months ago!) If people want it, I'm sure it will be added. As it currently stands, Unicode can still be used in comments and strings, etc.

@Dave: What they are doing is, in effect, discouraging experimentation by non-Latin-using developers! It is what you would expect in the 1980s, not in the 2010s.

What is the difference in complexity in allowing any non-ASCII character in a lexeme by default rather than banning them? Basically none, if they are tokenizing using whitespace and punctuation.

If the Rusties have made the dubious decision to have 8-bit chars, they are just setting themselves up for a rewrite and pain down the road when they have to move to UTF-16 (or they could allow UTF-8 and keep their 8-bit assumptions).

FWIW, there's no longer an ASCII-only restriction on identifiers. The new identifier syntax is modelled on that of Python 3, and is a superset of Java's and C's identifier syntax.

Although strings are UTF-8, interestingly the char type is UCS4.

Rust only restricts identifiers to ASCII; in strings and comments one can use full Unicode.

In my personal experience, Russian programmers do not use Cyrillic letters in identifiers in Java. And even in JS this is avoided. Sometimes this is dictated by company policy, sometimes it comes from security concerns. For example, the following JS fragment alerts 1, not 2:

var counter = 1;
++cоunter;
alert(counter);

Can you see the reason why?

(In the example Igor provides, there is an issue with the o, where some transcoding systems may convert this to ++c?unter or something else. This is why new languages should start with UTF-8, and fail aggressively when bad encoding signatures are detected.)

I think the argument would be more convincing if Russian programmers had coded in ASCII before the end of the Cold War :-) Certainly there may be many reasons to *choose* ASCII in various circumstances.

However, as I said, this should not be the choice of the language designer, but of the language's users (company policy, developers, etc., as you mention).

I have certainly come across programmers who prefer to use identifiers in a different script from text, since it is visually clearer too. But I have also come across programmers (in China, Japan, and Korea) who choose to use identifiers in their own scripts and languages. (I wonder if the issue is particularly acute in ideographic languages, where spelling out the word in Latin letters is not necessarily enough of a hint to understand what word is intended.)

Rick, there's no issue with "transcoding" here. What you refer to as "the second o" is in fact NOT the English letter "o", but instead a totally different codepoint. That's right: Unicode (of which UTF-8 is an encoding) provides TWO codepoints (in fact, more than two) for what you think of as the letter "o".
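The two code points are easy to inspect directly. A minimal sketch in Go (used here only for illustration; the original snippet was JavaScript) printing the scalar values of the Latin and Cyrillic letters:

```go
package main

import "fmt"

func main() {
	latin := 'o'    // U+006F LATIN SMALL LETTER O
	cyrillic := 'о' // U+043E CYRILLIC SMALL LETTER O
	fmt.Printf("U+%04X vs U+%04X, equal: %v\n", latin, cyrillic, latin == cyrillic)
	// prints "U+006F vs U+043E, equal: false"
}
```

The two characters render identically in most fonts, but to the compiler they are distinct scalar values, which is exactly why `counter` and `cоunter` are different variables.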

Unicode is *hella* complicated. Consider the following snippet:

a = (b+d); // adds b and d, right?

Now consider the following snippet:

a = (bed); // refers to the variable named "bed", right?

Now consider the following four snippets:

a = (béd); // refers to the variable named "béd", right? Note that é is U+00E9.
a = (béd); // does this refer to a DIFFERENT variable named "béd" (using the combining acute accent U+0301 plus the ASCII letter e), or the same one?
a = (b≠d); // is this the "not equal" operator? a syntax error? or the variable named "b≠d"?
a = (b襌d); // is this really a variable named "b襌d"?

Few of these questions have easy and intuitive answers. The C99 standard made a valiant attempt to deal with Unicode identifiers, basically by creating a huge whitelist of all the "alphabetic-like" and "numeric-like" characters in Unicode, and disallowing things like combining accents... but the C99 work still left a huge number of weird corner cases.

The rule "ASCII letters only" may seem limiting and un-PC to you, but from a programmer's perspective it's very clear and easy to implement. Unicode would open many cans of worms simultaneously for no good reason.

Hi, Rust designer here (graydon).

I realize this is an issue that sets off cultural-imperialism bells; I hope to avoid conveying the worst of that. Indeed, I started out assuming we'd go with "full unicode" identifiers, and only reverted to ASCII when writing that section of the manual and codifying the lexer in some detail, trying to decide what "full unicode" would or should mean in this context, and doing some further research. But I realize that I'm an English speaker by default, and so my assumptions and linguistic privileges may well be bleeding into my decision-making; I'm happy to be told I'm wrong on this matter and to follow someone else's data. But let me lay out the research and the reasoning. If you have opposite evidence or practices, I'm all ears. Language is very political and very personal, and I'm absolutely willing to adapt to those who feel more strongly about this than I do.

- Searching the public code archives, I couldn't find any examples of people using the non-ASCII range of identifiers in Java or JS, the two main languages that support "more than ASCII" identifiers. I'm happy to be shown such examples.

- Searching the rules in other languages, I find things like Java's rule, which defers to Character.isLetter(), which defers to the five 'L' unicode character classes. As the designers of Go have learned, this is not adequate for the linguistic task they're trying to accomplish. Unicode is a big, complicated concept, and requires careful study to handle linguistic tasks in unicode-aware form. So I actually don't know of a consensus on unicode rules that are "correct" for identifiers; possibly the link James Clark provides is best?

- Lexing octet-at-a-time is really helpful for having a lexer go fast. Lexing is a major cost center in a compiler. Strings and chars can be lexed octet-at-a-time even when carrying unicode because we have octet boundaries. Identifiers would require more work; possibly even normalization forms. Possibly we could isolate that work in cold paths off the 'main' ASCII-focused identifier recognizer; it's not a *strong* argument, but speed is always a consideration in compilation strategies.
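The fast-path/cold-path split described above can be sketched in a few lines. Assuming UTF-8 input, ASCII bytes are consumed directly and only the rare multi-byte sequences pay for decoding (the function name here is hypothetical, not from any real lexer):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// nextRune returns the rune starting at byte offset i and the offset
// of the following rune. ASCII bytes take the one-octet fast path;
// anything >= 0x80 falls through to full UTF-8 decoding.
func nextRune(s string, i int) (rune, int) {
	if b := s[i]; b < utf8.RuneSelf { // fast path: a single octet
		return rune(b), i + 1
	}
	r, size := utf8.DecodeRuneInString(s[i:]) // cold path: multi-byte sequence
	return r, i + size
}

func main() {
	s := "cоunter" // the second letter is Cyrillic U+043E: two bytes in UTF-8
	for i := 0; i < len(s); {
		r, next := nextRune(s, i)
		fmt.Printf("%q at byte %d\n", r, i)
		i = next
	}
}
```

An ASCII-only identifier grammar lets the whole identifier loop stay on the one-octet branch; admitting Unicode identifiers means the cold path (and possibly normalization) leaks into one of the hottest loops in the compiler.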

- Homographic confusion is a real issue. As Igor points out, once you leave ASCII and enter the full L-range of unicode, it gets difficult to tell how to type something you see elsewhere in code, or whether two identifiers are 'the same'. This is an issue that bites all systems that try to use unicode in a 'symbolic' setting; you should at least understand what nameprep is trying to defend against before deciding what to do here. It's complicated.

- Controlling syntax flamewars is a central problem of doing language development. They're the one thing that spirals out of control in any language development task. Merely discussing changes to semicolon rules or whitespace rules can take up several hundred messages, easily. Even talking about the use of a unicode rightward-arrow cost the Go designers 72 messages of churn.

- I've actually seen non-English programmers react negatively to 'unicode identifiers', calling it as technically worse than useless or, in extreme cases, heard it likened to a sort of cultural tokenism, an appropriation of other language users' writing systems in a medium that doesn't admit any of the complexity they actually face when trying to integrate their linguistic practice with that of the language they wish to write in.

So overall, I decided to go with what I knew would work and seemed well-supported by the evidence I could find. I apologize if this is too crude. If you happen to know of better information about the preferences and activities of non-English programmers, by all means, point me to it. Post to our mailing list or send me links personally or whatever!

@Graydon

Thanks for taking the time to respond!

You are worried that Unicode is complex, requires study, and will open the door to mad threads. But the Unicode Consortium's guidelines http://unicode.org/reports/tr31/ are specifically designed so you don't need to go beyond your expertise nor engage in debate. Hundreds of experts from industry, academia and the user community sorted out something workable last century. You just need to implement section 2 (the part before 2.1, even), which has
<identifier> := <ID_Start> <ID_Continue>*
If someone says "I think it is wrong" you say "Tell it to the Unicode people, not me".
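That rule really is mechanical: the character database does all the work. A rough Go approximation (ignoring TR31's exclusion of Pattern_Syntax characters and any profile tailoring, so this is a sketch with my own function names, not a conforming implementation):

```go
package main

import (
	"fmt"
	"unicode"
)

// isIDStart approximates UAX #31 ID_Start: letters, letter-like
// numbers, plus the grandfathered Other_ID_Start characters.
func isIDStart(r rune) bool {
	return unicode.In(r, unicode.L, unicode.Nl, unicode.Other_ID_Start)
}

// isIDContinue approximates ID_Continue: ID_Start plus combining
// marks, digits, and connector punctuation such as the underscore.
func isIDContinue(r rune) bool {
	return isIDStart(r) ||
		unicode.In(r, unicode.Mn, unicode.Mc, unicode.Nd, unicode.Pc, unicode.Other_ID_Continue)
}

// isIdentifier checks <identifier> := <ID_Start> <ID_Continue>*
func isIdentifier(s string) bool {
	for i, r := range s {
		if i == 0 && !isIDStart(r) {
			return false
		}
		if i > 0 && !isIDContinue(r) {
			return false
		}
	}
	return len(s) > 0
}

func main() {
	fmt.Println(isIdentifier("counter"))  // ASCII still works: true
	fmt.Println(isIdentifier("счётчик")) // Cyrillic letters: true
	fmt.Println(isIdentifier("1abc"))    // digit start: false
}
```

Note that this only answers "is it an identifier?"; the separate homograph question (whether two identifiers that look the same *are* the same) is what normalization forms address.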

You mention the IDN RFC: but remember that concerns the difficulty of retrofitting international requirements onto an ASCII legacy: perhaps what you are building for your future! (And the issue of Go and arrows is not an issue of internationalization at all, more a red herring: of course, Scala allows all sorts of symbols in identifiers.)

The issue is not cultural imperialism, IMHO. The issue is that technology that does not create a level playing field is not, unfortunately, neutral, but positively creates disabilities.

Doorknobs that are too high create a disability for short people, for example: the problem isn't with the short people, it is with the doorknobs. The short people were doing fine until they came to your door. Boy, those short people are damn happy, we might even think. You may indeed find there are big people who deny that short people exist, or who say short people would find it condescending to have doorknobs within their reach, or that they just need better jumping skills, or that life is tough, and so on.

If you look at, e.g., http://inabrenner.de/pdf/SCJP.pdf you can see that the identifiers used are German. In the comments they use the full German alphabet; in the code they use, presumably for prudence, ASCII only. That is nice for German, because it has ASCII conventions (ue for u umlaut, and so on). But that they want to use their native language is beyond doubt; and that they have no trouble typing it in, or could cut and paste it, is also beyond doubt (foreigners can cut and paste).

I have worked in Japan and Taiwan. I have seen that programmers use their own language and characters in identifiers when they have the opportunity, depending on their preferences. And I know that it is important for teaching students, but that the more someone is familiar with English (or has to use crapulous ASCII-only software, or has a foreign market), the more they may *choose* to use English. It should be their choice, not yours: if you think it is condescending to allow someone to use their own language, please write all your code in Chinese (assuming you don't know much Chinese), or in some writing system you only barely know, from now on, and see if you still hold that opinion ;-)

I am not sure about the link to James Clark you mention. I have worked with James for almost 20 years, through SGML and XML days. I instigated a project that fed into the XML 1.0 naming rules. Please don't let the arcane discussions become an excuse: the Unicode rules are a fine thing to adopt.

On the availability of non-ASCII in public code libraries, I think you will find that people are very polite in not putting out code in languages that other people won't be able to read. You might like to extend the courtesy the other way...

[[It strikes me that you may not be aware that some languages have words that are not in others. English is large and pretty good in that regard. But, for example, we have no exact word for the Japanese "chome" as used in addresses. Is it district, or is it block (and if it is block, what is "ban"?) The more that local concepts are being expressed, the more that local names are useful.]]

I appreciate the comments. I continued fishing yesterday for advice in our community on this matter, and was eventually directed to the threads of discussion surrounding the same issue when it came up in the recent python 3000 design effort, in particular PEP 3131, which lays out a lot of more-nuanced argument and discussion, and concludes in favour of UAX 31, as you recommend here (and as James Clark recommended! I wasn't being facetious in citing him, I was considering his recommendation in the Go BTS as the clearest advice on post-Java-unicode-mistakes I've seen yet.)

Given that degree of consensus (yours, James', the majority opinion on PEP 3131), I'm completely willing to switch the lexer over to one of the UAX 31 rules. I'm not sure whether R1 or R2 makes more sense, but one of those approaches. Clearly I did not look deeply enough into the issue last time around.

It'll be a little while before we have the person-power available to implement that in full in the self-hosted compiler (we have not implemented proper UTF-8 support *anywhere* in it yet; only the bootstrap compiler understands it so far), but rest assured I have every intention of satisfying the normal needs of non-English programming shops. I just had a misunderstanding of what counts as "normal" in those contexts, given a too-limited sampling of opinions. Thanks for the correction.

Fwiw, I've updated our docs to reflect this advice, removed the now-obsolete FAQ entry concerning ASCII-range identifiers, and filed a bug against rustc to finish off this portion of the lexer once rustc has stabilized a bit more (it can't even lex floating point literals at the moment).

Please let me know if you've further input on adapting the language to the needs of non-English programmers. I'd very much like to accommodate a wide audience.
