l10nSl - SL lokalizacija: Thar She Blows - Quantitative Look at the Localize.Mozilla.org site

Introduction

Note: this is a draft version of the article and will be changed or expanded in the future. Stay in touch ;) The intention of this communication is to take a look into the rear-view mirror presented at this site and to learn how we get the work done.

A number of localization projects - SuMo, AddOns and MDN, to name the biggest - run via the Localize/Verbatim site:

Here are the basic statistics of the work in progress:

active projects: 13
languages involved: 85
number of active l10n projects: 335

The data discussed and analyzed here have been culled from this site. I apologize in advance for possible inconsistencies. Note also that the site is alive and kicking and changes all the time. Still, I do not think such changes would alter the results or the overall impression presented here.

About projects

Here's the diagram of running projects (sorted by percent finished):

Fig 1: Degree of completion - overall in %

The bars above are all scaled to the same total size of 100% on the web site. Using a different metric, i.e. the number of words involved, the situation looks quite different:

Fig 2: Degree of completion - overall in words

This is a much more relevant point of view - after all, we localize words and not percents. Note that Addons Mozilla (325k) and Support Mozilla (190k) together contain twice as much red (unfinished) as there is green (finished) in all the other projects.
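The difference between the two views can be illustrated with a minimal sketch. The project names and word counts below are hypothetical and chosen only to show the effect; they are not taken from the site.

```python
# Hypothetical figures for illustration only; they are not taken from the site.
projects = {
    "small project": {"total_words": 121, "translated_words": 121},
    "large project": {"total_words": 10_000, "translated_words": 1_000},
}

def mean_percent_complete(projects):
    """Average of per-project percentages (every bar counts equally)."""
    percents = [
        100 * p["translated_words"] / p["total_words"]
        for p in projects.values()
    ]
    return sum(percents) / len(percents)

def word_weighted_percent_complete(projects):
    """Completion measured in words: translated words over all words."""
    total = sum(p["total_words"] for p in projects.values())
    done = sum(p["translated_words"] for p in projects.values())
    return 100 * done / total

print(round(mean_percent_complete(projects), 1))           # prints 55.0
print(round(word_weighted_percent_complete(projects), 1))  # prints 11.1
```

A finished small project lifts the percent-based average a lot, while it barely moves the word-based figure.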

For every project and language involved, the date and time of the last access are also recorded. Below is the amount of unfinished work in words vs the date of last access:

Fig 3: Amount of unfinished work vs date of last access

For 41% of the unfinished business the date of the last access is unknown. However, given the level of "unfinished" for these records one can safely assume they have never been really active. The following diagram shows the average levels of project completion for the four date/time slots:

Fig 4: Percent unfinished vs date of last access

The "unknown" slot (leftmost column) could indeed be named "never". The other three time slots tell quite a different story: they paint a much greener picture. The rising trend shows the site is alive: fresh new projects turn up to be localized, while older projects may be getting a fresh layer of paint.

About languages

The above analysis gives an overall picture. What can we say about the different languages and their record? The analysis below is based on the following data for each language:

the number of ongoing projects
the total number of words in the portfolio
the average completion level

Altogether, 335 records for 85 languages (Afrikaans to White Russian) x 12 projects (Addons to Webify me) have been clustered using Ward's method:

Fig 4: dendrogram of active languages

There are clearly at least four clusters present, so the decision was to ask for 6 clusters, which turned out to be a pretty good guess. Based on their average properties, the six clusters were named as follows:

cluster	# languages
Window shoppers	32
Tyre kickers	11
The cautious	7
VTOLs	12
Working class heroes	15
The awesome	8
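For readers who want to try the clustering step themselves, here is a minimal pure-Python sketch of agglomerative clustering with Ward's criterion. It is not the tool used for this article. The six records are made up for illustration, the feature scaling is an assumption, and the sketch asks for 3 clusters rather than 6 because of the tiny sample.

```python
from itertools import combinations

# Hypothetical records: (active projects, portfolio share, completion level),
# the last two as fractions. These values are made up for illustration.
records = {
    "lang-a": (1, 0.01, 0.01),
    "lang-b": (2, 0.02, 0.02),
    "lang-c": (10, 0.90, 0.92),
    "lang-d": (11, 0.88, 0.95),
    "lang-e": (1, 0.01, 0.94),
    "lang-f": (2, 0.02, 0.95),
}
MAX_PROJECTS = 12  # number of projects quoted in the text

def scale(record):
    """Put all three features on a 0..1 scale (an assumed normalization)."""
    n_projects, share, completion = record
    return (n_projects / MAX_PROJECTS, share, completion)

def merge_cost(a, b):
    """Ward's criterion: growth of the within-cluster sum of squares on merging."""
    (members_a, centroid_a), (members_b, centroid_b) = a, b
    na, nb = len(members_a), len(members_b)
    dist2 = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return na * nb / (na + nb) * dist2

def ward_clusters(points, k):
    """Merge the cheapest pair of clusters until only k clusters remain."""
    clusters = [([i], tuple(p)) for i, p in enumerate(points)]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        (ma, ca), (mb, cb) = clusters[i], clusters[j]
        na, nb = len(ma), len(mb)
        centroid = tuple((na * x + nb * y) / (na + nb) for x, y in zip(ca, cb))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append((ma + mb, centroid))
    return [sorted(members) for members, _ in clusters]

names = list(records)
groups = ward_clusters([scale(records[n]) for n in names], k=3)
for group in groups:
    print([names[i] for i in group])
```

With real data one would feed in all 85 languages and ask for 6 clusters. The brute-force pair search is slow but adequate for a data set of this size.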

Let us have a closer look at the suggested clusters:

Window shoppers: completion level ~1%, 1-2 projects of small size. The favourite project in this group is Bugzilla Component Descriptions, with 121 words to be translated.
Tyre kickers: they have 4-5 projects on average, about 25% completed, and are working on a portfolio half the possible maximum size.
The cautious: one would be tempted to include them in the window shoppers' group (1-2 small projects), but their average completion level is a respectable 94%. One can expect they will come back for more.
VTOLs: they have 3-4 projects amounting to ~20% of the total, with a completion level close to 50%. They are taking off.
Working class heroes: a group with a distinctly higher number of projects (6-7), an average portfolio size above 50% and an attractive average completion level of ~70%.
The awesome: they have ~10 active projects, amounting to close to 90% of the total portfolio, with an average completion level of 92%.

Note: for the curious who want to know which cluster their language falls into:

check the number of active projects for your language
check their total size, expressed as percent of the total portfolio size (57313)
check your overall completion level
consult the description above
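The lookup steps above can be sketched as a nearest-profile match. This is only a rough illustration, not the clustering itself. The profile values are midpoints read off the cluster descriptions, and the 1% portfolio share used for "small" portfolios and the 55% used for "above 50%" are assumptions.

```python
MAX_PROJECTS = 12  # number of projects quoted in the text

# Rough profiles: (active projects, portfolio share in %, completion in %).
# Midpoints of the described ranges; "small" portfolio taken as 1% (assumption).
PROFILES = {
    "Window shoppers": (1.5, 1, 1),
    "Tyre kickers": (4.5, 50, 25),
    "The cautious": (1.5, 1, 94),
    "VTOLs": (3.5, 20, 50),
    "Working class heroes": (6.5, 55, 70),
    "The awesome": (10, 90, 92),
}

def guess_cluster(n_projects, portfolio_share, completion):
    """Return the name of the profile closest to the given language record."""
    def scaled(profile):
        # Express the project count in percent of the maximum, like the rest.
        n, share, done = profile
        return (100 * n / MAX_PROJECTS, share, done)

    target = scaled((n_projects, portfolio_share, completion))

    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(scaled(PROFILES[name]), target))

    return min(PROFILES, key=dist2)

print(guess_cluster(10, 88, 93))  # prints The awesome
print(guess_cluster(2, 1, 95))    # prints The cautious
```

A language sitting between two profiles may land in either, so treat the answer as a hint rather than a verdict.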

The difference between clusters goes deeper and it involves the internal workings of a given localization group and its mode of operation. More on that later.

Who done it!? - about language teams

The analysis in this section is based on the on-line activity log, available for every language (example: SL, AMO project):

Fig 5: example of the on-line activity log

The totals for the language group can serve as an indicator of the on-line activity of the language group or of an individual. Note that the Verbatim site also serves as a VCS gate, and the batch-mode activity of translators is not recorded here - as we will see, it can be gauged only indirectly.

Fig 6: average on-line activity (left) and number of projects (right) for the clusters

The top two clusters have significantly higher on-line activity, both in terms of events (left) and in the number of projects active on-line (right).

Behind all these numbers are people - altogether ~460 participants. Let us assume there are two projects running for the XY language, with Alice doing the majority of the work for project #1 and Bob for project #2. Their shares of the work are, for instance, 80-20 in favour of Alice for project #1 and 30-70 in favour of Bob for project #2. The maximum share (80% in this case), or rather its complement (100 - 80 = 20%), is a measure of collaboration within the language group and is shown below for the language clusters:

Fig 7: cooperativity level within the language clusters

In case of the four clusters on the left it does not make much sense to talk about cooperativity, due to the number of projects involved (1 or 2, maximum 4 for tyre kickers). The two clusters on the right hand side indicate however a growing level of cooperation, accompanied by the specialization and spreading of tasks.
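The cooperativity measure from the Alice and Bob example can be written down in a few lines. This is a minimal sketch of my reading of it: take the largest single-contributor share over all projects of a language and subtract it from 100. Treating "80-20" and "30-70" as Alice/Bob splits is an assumption.

```python
# Shares of work per project in percent, following the Alice/Bob example.
# Reading "80-20" and "30-70" as Alice/Bob splits is an assumption.
shares_by_project = {
    "project #1": {"Alice": 80, "Bob": 20},
    "project #2": {"Alice": 30, "Bob": 70},
}

def cooperativity(shares_by_project):
    """100 minus the largest share held by one contributor in any project."""
    top_share = max(
        max(shares.values()) for shares in shares_by_project.values()
    )
    return 100 - top_share

print(cooperativity(shares_by_project))  # prints 20
```

A language with a single contributor per project scores 0, which is why the measure says little for clusters with only one or two projects.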

Verbatim is just a part of the Mozilla l10n landscape!

Here's an off-the-cuff summary of the localizable material within Mozilla:

Fig 8: Mozilla l10n segmentation

The material discussed in this article is about a quarter of the total task. It is also second in size only to the product localization. All this can mean different things to different people / clusters:

Window shoppers: they are surely involved with the product localization, at least with localizing Firefox. Their level of completion cannot be much different there.
Tyre kickers: ditto, possibly with first tyre-kicking visits to TB and SuMo support articles
The cautious: they may branch out - or may have already done so
VTOLs: may unexpectedly come out at the top some time soon
Working class heroes: trying to do all of it with a mixed level of success
The awesome: doing it all and succeeding, but with teething pains

The immediate problem facing anybody branching out of any given segment in Fig 8 is the change of the working environment:

segment	repository env.	file types
Mozilla SL - aurora	HG	Dtd, Properties
Verbatim	CVS (verbatim)	Po
SuMo articles	Django online	UTF8 + syntax
Web contents	SVN	Html

The localizer, whose qualification should above all be of a linguistic nature, should thus also be a wizard in all kinds of incompatible environments. There are not many of that kind in any locale, and below some critical language size I am afraid there are none. Of course tyre kicking and window shopping do not require that much specialization...

Conclusions

Every one of the described language group types has its own specific immediate problems, with the common one of "too much work and not enough time and manpower". Here are my personal suggestions:

Window shoppers, tyre kickers

It is easy to make a commitment, but the road gets more and more slippery with the time spent on the job. They should focus on short projects - even if slightly ridiculous, Bugzilla Component Descriptions is a perfect project to kick off with, with Mozillians and Affiliates next on the list. There should be more projects like this.

The cautious, VTOLs

I am sure that at least the cautious will not jump right into, say, Mozilla AddOns. Some guidelines on what gives localizers more bang for their buck would be welcome. Support articles are a good example of that: the number of articles is humongous by now, but doing the first 10 or 20 covers a lot of ground. Unfortunately this is not so in some other cases - see the Fx, TB, AMO and MDN projects.

Working class heroes

They should look for fresh members in the l10n team and for productivity tools. The way to higher levels in any case involves building a team and a division of tasks, plus a high level of cooperativity among the members of the team. "Owning" some project is a sure way down into home-made problems.

The awesome

The localizers need to hear their story. And Mozilla should listen closely when they raise their hand.

Notes and comments

About half a dozen of the active contributors have a mozilla.com email address. I have ignored two of them, lymanknap and smalolepszy@mozilla.com, for obvious reasons. The rest (eakhgari, fwenzel, pfinch and yliu) I left in the list. Their contribution is maybe negligible but not to be ignored.
Some contributors turn up under different names (for instance besnik@programeshqip.org and besnikm). I have not tried to recode them.
What is "the total size of portfolio"? To use the same metric for all projects, the total size is the Verbatim total size and not the total size of the projects active for a given language group.
Segment sizes (Verbatim is just a part of the story): as per the OmegaT stats summary. Aurora does not include SeaMonkey. SuMo includes Fx support, but not TB support articles.

Sunday, 6 November 2011
