Martin J. Dürst
  • @duerst

Martin is a Professor of Computer Science at Aoyama Gakuin University in Japan. He has been one of the main drivers of Internationalization (I18N) and the use of Unicode on the Web and the Internet. He published the first proposals for DNS I18N and NFC character normalization, and is the main author of the W3C Character Model and the IRI specification (RFC 3987). Since 2007, he and his students have contributed to the implementation of Ruby, mostly in the area of I18N.

Squeezing Unicode Names into Ruby Regular Expressions

This talk discusses the future of Ruby regular expressions. Ruby allows matching characters by many Unicode properties. The 'name' property is special and requires special treatment: every character has its own unique name, and names can be 80 or more characters long.
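As a small illustration of the property matching that Ruby regular expressions already offer, the sketch below uses the `\p{...}` syntax for a script and a general category. The sample string and variable names are chosen only for this example; matching by the 'name' property is the subject of the talk and is not shown here.

```ruby
# Sample text mixing Latin letters, Greek letters, and decimal digits.
text = "Ruby αβγ 123"

# \p{Greek} matches characters whose script is Greek.
greek = text.scan(/\p{Greek}/)

# \p{Nd} matches characters in the general category "Decimal Number".
digits = text.scan(/\p{Nd}/)

# \p{Lu} matches uppercase letters.
uppercase = text.scan(/\p{Lu}/)

p greek
p digits
p uppercase
```

Properties like these each cover many characters, whereas a name identifies exactly one character, which is why the name data is so much larger.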

We show how the structure of the names can be used to produce a compact representation of the data that can be searched efficiently. The solution relies on tries and radix trees as data structures, and on taking care to use every single bit of memory. We compare memory requirements and speed with implementations in other languages such as Python, Perl, and Java.
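To give a rough idea of why the structure of the names helps, here is a minimal sketch of a word-level trie that maps names to code points. It is not the implementation presented in the talk: the class `NameTrie` and its methods are hypothetical, it uses ordinary Ruby hashes rather than a bit-packed layout, and it holds only four real character names as sample data.

```ruby
# Hypothetical illustration: a trie keyed by the words of a character name.
class NameTrie
  Node = Struct.new(:children, :codepoint)

  def initialize
    @root = Node.new({}, nil)
  end

  # Insert a name word by word, so that shared prefixes such as
  # "LATIN SMALL LETTER" are stored only once.
  def insert(name, codepoint)
    node = @root
    name.split(" ").each do |word|
      node = (node.children[word] ||= Node.new({}, nil))
    end
    node.codepoint = codepoint
  end

  # Return the code point for a complete name, or nil if it is unknown.
  def lookup(name)
    node = @root
    name.split(" ").each do |word|
      node = node.children[word]
      return nil if node.nil?
    end
    node.codepoint
  end

  # Count all nodes, including the root, to show how much is shared.
  def node_count(node = @root)
    1 + node.children.values.sum { |child| node_count(child) }
  end
end

trie = NameTrie.new
{
  "LATIN SMALL LETTER A"     => 0x61,
  "LATIN SMALL LETTER B"     => 0x62,
  "LATIN CAPITAL LETTER A"   => 0x41,
  "GREEK SMALL LETTER ALPHA" => 0x3B1,
}.each { |name, codepoint| trie.insert(name, codepoint) }

p trie.lookup("LATIN SMALL LETTER B")
p trie.lookup("LATIN SMALL LETTER")
p trie.node_count
```

The four names contain 16 words in total, but the trie needs only 12 word nodes plus the root, because common prefixes are shared. A radix tree goes further by merging chains of single-child nodes into one edge.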