Vocaloids

October 12th, 2011

Vocaloid is a singing synthesizer application. Its signal processing part (concatenative synthesis) was developed through a joint research project between the Pompeu Fabra University in Spain and Japan’s Yamaha Corporation, which developed the software into a commercial product. Vocaloid enables users to synthesize singing by typing in lyrics and melody. The main parts of the Vocaloid system are the Score Editor, the Singer Library and the Synthesis Engine. The project started in 2000, the first commercial Vocaloid version was presented by Yamaha at the Musikmesse in Germany in 2003, and Vocaloid version 3 was launched in October 2011.

Each Vocaloid is sold as “a singer in a box” designed to act as a replacement for an actual singer. Today seven studios are involved in the production and distribution of Vocaloids: three of them create English Vocaloids, while the other four create only Japanese Vocaloids.

  • Zero-G (English virtual vocalists) : Zero-G Limited was founded in 1990, trading under the name Time+Space, by Ed Stratton and Julie Stratton. Zero-G rapidly became the largest distributor of soundware in the UK and one of the most critically acclaimed sound developers in the world.
  • PowerFX (English virtual vocalists) : PowerFX is a small recording company based in Stockholm, Sweden. The company has been producing music samples, loops and sound effects since 1995.
  • Crypton Future Music (Japanese and English virtual vocalists) : Crypton is a media company based in Sapporo, Japan, created in 1995. It develops, imports and sells products for music, such as sound generator software, sampling CDs and DVDs, and sound effect and background music libraries.
  • Internet Co. Ltd. (Japanese virtual vocalists) : Internet Co. is a software company based in Osaka, Japan. It is best known for the music sequencer Singer Song Writer and for Niconico Movie Maker for the video sharing website Nico Nico Douga.
  • AH Software (Japanese virtual vocalists) : AH-Software is the software brand of AHS Co., Ltd., an importer of digital audio workstations and encoders in Tokyo, Japan. It is also known as the developer of Voiceroid, a speech synthesizer application only available in the Japanese language.
  • Bplats (Japanese virtual vocalists) : Bplats, Inc. is an application service provider (ASP) based in Tokyo, Japan. The company offers Software as a Service (SaaS) and Platform as a Service (PaaS) solutions, such as the Vocaloid series VY1 and a Vocaloid online shop.
  • Ki/oon Records (Japanese virtual vocalists) : Ki/oon Records is a Japanese record label, a subsidiary of Sony Music Japan.

Hatsune Miku (Crypton)

Kagamine Rin & Len (Crypton)

Leon (Zero-G)

Sonika (Zero-G)

Big AL (PowerFX)

Nekomura Iroha (AH-Software)


A complete list of the Vocaloid products is available at the Wiki website. The marketing of the Vocaloids is done by the studios.

Just like any music synthesizer, the software is treated as a musical instrument and the vocals as sound belonging to the software user. The mascots for the software can be used to create vocals for commercial or non-commercial use as long as the vocals do not offend public policy. On the other hand, the copyrights to the mascot images and names belong to their respective studios, and they cannot be used without the consent of the studio that owns them.

There are a number of derivative products, for example Vocaloid-Flex, Vocal Listener, Miku Miku Dance, Project Diva and MMDAgent. An online Vocaloid service (NetVocaloid) in English and Japanese is available at the Y2 Project website.

The virtual vocalists shown above are the most famous.

A number of figurines and plush dolls have been released for some of these singers, and some have their own Twitter, Facebook and MySpace accounts.

In Japan, Vocaloids have had a great cultural impact and have many legal implications. Vocaloid music is available on CDs, iTunes, AmazonMP3, etc. Open-air concerts with virtual vocalists have recently been organized with great success :

  • 1st live concert (Animelo Summer Live) : August 22, 2009, Saitama Super Arena, Saitama, Japan
  • 2nd live concert (Mikufes 09) : August 31, 2009
  • 1st overseas concert (Anime Festival Asia) : November 21, 2009, Singapore
  • 3rd live concert (Miku no Hi Kanshasai 39′s Giving Day) : March 9, 2010, Odaiba, Tokyo, Japan
  • 1st American live concert : September 18, 2010, San Francisco, USA
  • Vocarock Festival : January 11, 2011
  • Vocaloid Festa : February 12, 2011
  • 4th live concert : March 9, 2011, Tokyo, Japan
  • 2nd American live concert : October 11, 2010, Viz Cinema, San Francisco, USA; screening at the New York Anime Festival
  • 3rd American live concert (Mikunopolis) : July 2, 2011, Nokia Theater, Anime Expo, Los Angeles, USA

During the concerts, 3D animations of the Vocaloid mascots are projected on a transparent screen, giving the effect of a pseudo-hologram. Videos of different Vocaloid concerts are available at the following YouTube playlist.

UTAU, a program similar to Vocaloid developed by Ameya/Ayame, has been released as freeware. Cracked copies of Vocaloids are called Pocaloids.

Microsoft Tellme

October 7th, 2011

Microsoft Tellme simplifies everyday tasks with the natural power of your voice. You can talk to your PC, tablet, phone, TV or car.

The Microsoft Tellme technologies (“Say it. Get it”) provide speech recognition and synthesis capabilities in products ranging from Xbox Kinect for fun, to Microsoft Tellme IVR for customer care, to Windows Phone 7 for life and work.

In Windows 7 you can use voice recognition to control your computer and to dictate and edit text. A guide on how to set up your computer for this task is available at the Microsoft website.

The technologies provided for business applications are Microsoft Tellme IVR and embedded speech features in Office, Lync and Exchange. Different platforms are available : cloud, server, desktop and phone.

To extend the speech recognition functionality built into Windows on the desktop, you can use Windows Speech Recognition Macros or, for more advanced uses, the Microsoft Speech API (SAPI).

SAPI has been an integral component of all Microsoft Windows versions since Windows 98. Microsoft Windows XP and Windows Server 2003 include SAPI version 5.1. Windows Vista and Windows Server 2008 include SAPI version 5.3, while Windows 7 includes SAPI version 5.4. Code written for SAPI 5.3 (Vista) will run on SAPI 5.4 (Windows 7) without recompiling.

Playlist formats m3u and pls

July 20th, 2010

M3U is a file format to store multimedia playlists. It was first used by Winamp. PLS does the same, but is a more expressive format than basic M3U, as it can store information on the song title and length (M3U supports this only in its extended form). With PLS version 2, playlists also include a PLS version declaration.

iTunes, QuickTime Player, RealPlayer, Winamp, XBMC, XMPlay, VLC media player and many other programs play PLS files without any extra codecs.
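To make the difference between the two formats concrete, here is a minimal Python sketch that reads a PLS version 2 playlist and rewrites it as extended M3U. The sample playlist, the URLs in it and the function names `parse_pls` and `to_extended_m3u` are invented for illustration. The sketch relies on the fact that PLS uses an INI-like layout, so the standard `configparser` module can read it.

```python
import configparser

# Hypothetical sample playlist, made up for this example.
SAMPLE_PLS = """\
[playlist]
File1=http://example.com/stream.mp3
Title1=Example stream
Length1=-1
File2=music/song.mp3
Title2=Example song
Length2=215
NumberOfEntries=2
Version=2
"""


def parse_pls(text):
    """Return the playlist entries as a list of dicts (file, title, length)."""
    # Only "=" is used as delimiter so that URLs containing ":" stay intact.
    parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
    parser.read_string(text)
    section = parser["playlist"]
    count = section.getint("NumberOfEntries", fallback=0)
    entries = []
    for i in range(1, count + 1):
        entries.append({
            "file": section.get(f"File{i}"),
            "title": section.get(f"Title{i}", fallback=None),
            # A length of -1 conventionally marks a stream of unknown length.
            "length": section.getint(f"Length{i}", fallback=-1),
        })
    return entries


def to_extended_m3u(entries):
    """Serialize entries as extended M3U (#EXTM3U header plus #EXTINF lines)."""
    lines = ["#EXTM3U"]
    for entry in entries:
        title = entry["title"] or entry["file"]
        lines.append(f"#EXTINF:{entry['length']},{title}")
        lines.append(entry["file"])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_extended_m3u(parse_pls(SAMPLE_PLS)))
```

A basic M3U file would contain only the two file lines; the title and length survive the conversion only because of the `#EXTINF` lines of the extended form.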

Google Text-to-Speech (TTS) support

July 13th, 2010

Last update : 30 April 2011

On November 16th, 2009, Google announced on their official blog that English text-to-speech had been added to the translation tools. For this service, Google used eSpeak, an open source software speech synthesizer.

In May 2010, Google Translate added audio for more languages, including Afrikaans, Albanian, Catalan, Chinese (Mandarin), Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Haitian Creole, Hindi, Hungarian, Icelandic, Indonesian, Italian, Latvian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Spanish, Swahili, Swedish, Turkish, Vietnamese and Welsh.

The speech audio is in MP3 format and is queried via a simple HTTP GET (REST) request. For English, an example URL is:

http://translate.google.com/translate_tts?tl=en&q=how are you?

The TTS web service restricts the text to 100 characters and returns 404 (Not Found) if the request includes a Referer header.
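The two constraints above (URL-encoded query text and a 100-character limit) can be handled as in the following Python sketch. It only builds the request URLs and sends nothing over the network. The function names `split_text` and `tts_urls` are made up for this example, and the endpoint is taken from the example URL above, so its behaviour may well have changed since this was written.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the post; whether it still answers is an assumption.
TTS_ENDPOINT = "http://translate.google.com/translate_tts"
MAX_CHARS = 100  # text limit per request, as described above


def split_text(text, limit=MAX_CHARS):
    """Split text at word boundaries into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for word in text.split():
        if len(word) > limit:
            raise ValueError(f"single word longer than {limit} characters")
        candidate = f"{current} {word}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def tts_urls(text, lang="en"):
    """Return one properly encoded GET URL per chunk of text."""
    return [
        f"{TTS_ENDPOINT}?{urlencode({'tl': lang, 'q': chunk})}"
        for chunk in split_text(text)
    ]


if __name__ == "__main__":
    for url in tts_urls("how are you?"):
        print(url)
```

Each URL would then be fetched with a plain GET request and the MP3 responses played or concatenated in order. A client written this way should not set a Referer header, given the 404 behaviour described above.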

On December 3, 2010, Google acquired Phonetic Arts, a company specialised in speech synthesis. Phonetic Arts Limited delivers technology that generates natural, expressive speech. Its products include Phonetic Morpher, Phonetic LipSync and Phonetic Synthesizer. Phonetic Arts, formerly known as Tayvin 356 Limited, was founded in 2006 and is based in Cambridge, UK. The Phonetic Arts technology generates natural computer speech from small samples of recorded voice and should improve the voice output quality of Google’s text-to-speech applications.

Google provides not only speech output tools, but also speech input tools (Voice Search, Voice Input, Voice Actions), mainly in connection with the mobile phone OS Android.

Version 11 of the Google Chrome browser includes the HTML5 Speech Input API.

An amusing application of the Google TTS system is the Google Translate Beatbox.

Dewplayer : a Flash MP3 player

December 28th, 2009

Alsacréations, a web agency in Strasbourg, Alsace, specialised in designing websites that comply with the international W3C standards, has for several years offered a Flash MP3 audio player by Dew that is simple to install and to use.

Called Dewplayer, this player is distributed under a Creative Commons license; it can be used freely and at no cost, even in a professional or commercial context.

An XHTML code generator is available on the site; it produces code to copy and paste according to the user’s needs. Using swfobject is recommended to embed the player.

The player can be controlled with JavaScript, and numerous options are available. I have been using the player successfully for years. The most recent version is 1.9.6.

SoundFonts (.sf2)

December 9th, 2009

SoundFont, a registered trademark of E-mu Systems, Inc., is a name that collectively refers to a file format and associated technology to synthesize audio in the context of computer music composition. The exclusive license for re-formatting and managing historical SoundFont content has been acquired by Digital Sound Factory.

A SoundFont file, or SoundFont bank, contains one or more sampled audio waveforms (or samples), which can be re-synthesized at different pitches and dynamic levels. SoundFont banks are related to MIDI devices and can be seamlessly used in place of General MIDI (GM) patches in many computer music sequencers.

The original SoundFont file format was developed in the early 1990s by E-mu Systems and Creative Labs (used in the Sound Blaster AWE32). Files in this format conventionally have the file extension sbk. The SoundFont 2.0 version was released in 1996 and was fully disclosed as a public specification to make it an industry standard. New versions up to 2.4 have been released in the past years, and the new SoundFont files conventionally have the file extension sf2.
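An sf2 file is a RIFF container with the form type `sfbk`, and the format version is stored in an `ifil` sub-chunk of the `INFO` list as two little-endian 16-bit numbers (major, minor). The following Python sketch reads that version from the raw bytes of a file. It is a simplified illustration under these assumptions, not a full parser, and the function name `read_sf2_version` is made up for this example.

```python
import struct


def read_sf2_version(data: bytes):
    """Return (major, minor) from the ifil chunk of a SoundFont 2 byte string."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"sfbk":
        raise ValueError("not a SoundFont 2 (RIFF/sfbk) file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack_from("<I", data, pos + 4)
        body = pos + 8
        if chunk_id == b"LIST" and data[body:body + 4] == b"INFO":
            # Walk the sub-chunks of the INFO list.
            sub, end = body + 4, body + size
            while sub + 8 <= end:
                sub_id = data[sub:sub + 4]
                (sub_size,) = struct.unpack_from("<I", data, sub + 4)
                if sub_id == b"ifil":
                    major, minor = struct.unpack_from("<HH", data, sub + 8)
                    return major, minor
                # RIFF chunks are padded to an even number of bytes.
                sub += 8 + sub_size + (sub_size & 1)
        pos = body + size + (size & 1)
    raise ValueError("no ifil version chunk found")
```

For a real bank one might call `read_sf2_version(open("bank.sf2", "rb").read())`, where `bank.sf2` is a placeholder file name. Reading the whole file into memory is acceptable for a sketch but wasteful for large banks, since the `INFO` list normally sits at the start of the file.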

There are other sound formats available, e.g. the DownLoadable Sounds (DLS) format standardized by the MIDI Manufacturers Association (MMA), DLS Level 2 and the Structured Audio Sample Bank Format (SASBF) standardized by the MPEG standards body in collaboration with MMA and MIT, and proprietary formats developed by Yamaha and other music companies. Nevertheless, sf2 SoundFonts became a de facto standard and are widely used today.

Many websites offer free and commercial sf2 SoundFonts.

The following tools are well suited to working with SoundFonts :

  • SynthFont : a free MIDI file player using SoundFonts
  • Viena : a free SoundFont editor
  • FluidSynth : an open source real-time software synthesizer used in several music applications
  • Gervill : a software sound synthesizer for use with the Java Sound API
  • SFPack and SFArk : archivers for SoundFont banks which use different compression techniques

VoicePHP : build voice enabled applications directly in PHP without any 3rd party APIs

November 8th, 2009

VoicePHP is not an extension to PHP; in fact, it is the same PHP, which now outputs voice instead of text and also takes voice instead of text as input. In technical terms, it is PHP whose standard text-based input and output (stdin and stdout in programmers’ terms) are replaced by voice equivalents.

VoicePHP diagram

VOXEO hosting platform

December 30th, 2008

Voxeo offers three main application platforms for free to developers: CallXML, CCXML, and VoiceXML.

The Prophecy 8.0 – CallXML 3.0 platform allows developers to build robust IVR applications using only static content. CallXML suits the needs of most telephony applications that use touchtone input (DTMF).

The Prophecy 8.0 – CCXML W3C 1.0 platform allows developers to deploy next-generation conferencing and call routing applications and to ensure that they will stand the test of time.

The Prophecy 8.0 – VoiceXML 2.1 platform includes the Voxeo ASR engine (available only in US English) and is the world’s first and only 100% certified-compliant VoiceXML 2.0 browser. The Prophecy Platform supports all the VoiceXML 2.1 additions and enhancements, as well as the SISR/SRGS grammar formatting standards. It also includes legacy support for the older GSL and JSGF grammar formats.
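To give an idea of what a VoiceXML browser such as the one described above executes, the following Python sketch generates a minimal VoiceXML 2.1 document that speaks a single prompt. It is an illustration only and not the test application mentioned in this post; the function name `hello_world_vxml` and the default message are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C VoiceXML 2.x specifications.
VXML_NS = "http://www.w3.org/2001/vxml"


def hello_world_vxml(message="Hello World"):
    """Build a minimal VoiceXML 2.1 document with one form and one prompt."""
    # Serialize the VoiceXML namespace as the default namespace (no prefix).
    ET.register_namespace("", VXML_NS)
    vxml = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.1"})
    form = ET.SubElement(vxml, f"{{{VXML_NS}}}form")
    block = ET.SubElement(form, f"{{{VXML_NS}}}block")
    prompt = ET.SubElement(block, f"{{{VXML_NS}}}prompt")
    prompt.text = message  # rendered by the platform's text-to-speech engine
    body = ET.tostring(vxml, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(hello_world_vxml())
```

The resulting document would be published at a URL, and the hosting platform would fetch and interpret it when a call comes in; how an application is mapped to a phone number is specific to the hosting account and is not shown here.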

I created an account a few years ago to do my first trials with VoiceXML. Today I updated the account and started with a new HelloWorld test application. The telephone numbers to access the application are the following:

  • Skype VoIP :  +99000936 9991425592
  • SIP VoIP :  sip:9991425592@sip.voxeo.net
  • iNum Number : +883510001801392

iNum Number from Luxembourg : +352 20880108 p 883510001801392

When calling an iNum number from a mobile phone (for instance a BlackBerry), you have to insert a pause between the local number and the iNum number by using the menu while editing the number.

My first iNum call to the HelloWorld application was successfully established today at 21h21 with my mobile phone.

Sphinx-4 : a Java speech recognizer

June 20th, 2005

Sphinx-4 is a state-of-the-art speech recognition system written in Java. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).

Sphinx-4 contains the following demo programs :

  • Hello World Demo: a command line application that recognizes simple phrases
  • Hello Digits Demo: a command line application that recognizes connected digits
  • Hello N-Gram Demo: a command line application using an N-gram language model for speech recognition
  • ZipCity Demo: a Java Web Start technology application that recognizes spoken zip codes and locates the associated city and state
  • WavFile Demo: a simple demo program to show how to decode audio files (e.g., .wav, .au files)
  • Transcriber Demo: a simple demo program showing how to transcribe a continuous audio file that has multiple utterances separated by silences
  • JSGF Demo: a simple demo program showing how a program can swap between multiple JSGF grammars
  • Dialog Demo: a demo program showing how a program can swap between multiple JSGF and dictation grammars
  • Action Tags Demo: a demo program showing how to use action tags for post-processing of RuleParse objects obtained from JSGF grammars
  • Confidence Demo: a simple demo program showing how to obtain confidence scores for results
  • Lattice Demo: a simple demo program showing how to extract lattices from recognition results

A number of tests and demos rely on having JSAPI installed. Sphinx-4 can be combined with FreeTTS to set up a complete voice interface or a VoiceXML server.

FreeTTS : a Java speech synthesizer

June 19th, 2005

FreeTTS is a speech synthesis system written entirely in Java. It is based upon Flite, a small run-time speech synthesis engine developed at Carnegie Mellon University. Flite is derived from the Festival Speech Synthesis System from the University of Edinburgh and the FestVox project from Carnegie Mellon University. FreeTTS was built by the Speech Integration Group of Sun Microsystems Laboratories.

Possible uses of FreeTTS are:

  • JSAPI (Java Speech API) speech synthesizer
  • Remote TTS Server, to act as a back-end text-to-speech engine that works with a speech/telephony system, or does the “heavy lifting” for a wireless PDA
  • Workstation/Desktop TTS engine
  • Downloadable Web Application (FreeTTS cannot be used in an applet)

FreeTTS includes the following demos :

  • JSAPI/HelloWorld: uses the JSAPI 1.0 Synthesis interface to speak “Hello, World”
  • JSAPI/MixedVoices: demonstrates using multiple voices and speech synthesizers in a coordinated fashion using JSAPI 1.0
  • JSAPI/Player: Swing-based GUI (graphical user interface) that allows the user to monitor and manipulate a JSAPI 1.0 Speech Synthesizer
  • JSAPI/JTime: JSAPI program that uses a limited-domain, high quality voice to tell the time
  • JSAPI/Emacspeak: uses JSAPI 1.0 to provide a text-to-speech server for Emacspeak
  • JSAPI/WebStartClock: JSAPI talking clock that can be downloaded from the web using Java Web Start
  • freetts/HelloWorld: low-level (non-JSAPI) program that speaks a greeting to the world
  • freetts/ClientServer: low-level (non-JSAPI) socket-based TTS server with sample clients written in the C programming language and the Java programming language.

To write software with FreeTTS, it is recommended to use the Java Speech API (JSAPI) 1.0 to interface with FreeTTS. The JSAPI interface provides the best method of controlling and using FreeTTS.

Currently, the FreeTTS distribution comes with these 3 voices:

  • a low quality, unlimited domain, 8kHz diphone voice, called kevin
  • a medium quality, unlimited domain, 16kHz diphone voice, called kevin16
  • a high quality, limited domain, 16kHz cluster unit voice, called alan

FreeTTS interfaces with the MBROLA synthesizer and can use MBROLA voices. It’s also possible to import voice data from Festival and FestVox or CMU ARCTIC voices.

CloudGarden has developed a full implementation of Sun’s Java Speech API for Windows platforms. It allows a large range of SAPI4 and SAPI5 compliant text-to-speech and speech recognition engines (in many different languages) to be programmed using the standard Java Speech API. Packages and additional classes augment the capabilities of the JSAPI by, for example, integrating with Sun’s JMF, which allows, among other things, MPEG audio files to be created and read and compressed audio data to be transmitted across a network.