Grapheme-to-phoneme (G2P) conversion is the process of translating a word's written characters (e.g., "hello") into the phonemes that indicate how to pronounce it (e.g., HH AH L OW or HH EH L OW in ARPAbet notation).

English spelling has so many quirks that even state-of-the-art systems have error rates as high as 20%–30%. See, for example, CMU Sphinx's g2p-seq2seq. I'm currently working on a small, embeddable G2P system.

An interesting application of G2P is finding words that have surprising pronunciations.
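
One way to make "surprising" concrete is sketched below. This is only an illustration under assumptions, not a description of how my system scores words: it assumes a word is surprising when the pronunciation a G2P model predicts is far, in edit distance, from the pronunciation in a reference dictionary. The names `NAIVE`, `naive_predict`, `surprise`, and `rank_by_surprise` are hypothetical, and the letter-by-letter predictor is a toy stand-in for a real model.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def surprise(predicted, reference):
    """Edit distance scaled to [0, 1] by the longer sequence (an assumed metric)."""
    longest = max(len(predicted), len(reference), 1)
    return edit_distance(predicted, reference) / longest


def rank_by_surprise(lexicon, predict):
    """Return the words of `lexicon`, most surprising first."""
    scored = [(surprise(predict(word), ref), word) for word, ref in lexicon.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored]


# Toy stand-in for a real G2P model: one fixed phoneme per letter (hypothetical).
NAIVE = {"a": "AE", "c": "K", "e": "EH", "h": "HH",
         "l": "L", "n": "N", "o": "OW", "t": "T"}


def naive_predict(word):
    return [NAIVE[ch] for ch in word if ch in NAIVE]


# Tiny reference lexicon in ARPAbet, for illustration only.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "hello": ["HH", "AH", "L", "OW"],
    "colonel": ["K", "ER", "N", "AH", "L"],
}

ranking = rank_by_surprise(LEXICON, naive_predict)
```

With this toy predictor, "colonel" ranks above "hello", which ranks above "cat". A real system could instead use the probability its model assigns to the dictionary pronunciation, which would likely give a sharper ranking than edit distance.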

Just for fun, below is a list of the 250 words that my G2P system found most surprising.

  1. worcestershire
  2. kalthoff
  3. versailles
  4. dicesare
  5. interacciones
  6. natchitoches
  7. peugeot
  8. krawccykiewi
  9. thereof
  10. mackiewicz
  11. guillermo
  12. hors-d-oeuvre
  13. schiewe
  14. beyonce
  15. dachshunds
  16. colonels
  17. mckrinkowski
  18. laboratoires
  19. wojnarowski
  20. dorotea
  21. thereto
  22. worcester
  23. worthog
  24. paranoiac
  25. naivete
  26. faribault
  27. rendezvous
  28. jean-bertrand
  29. descartes
  30. polyurethane
  31. moszkowski
  32. girardot
  33. self-destruction
  34. signore
  35. paraphernalia
  36. flaharty
  37. algorithmic
  38. jerrome
  39. macgowan
  40. genuineness
  41. henault
  42. borchardt
  43. pleuritides
  44. hillesheim
  45. bischoff
  46. bialkowski
  47. amirault
  48. cesare
  49. caltagirone
  50. javier
  51. kuenheim
  52. valkyrie
  53. muhamed
  54. facemire
  55. clotheshorse
  56. ahasuerus
  57. celestine
  58. wereldhave
  59. cliched
  60. jorge
  61. cortese
  62. teicholz
  63. girouard
  64. garages
  65. garage
  66. blancmange
  67. deandrade
  68. aristophanes
  69. mankiewicz
  70. achmed
  71. faciane
  72. bouygues
  73. peyote
  74. okurowski
  75. lagniappe
  76. deringer
  77. aleshire
  78. lincolnshire
  79. sayed
  80. lepere
  81. kumbaya
  82. mothershed
  83. voyeur
  84. cieslewicz
  85. paranoia
  86. arroyo
  87. wickwire
  88. jaroszynski
  89. maret
  90. barrage
  91. prometheus
  92. polyhemoglobin
  93. samurais
  94. minjares
  95. kaweske
  96. cherumirdan
  97. double-entendre
  98. extraterrestrial
  99. daigrepont
  100. charlemagne
  101. chaim
  102. khaled
  103. genre
  104. dubreuil
  105. chappelear
  106. sangiovese
  107. asthmatics
  108. construcciones
  109. carrasquillo
  110. kucewicz
  111. pernod
  112. ptolemaic
  113. iseminger
  114. palmstierna
  115. rapprochement
  116. buenos-aires
  117. cohenour
  118. reichart
  119. sawicz
  120. jean-michele
  121. neuroscientist
  122. neuroscience
  123. bolognese
  124. dachshund
  125. unnecessary
  126. pleomorphism
  127. popolare
  128. loverde
  129. mozartean
  130. contrapunction
  131. dioceses
  132. francaises
  133. generales
  134. preadolescence
  135. mazzorana
  136. jolla
  137. ballet
  138. cuisinart
  139. thereafter
  140. gutierres
  141. grosvenor
  142. balcerowicz
  143. borocce
  144. periodically
  145. therapeutically
  146. brobdingnagians
  147. brobdingnagian
  148. financiera
  149. leccese
  150. chippendales
  151. marciante
  152. keresztes
  153. conscientiously
  154. terrebonne
  155. wiseguy
  156. avelar
  157. parenthetically
  158. macewan
  159. janachowski
  160. entourages
  161. lefebure
  162. jarislowsky
  163. victorine
  164. alvares
  165. ferruzzi
  166. nieves
  167. syracuse
  168. quebecoise
  169. morones
  170. jasinowski
  171. entourage
  172. calmes
  173. etheljean
  174. sans-culottes
  175. orestes
  176. prudentialbache
  177. palimpsest
  178. perniciaro
  179. monteverde
  180. nithuekan
  181. pelczar
  182. algodones
  183. burciaga
  184. adhering
  185. allender
  186. vaquera
  187. vasques
  188. storaska
  189. cifuentes
  190. uemura
  191. boucher
  192. heuristic
  193. abiquiu
  194. berlascone
  195. almaguer
  196. conjurer
  197. marquis
  198. karanicki
  199. maraline
  200. oneyear
  201. gasiorowski
  202. pharisaism
  203. aherin
  204. erlanger
  205. parliamentary
  206. anteriormost
  207. paratore
  208. decesare
  209. laliberte
  210. parfums
  211. garzarelli
  212. lunceford
  213. marylebone
  214. unprofor
  215. verdone
  216. turrentine
  217. daiquiri
  218. barrientes
  219. caribbean
  220. dercole
  221. dearborn
  222. deardourff
  223. linares
  224. senor
  225. sarine
  226. algebraically
  227. corzine
  228. mertes
  229. beautifully
  230. macmahon
  231. bridgham
  232. spagnuolo
  233. paraplegia
  234. reeducation
  235. bialecki
  236. juicier
  237. villafuerte
  238. becerril
  239. gentlest
  240. shingler
  241. alegre
  242. realigning
  243. compaore
  244. reigniting
  245. kawecki
  246. caradine
  247. wyden
  248. lamere
  249. nettesheim
  250. surace

Published 15 May 2019 by Benjamin Johnston.