Converting tab-delimited translation memories to TMX with Okapi Tikal

Introduction

In this post we explain how to convert a translation memory from tab-delimited text format to TMX using Tikal, part of the Okapi tools (http://okapi.opentag.com/). The Okapi tools cover a wide range of translation and localization tasks, including the one at hand.

Recall that a translation memory in tab-delimited text format holds one segment pair per line, like this:

segment_language_A <TAB> segment_language_B
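Before converting, it can help to confirm that every line really follows this layout. The following is a minimal Python sketch, not part of Okapi; the function name find_malformed_lines and the sample sentences are assumptions made for illustration. It reports any non-blank line that does not consist of exactly two non-empty fields separated by a single tab.

```python
import io


def find_malformed_lines(lines):
    """Return (line_number, text) pairs for lines that are not 'source<TAB>target'."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        if not text:
            continue  # blank lines are skipped, not reported
        fields = text.split("\t")
        # A valid line has exactly two fields and neither is empty.
        if len(fields) != 2 or not all(field.strip() for field in fields):
            problems.append((number, text))
    return problems


# Hypothetical sample data: the second line has no tab.
sample = io.StringIO("Hello\tHola\nNo tab here\nGood morning\tBuenos días\n")
print(find_malformed_lines(sample))  # prints [(2, 'No tab here')]
```

To check a real file, pass an open file object instead of the sample, for example open("corpus-ONU-eng-spa.txt", encoding="windows-1252"), adjusting the encoding to match the file.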

If language A is English and language B is Spanish, the same segment in TMX would look like this:

[Image: the segment in TMX format]
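In TMX, each segment pair becomes a translation unit (tu) that contains one language-tagged variant (tuv) per language, each wrapping its text in a seg element. As a rough illustration of that mapping, here is a minimal Python sketch. The helper make_tu and the example sentence are assumptions made for this post; the output shows only the core structure, without the header and attributes that Tikal writes to a real file.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute as ElementTree expects it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tu(source_text, target_text, source_lang="en", target_lang="es"):
    """Build a bare TMX <tu> element holding one source and one target variant."""
    tu = ET.Element("tu")
    for lang, text in ((source_lang, source_text), (target_lang, target_text)):
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        seg = ET.SubElement(tuv, "seg")
        seg.text = text
    return tu


# Hypothetical tab-delimited line: English source, Spanish target.
line = "Good morning\tBuenos días"
source, target = line.split("\t")
print(ET.tostring(make_tu(source, target), encoding="unicode"))
```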

This conversion lets us use the translation memories in computer-assisted translation tools.

Conversion with Tikal

Running tikal without any additional parameters displays its full usage instructions:

[Image: output of tikal run without parameters]

Among the options listed there is -2tmx, which performs the conversion to this format. The usage text shows the parameters it accepts:

[Image: parameters of the -2tmx option]

Besides specifying the source language (-sl) and the target language (-tl), it is essential to give the right value to -fc, the parameter that controls the format of the input file. Its value must be one of the available filter configurations, which we can list by typing:

tikal -listconf

This returns the list of configurations:

D:\okapi>tikal -listconf
-------------------------------------------------------------------------------
Okapi Tikal - Localization Toolset
Version: 2.0.23
-------------------------------------------------------------------------------
List of all filter configurations available:
 - okf_txml = Wordfast Pro TXML documents
 - okf_txml-fillEmptyTargets = Wordfast Pro TXML documents with empty targets fi
lled on output.
 - okf_itshtml5 = Configuration for standard HTML5 documents.
 - okf_doxygen = Doxygen-commented Text Documents
 - okf_wiki = Text with wiki-style markup
 - okf_mosestext = Default Moses Text configuration.
 - okf_tradosrtf = Configuration for Trados-tagged RTF files - READING ONLY.
 - okf_rainbowkit = Configuration for Rainbow translation kit.
 - okf_rainbowkit-package = Configuration for Rainbow translation kit package.
 - okf_rainbowkit-noprompt = Configuration for Rainbow translation kit (without
prompt).
 - okf_mif = Adobe FrameMaker MIF documents
 - okf_archive = Configuration for archive files
 - okf_transifex = Transifex project with prompt when starting
 - okf_transifex-noPrompt = Transifex project without prompt when starting
 - okf_xini = Configuration for XINI documents from ONTRAM
 - okf_xini-noOutputSegmentation = Configuration for XINI documents from ONTRAM
(fields in the output are not segmented)
 - okf_xliff = Configuration for XML Localisation Interchange File Format (XLIFF
) documents.
 - okf_openxml = Microsoft Office documents (DOCX, XLSX, PPTX).
 - okf_openoffice = OpenOffice.org ODT, ODS, ODP, ODG, OTT, OTS, OTP, OTG docume
nts
 - okf_simplification = Configuration for extracting resources from an XML file.
 Resources and then codes are simplified.
 - okf_simplification-xmlResources = Configuration for extracting resources from
 an XML file. Resources are simplified.
 - okf_simplification-xmlCodes = Configuration for extracting resources from an
XML file. Codes are simplified.
 - okf_properties = Java properties files (Output used \uHHHH escapes)
 - okf_properties-outputNotEscaped = Java properties files (Characters in the ou
tput encoding are not escaped)
 - okf_properties-skypeLang = Skype language properties files (including support
 for HTML codes)
 - okf_properties-html-subfilter = Java Property content processed by an HTML su
bfilter
 - okf_dtd = Configuration for XML DTD documents (entities content)
 - okf_html = HTML or XHTML documents
 - okf_html-wellFormed = XHTML and well-formed HTML documents
 - okf_po = Standard bilingual PO files
 - okf_po-monolingual = Monolingual PO files (msgid is a real ID, not the source
 text).
 - okf_regex = Default Regex configuration.
 - okf_regex-srt = Configuration for SRT (Sub-Rip Text) sub-titles files.
 - okf_regex-textLine = Configuration for text files where each line is a text u
nit
 - okf_regex-textBlock = Configuration for text files where text units are separ
ated by 2 or more line-breaks.
 - okf_regex-macStrings = Configuration for Macintosh .strings files.
 - okf_ts = Configuration for Qt TS files.
 - okf_tmx = Configuration for Translation Memory eXchange (TMX) documents.
 - okf_xml = Configuration for generic XML documents (default ITS rules).
 - okf_xml-resx = Configuration for Microsoft RESX documents (without binary dat
a).
 - okf_xml-MozillaRDF = Configuration for Mozilla RDF documents.
 - okf_xml-JavaProperties = Configuration for Java Properties files in XML.
 - okf_xml-AndroidStrings = Configuration for Android Strings XML documents.
 - okf_xml-WixLocalization = Configuration for WiX (Windows Installer XML) Local
ization files.
 - okf_idml = Adobe InDesign IDML documents
 - okf_json = Configuration for JSON files
 - okf_phpcontent = Default PHP Content configuration.
 - okf_ttx = Configuration for Trados TTX documents.
 - okf_pensieve = Configuration for Pensieve translation memories.
 - okf_vignette = Default Vignette Export/Import Content configuration.
 - okf_vignette-nocdata = Vignette files without CDATA sections.
 - okf_railsyaml = Ruby on Rails YAML files
 - okf_xmlstream = Large XML Documents
 - okf_xmlstream-dita = DITA XML
 - okf_xmlstream-JavaPropertiesHTML = Java Properties XML with Embedded HTML
 - okf_versifiedtxt = Versified Text Documents
 - okf_table = Table-like files such as tab-delimited, CSV, fixed-width columns,
 etc.
 - okf_table_csv = Comma-separated values, optional header with field names.
 - okf_table_catkeys = Haiku CatKeys resource files
 - okf_table_src-tab-trg = 2-column (source + target), tab separated files.
 - okf_table_fwc = Fixed-width columns table padded with white-spaces.
 - okf_table_tsv = Columns, separated by one or more tabs.
 - okf_plaintext = Plain text files.
 - okf_plaintext_trim_trail = Text files; trailing spaces and tabs removed from
extracted lines.
 - okf_plaintext_trim_all = Text files; leading and trailing spaces and tabs rem
oved from extracted lines.
 - okf_plaintext_paragraphs = Text files extracted by paragraphs (separated by 1
 or more empty lines).
 - okf_plaintext_spliced_backslash = Spliced lines filter with the backslash cha
racter (\) used as the splicer.
 - okf_plaintext_spliced_underscore = Spliced lines filter with the underscore c
haracter (_) used as the splicer.
 - okf_plaintext_spliced_custom = Spliced lines filter with a user-defined splic
er.
 - okf_plaintext_regex_lines = Plain Text Filter using regex-based linebreak sea
rch. Extracts by lines.
 - okf_plaintext_regex_paragraphs = Plain Text Filter using regex-based linebrea
k search. Extracts by paragraphs.
 - okf_odf = XML OpenDocument files (e.g. use inside OpenOffice.org documents).
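The listing is long, and only the table configurations matter for a tab-delimited file. If the output of tikal -listconf has been saved as text, a short script can narrow it down. This is a minimal Python sketch under stated assumptions: parse_listconf is a hypothetical helper, it assumes each entry starts with " - id = description" as in the listing above, and it glues console-wrapped continuation lines back on without restoring any space lost at the wrap point.

```python
def parse_listconf(output):
    """Map each configuration id to its description from 'tikal -listconf' text."""
    configs = {}
    current = None
    for line in output.splitlines():
        if line.startswith(" - ") and " = " in line:
            # New entry: " - okf_xxx = description"
            current, description = line[3:].split(" = ", 1)
            configs[current] = description
        elif current is not None and line.strip():
            # Continuation of a description wrapped by the console.
            configs[current] += line
    return configs


# Shortened sample copied from the listing above.
sample = """List of all filter configurations available:
 - okf_table_csv = Comma-separated values, optional header with field names.
 - okf_table_src-tab-trg = 2-column (source + target), tab separated files.
 - okf_plaintext_trim_trail = Text files; trailing spaces and tabs removed from
extracted lines.
"""

for config_id, description in parse_listconf(sample).items():
    if config_id.startswith("okf_table"):
        print(config_id, "=", description)
```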

The one we are interested in is:

- okf_table_src-tab-trg = 2-column (source + target), tab separated files.

So, to perform the conversion we type the following (assuming the file to convert is called corpus-ONU-eng-spa.txt):

tikal -2tmx corpus-ONU-eng-spa.txt -sl en -tl es -fc okf_table_src-tab-trg
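When several files need converting, the same command can be assembled from a script. The following is a minimal Python sketch that uses only the options shown in this post (-2tmx, -sl, -tl, -fc). The function names are assumptions made for illustration, and run_conversion assumes that tikal is on the PATH; on Windows the launcher may be a script, in which case its full file name may be needed.

```python
import subprocess


def build_tikal_command(input_file, source_lang, target_lang, filter_config):
    """Return the argument list for a tikal -2tmx conversion."""
    return [
        "tikal", "-2tmx", input_file,
        "-sl", source_lang,
        "-tl", target_lang,
        "-fc", filter_config,
    ]


def run_conversion(input_file, source_lang, target_lang, filter_config):
    """Run the conversion; raises CalledProcessError if tikal exits with an error."""
    command = build_tikal_command(input_file, source_lang, target_lang, filter_config)
    subprocess.run(command, check=True)


command = build_tikal_command(
    "corpus-ONU-eng-spa.txt", "en", "es", "okf_table_src-tab-trg"
)
print(" ".join(command))
# prints: tikal -2tmx corpus-ONU-eng-spa.txt -sl en -tl es -fc okf_table_src-tab-trg
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.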

Tikal then performs the conversion and prints:

-------------------------------------------------------------------------------
Okapi Tikal - Localization Toolset
Version: 2.0.23
-------------------------------------------------------------------------------
Conversion to TMX
Source language: en
Target language: es
Default input encoding: windows-1252
Filter configuration: okf_table_src-tab-trg
Output: corpus-ONU-eng-spa.txt.tmx
Input: /D:/okapi/corpus-ONU-eng-spa.txt
Done in 3.568s

The converted file is called corpus-ONU-eng-spa.txt.tmx
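A quick sanity check after the conversion is to compare the number of translation units in the TMX file with the number of non-blank lines in the original text file. This is a minimal Python sketch, not an Okapi feature: both function names are assumptions, and the windows-1252 default simply mirrors the input encoding reported in the output above. The two counts may legitimately differ if Tikal skips or merges some lines, so treat a mismatch as a prompt to inspect the files rather than as proof of an error.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def count_translation_units(tmx_path):
    """Count the <tu> elements in a TMX file."""
    root = ET.parse(tmx_path).getroot()
    return sum(1 for _ in root.iter("tu"))


def count_segment_lines(text_path, encoding="windows-1252"):
    """Count the non-blank lines in a tab-delimited text file."""
    with open(text_path, encoding=encoding) as handle:
        return sum(1 for line in handle if line.strip())


text_file = Path("corpus-ONU-eng-spa.txt")
tmx_file = Path("corpus-ONU-eng-spa.txt.tmx")
# Only compare when both files are present in the current directory.
if text_file.exists() and tmx_file.exists():
    print("lines in text file:", count_segment_lines(text_file))
    print("units in TMX file: ", count_translation_units(tmx_file))
```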