Introduction
In this post we explain how to convert a translation memory from tab-delimited text format to TMX using Tikal, part of Okapi Tools (http://okapi.opentag.com/). The Okapi tools handle a wide range of translation and localization tasks, including the one that concerns us here.
Recall that a translation memory in tab-delimited text format looks like this, with one pair of segments per line:
segment_language_A [tab] segment_language_B
If language A is English and language B is Spanish, the same segment in TMX becomes a translation unit (a tu element) containing one tuv element per language, each marked with its language code and holding the segment text in a seg element.
This conversion lets us use the translation memories in computer-assisted translation tools.
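To make the correspondence between the two formats concrete, here is a minimal Python sketch that builds a TMX translation unit from one tab-delimited line. It only illustrates the structure and is not how Tikal works internally. The function name line_to_tu and the sample sentence pair are hypothetical.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute, as ElementTree expects it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def line_to_tu(line, src_lang="en", tgt_lang="es"):
    """Build a TMX <tu> element from one 'source<TAB>target' line."""
    source, target = line.rstrip("\n").split("\t", 1)
    tu = ET.Element("tu")
    for lang, text in ((src_lang, source), (tgt_lang, target)):
        # One <tuv> per language, with the segment text inside <seg>.
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        seg = ET.SubElement(tuv, "seg")
        seg.text = text
    return tu


# Hypothetical sentence pair, used only for illustration.
tu = line_to_tu("Hello world\tHola mundo")
print(ET.tostring(tu, encoding="unicode"))
```

A complete TMX file also wraps the translation units in a tmx root element with a header and a body, which this sketch leaves out.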
Conversion with Tikal
Running tikal without any additional parameters displays its full list of instructions.
Among them is a -2tmx option, which performs the conversion to this format and takes several parameters.
Besides the source language (-sl) and the target language (-tl), it is very important to give the right value to -fc, the parameter that controls the format of the input file. Its value must be one of the available filter configurations. We can obtain that list by typing:
tikal -listconf
This returns the list of configurations:
D:\okapi>tikal -listconf
-------------------------------------------------------------------------------
Okapi Tikal - Localization Toolset
Version: 2.0.23
-------------------------------------------------------------------------------
List of all filter configurations available:
- okf_txml = Wordfast Pro TXML documents
- okf_txml-fillEmptyTargets = Wordfast Pro TXML documents with empty targets filled on output.
- okf_itshtml5 = Configuration for standard HTML5 documents.
- okf_doxygen = Doxygen-commented Text Documents
- okf_wiki = Text with wiki-style markup
- okf_mosestext = Default Moses Text configuration.
- okf_tradosrtf = Configuration for Trados-tagged RTF files - READING ONLY.
- okf_rainbowkit = Configuration for Rainbow translation kit.
- okf_rainbowkit-package = Configuration for Rainbow translation kit package.
- okf_rainbowkit-noprompt = Configuration for Rainbow translation kit (without prompt).
- okf_mif = Adobe FrameMaker MIF documents
- okf_archive = Configuration for archive files
- okf_transifex = Transifex project with prompt when starting
- okf_transifex-noPrompt = Transifex project without prompt when starting
- okf_xini = Configuration for XINI documents from ONTRAM
- okf_xini-noOutputSegmentation = Configuration for XINI documents from ONTRAM (fields in the output are not segmented)
- okf_xliff = Configuration for XML Localisation Interchange File Format (XLIFF) documents.
- okf_openxml = Microsoft Office documents (DOCX, XLSX, PPTX).
- okf_openoffice = OpenOffice.org ODT, ODS, ODP, ODG, OTT, OTS, OTP, OTG documents
- okf_simplification = Configuration for extracting resources from an XML file. Resources and then codes are simplified.
- okf_simplification-xmlResources = Configuration for extracting resources from an XML file. Resources are simplified.
- okf_simplification-xmlCodes = Configuration for extracting resources from an XML file. Codes are simplified.
- okf_properties = Java properties files (Output used \uHHHH escapes)
- okf_properties-outputNotEscaped = Java properties files (Characters in the output encoding are not escaped)
- okf_properties-skypeLang = Skype language properties files (including support for HTML codes)
- okf_properties-html-subfilter = Java Property content processed by an HTML subfilter
- okf_dtd = Configuration for XML DTD documents (entities content)
- okf_html = HTML or XHTML documents
- okf_html-wellFormed = XHTML and well-formed HTML documents
- okf_po = Standard bilingual PO files
- okf_po-monolingual = Monolingual PO files (msgid is a real ID, not the source text).
- okf_regex = Default Regex configuration.
- okf_regex-srt = Configuration for SRT (Sub-Rip Text) sub-titles files.
- okf_regex-textLine = Configuration for text files where each line is a text unit
- okf_regex-textBlock = Configuration for text files where text units are separated by 2 or more line-breaks.
- okf_regex-macStrings = Configuration for Macintosh .strings files.
- okf_ts = Configuration for Qt TS files.
- okf_tmx = Configuration for Translation Memory eXchange (TMX) documents.
- okf_xml = Configuration for generic XML documents (default ITS rules).
- okf_xml-resx = Configuration for Microsoft RESX documents (without binary data).
- okf_xml-MozillaRDF = Configuration for Mozilla RDF documents.
- okf_xml-JavaProperties = Configuration for Java Properties files in XML.
- okf_xml-AndroidStrings = Configuration for Android Strings XML documents.
- okf_xml-WixLocalization = Configuration for WiX (Windows Installer XML) Localization files.
- okf_idml = Adobe InDesign IDML documents
- okf_json = Configuration for JSON files
- okf_phpcontent = Default PHP Content configuration.
- okf_ttx = Configuration for Trados TTX documents.
- okf_pensieve = Configuration for Pensieve translation memories.
- okf_vignette = Default Vignette Export/Import Content configuration.
- okf_vignette-nocdata = Vignette files without CDATA sections.
- okf_railsyaml = Ruby on Rails YAML files
- okf_xmlstream = Large XML Documents
- okf_xmlstream-dita = DITA XML
- okf_xmlstream-JavaPropertiesHTML = Java Properties XML with Embedded HTML
- okf_versifiedtxt = Versified Text Documents
- okf_table = Table-like files such as tab-delimited, CSV, fixed-width columns, etc.
- okf_table_csv = Comma-separated values, optional header with field names.
- okf_table_catkeys = Haiku CatKeys resource files
- okf_table_src-tab-trg = 2-column (source + target), tab separated files.
- okf_table_fwc = Fixed-width columns table padded with white-spaces.
- okf_table_tsv = Columns, separated by one or more tabs.
- okf_plaintext = Plain text files.
- okf_plaintext_trim_trail = Text files; trailing spaces and tabs removed from extracted lines.
- okf_plaintext_trim_all = Text files; leading and trailing spaces and tabs removed from extracted lines.
- okf_plaintext_paragraphs = Text files extracted by paragraphs (separated by 1 or more empty lines).
- okf_plaintext_spliced_backslash = Spliced lines filter with the backslash character (\) used as the splicer.
- okf_plaintext_spliced_underscore = Spliced lines filter with the underscore character (_) used as the splicer.
- okf_plaintext_spliced_custom = Spliced lines filter with a user-defined splicer.
- okf_plaintext_regex_lines = Plain Text Filter using regex-based linebreak search. Extracts by lines.
- okf_plaintext_regex_paragraphs = Plain Text Filter using regex-based linebreak search. Extracts by paragraphs.
- okf_odf = XML OpenDocument files (e.g. use inside OpenOffice.org documents).
The one we are interested in is:
- okf_table_src-tab-trg = 2-column (source + target), tab separated files.
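Because the list is long, it can help to filter it programmatically. The following Python sketch parses text in the "- id = description" shape shown above and keeps only the table-related configurations. The function name parse_listconf is hypothetical, and the sample text is a short excerpt of the listing rather than live output. The sketch also assumes each entry fits on one line, so it would truncate descriptions that the console wraps onto a second line.

```python
def parse_listconf(output):
    """Map configuration id -> description from 'tikal -listconf' text."""
    configs = {}
    for line in output.splitlines():
        line = line.strip()
        # Entries look like "- okf_tmx = Some description"; skip everything else.
        if line.startswith("- ") and " = " in line:
            config_id, description = line[2:].split(" = ", 1)
            configs[config_id.strip()] = description.strip()
    return configs


# Short excerpt of the listing, used here instead of running tikal.
sample = """\
List of all filter configurations available:
- okf_tmx = Configuration for Translation Memory eXchange (TMX) documents.
- okf_table_csv = Comma-separated values, optional header with field names.
- okf_table_src-tab-trg = 2-column (source + target), tab separated files.
"""

table_configs = {
    config_id: description
    for config_id, description in parse_listconf(sample).items()
    if config_id.startswith("okf_table")
}
for config_id, description in table_configs.items():
    print(config_id, "=", description)
```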
So, to perform the conversion (assuming the file to convert is called corpus-ONU-eng-spa.txt), we type:
tikal -2tmx corpus-ONU-eng-spa.txt -sl en -tl es -fc okf_table_src-tab-trg
The system then carries out the conversion and prints:
-------------------------------------------------------------------------------
Okapi Tikal - Localization Toolset
Version: 2.0.23
-------------------------------------------------------------------------------
Conversion to TMX
Source language: en
Target language: es
Default input encoding: windows-1252
Filter configuration: okf_table_src-tab-trg
Output: corpus-ONU-eng-spa.txt.tmx
Input: /D:/okapi/corpus-ONU-eng-spa.txt
Done in 3.568s
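The same invocation can also be scripted, for example to convert several files in one go. The Python sketch below only assembles the argument list from the options used in this post (-2tmx, -sl, -tl, -fc). The function names build_tikal_command and run_conversion are hypothetical. Actually running the command assumes that tikal is on the PATH. On Windows, where the launcher is a script rather than an executable, this may need the full script name or shell=True.

```python
import subprocess


def build_tikal_command(input_file, src_lang, tgt_lang, filter_config):
    """Assemble the argument list for a tab-delimited to TMX conversion."""
    return [
        "tikal", "-2tmx", input_file,
        "-sl", src_lang,
        "-tl", tgt_lang,
        "-fc", filter_config,
    ]


def run_conversion(command):
    """Run the command; raises CalledProcessError if tikal reports a failure."""
    subprocess.run(command, check=True)


command = build_tikal_command(
    "corpus-ONU-eng-spa.txt", "en", "es", "okf_table_src-tab-trg"
)
print(" ".join(command))
# run_conversion(command)  # uncomment once tikal is installed and on the PATH
```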
The converted file is called corpus-ONU-eng-spa.txt.tmx
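A quick sanity check after the conversion is to count the translation units in the result and compare that number with the number of lines in the original text file. The Python sketch below does this with the standard library. It assumes the TMX has no XML namespace on its elements, and it runs on a small hand-written sample rather than on Tikal's real output. The function name count_translation_units is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_translation_units(tmx_root):
    """Count the <tu> elements inside the <body> of a parsed TMX document."""
    return len(tmx_root.findall("./body/tu"))


# Hand-written sample with two translation units, for illustration only.
sample_tmx = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="0" segtype="sentence"
          o-tmf="unknown" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
      <tuv xml:lang="es"><seg>Hola mundo</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Good morning</seg></tuv>
      <tuv xml:lang="es"><seg>Buenos días</seg></tuv>
    </tu>
  </body>
</tmx>"""

unit_count = count_translation_units(ET.fromstring(sample_tmx))
print(unit_count)
```

For the real file, the root element could be obtained with ET.parse("corpus-ONU-eng-spa.txt.tmx").getroot() and passed to the same function.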