Localizing text files can be very challenging because the internal format of a text file can be almost anything. A file can contain plain text or more structured data. For example, look at the following text file:
usa<tab>United States
germany<tab>Germany
japan<tab>Japan
The file contains three data records, one per line. Each record contains two data items: an id and a value. The items are separated by a tab character. After localization to Japanese, the file will be:
usa<tab>アメリカ
germany<tab>ドイツ
japan<tab>日本
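The transformation above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Sisulizer works internally; the function name `localize_tab_file` and the translation dictionary are hypothetical.

```python
def localize_tab_file(text: str, translations: dict[str, str]) -> str:
    """Replace the value of each id<tab>value record with its translation."""
    out_lines = []
    for line in text.splitlines():
        if "\t" not in line:
            out_lines.append(line)  # not a record; copy unchanged
            continue
        record_id, value = line.split("\t", 1)
        # Fall back to the original value when no translation exists.
        out_lines.append(record_id + "\t" + translations.get(record_id, value))
    return "\n".join(out_lines)


source = "usa\tUnited States\ngermany\tGermany\njapan\tJapan"
japanese = {"usa": "アメリカ", "germany": "ドイツ", "japan": "日本"}
localized = localize_tab_file(source, japanese)
print(localized)
```

The id acts as the lookup key, so the ids stay untouched and only the values change.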
A more complex file format might be like this:
##usa;12;United States
##USA
Here each record spans three lines. The first line contains the start string (##), the id of the country, some other information, and the name of the country, all separated by semicolons. The second line contains the country code, and the third line is empty.
On the other hand there can be text files that do not contain any structured data but only plain text. For example:
The United States of America is a federal republic of 50 states, located primary on central North America. Germany or the Federal
Republic of Germany is one of the world's leading industrialized countries, located in the heart of Europe. Japan is a country on
the western edge of the Pacific Ocean.
Sisulizer supports both kinds of text files:
Plain text files contain plain text. Sisulizer can either read the complete text as a single string or break the file into segments using segmentation rules.
Sample files (Plain.txt and Plain.slp) can be found in the <sisulizer>\Text\Text directory.
Structured files contain one or more records. Each record contains one or more strings. The data is parsed using text definitions.
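For the plain text case, the difference between reading the file as a single string and breaking it into segments can be illustrated with a naive Python sketch. Sisulizer's actual segmentation rules are not described here, so the rule used below (break after ".", "!" or "?" followed by whitespace) is an assumption, and the name `split_into_segments` is hypothetical.

```python
import re


def split_into_segments(text: str) -> list[str]:
    # Join wrapped lines into one string, then break after sentence-ending
    # punctuation that is followed by whitespace (an assumed, naive rule).
    joined = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", joined) if s]


plain = ("Germany is located in the heart of Europe. Japan is a country on\n"
         "the western edge of the Pacific Ocean.")
segments = split_into_segments(plain)
for segment in segments:
    print(segment)
```

A rule this simple breaks on abbreviations such as "e.g.", which is one reason real segmentation rules are configurable.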
There can be thousands of different formats. In most cases they are proprietary file formats used by a single company or application only. This is why Sisulizer cannot include built-in support for them all. Fortunately, Sisulizer has a powerful method for defining text file formats, called text definitions. Each text definition defines the format of a text file. The definition specifies the item rules, the purpose of the file, and the character encoding.
Let's look at an example. We have a simple text file containing country information. The file is:
// This sample contains country descriptions. Each line contains one record.
// id<tab>value
usa<tab>The United States of America is a federal republic of 50 states, located primary on central North America.
germany<tab>Germany or the Federal Republic of Germany is one of the world's leading industrialized countries, located in the heart of Europe.
japan<tab>Japan is a country on the western edge of the Pacific Ocean.
The file uses single-line comments that start with the // string. Each line contains one record with an id and a value separated by a tab character. How do we specify this format using Sisulizer's text definition rules? Rules define the format of a record. Each record contains two or more values. Possible values are:
|Value||Description||Restriction|
|Ignore||Item contains a value that is ignored.|
|Context||Item contains context value.|
|Original||Item contains original value.||Only when importing to a project|
|Text||Item contains text value. This is either localized or imported depending on the purpose of the text definition.|
|Comment||Item contains a comment value.|
|Count||Item contains a count of the items in the file.||Only in the header fields of binary files|
|String length||Item contains the length of the string in characters if string is encoded as UTF-16. Otherwise this contains length of the string in bytes.||Only in the header fields of binary files|
|String size||Item contains the length of the string in bytes.||Only in the header fields of binary files|
There must be one context value and at least one text value in a record. Optionally, a record can contain any number of ignored values and one comment value. For each value you add a rule to the text definition. Each rule can have a before expression and/or an after expression that defines the characters that start or end the value. A rule can have both expressions, only one of them, or neither. A before expression is not needed if the rule is the first rule or the previous rule has an after expression. An after expression is not needed if the next rule has a before expression.
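These constraints can be written down as a small consistency check. The sketch below is not part of Sisulizer; the `Rule` class and `validate_rules` function are hypothetical names, and reading "one context value" as "exactly one" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    kind: str                     # "ignore", "context", "text", "comment", ...
    before: Optional[str] = None  # expression that starts the value
    after: Optional[str] = None   # expression that ends the value


def validate_rules(rules: list[Rule]) -> list[str]:
    problems = []
    kinds = [r.kind for r in rules]
    if kinds.count("context") != 1:
        problems.append("exactly one context value is required")
    if kinds.count("text") < 1:
        problems.append("at least one text value is required")
    if kinds.count("comment") > 1:
        problems.append("at most one comment value is allowed")
    # Every boundary between two adjacent values needs a delimiter: either
    # the earlier rule's after expression or the later rule's before expression.
    for i in range(1, len(rules)):
        if rules[i - 1].after is None and rules[i].before is None:
            problems.append(f"no expression separates rule {i} from rule {i + 1}")
    return problems


good = [Rule("context", after=r"\t"), Rule("text", after=r"\r\n|\z")]
bad = [Rule("context"), Rule("text")]
print(validate_rules(good))
print(validate_rules(bad))
```

The boundary check captures why a before expression is optional when the previous rule already ends with an after expression: one delimiter per boundary is enough.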
For the regular expression syntax, see the documentation at http://icu.sourceforge.net/userguide/regexp.html
Our sample has the context as the first value in each row, so the first rule holds the context value. It needs no before expression because it is the first value in the row. The text value is separated from the context by a tab character, so the after expression of the first rule is "\t". The second rule holds the text value. Because the first rule has an after expression, the second rule needs no before expression. The text value ends at either a new line or the end of the file, so its after expression is "\r\n|\z". The definition is now complete and contains two rules: a context rule with the after expression "\t" and a text rule with the after expression "\r\n|\z".
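To see how the context and text rules cut a file into records, here is a rough Python imitation. It is a sketch of the idea only, not Sisulizer's parser. Sisulizer uses ICU regular expressions; Python's `re` module is substituted here, and because older Python versions do not accept `\z`, the equivalent `\Z` (end of input) stands in for it. The names `RULES` and `parse_records` are hypothetical, and comments are not handled.

```python
import re

# Each rule is (value type, compiled after expression).
# \Z is Python's spelling of "end of input" (ICU writes \z).
RULES = [
    ("context", re.compile(r"\t")),
    ("text", re.compile(r"\r\n|\Z")),
]


def parse_records(data: str) -> list[dict[str, str]]:
    records = []
    pos = 0
    while pos < len(data):
        record = {}
        for value_type, after in RULES:
            match = after.search(data, pos)
            if match is None:
                return records  # malformed tail; stop parsing
            # The value runs from the current position to the delimiter.
            record[value_type] = data[pos:match.start()]
            pos = match.end()
        records.append(record)
    return records


data = "usa\tUnited States\r\ngermany\tGermany\r\njapan\tJapan"
for record in parse_records(data):
    print(record["context"], "=>", record["text"])
```

The last record has no trailing line break, which is exactly the case the end-of-input alternative in the second after expression covers.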
We could write expressions that handle comments, but that would make the expressions much more complicated. This is why text definitions have built-in support for comments. You can specify two kinds of comments.
A line comment starts with a comment string and ends at the new line. For example:
// This is a comment
In this sample the line comment string is "//".
A block comment starts with a start comment string and ends with an end comment string. The comment can span several lines. For example:
/* First line
Second line */
In this sample the start comment string is "/*" and the end comment string is "*/".
If you set either kind of comment string, Sisulizer skips comments whenever your expressions do not handle them. If you have built comment handling into the expressions, the expressions are used instead. In most cases it is easier to leave comments out of the expressions and enter the line or block comment strings.
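The effect of the two comment settings can be approximated with a short Python sketch. It is an illustration under assumptions, not Sisulizer's implementation: the function name `strip_comments` is hypothetical, and a line comment is assumed to occupy a whole line, as in the country sample.

```python
import re


def strip_comments(data: str, line_comment: str = "//",
                   block_start: str = "/*", block_end: str = "*/") -> str:
    # Remove block comments first; DOTALL lets them span several lines.
    block = re.escape(block_start) + r".*?" + re.escape(block_end)
    data = re.sub(block, "", data, flags=re.DOTALL)
    # Drop whole lines that begin with the line comment string.
    kept = [line for line in data.splitlines()
            if not line.lstrip().startswith(line_comment)]
    return "\n".join(kept)


sample = ("// id<tab>value\n"
          "usa\tUnited States\n"
          "/* First line\nSecond line */\n"
          "japan\tJapan")
cleaned = strip_comments(sample)
print(cleaned)
```

Removing comments before the rules run keeps the before and after expressions as simple as the two-rule country definition.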
Sisulizer can also ignore a certain number of lines at the beginning of the file. By default no lines are ignored. However, some files contain fixed header data at the beginning. In that case you can set the ignored lines count to match the header size so that Sisulizer skips those lines.
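In code terms, the ignored lines count simply drops a fixed number of leading lines before any rule is applied. A minimal Python sketch follows; the function name `skip_header_lines` and the sample header are hypothetical.

```python
def skip_header_lines(data: str, ignored_lines: int = 0) -> str:
    # keepends=True preserves the original line terminators.
    lines = data.splitlines(keepends=True)
    return "".join(lines[ignored_lines:])


with_header = "COUNTRY FILE\nversion 1\nusa\tUnited States\n"
body = skip_header_lines(with_header, ignored_lines=2)
print(body, end="")
```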
Sample files (Country.txt and Country.slp) can be found in the <sisulizer>\Text\Text directory. The following picture shows how the definition looks in the text source dialog after you have configured it.