Yes, but the results may not immediately be what you want. Leximancer does not perform automatic translation, so with normal text data, concepts from different languages that mean the same thing are not merged automatically. The resulting map will have largely separate concept clusters for each language. You can merge hub concepts manually across languages, and if you merge enough of them, the language clusters will merge.
To map data from more than one language in the same project, you must do two things:
- specify the language for each data set when it is selected for the project; and
- load a stoplist in the stoplist editor (using the ‘Load language’ button at the top) for each of the additional languages.
To automatically add the stop-list for a document language, the language should be selected in the dropdown language list when a source text document is added to a project, before any other steps in the project have been run.
If the language is not selected before the project is run, Leximancer sets a default set of stop-words which is usually for English. No further automatic changes are made to the stop-word list.
To add a new language or update specific stop-list terms after a project has run any steps, the stop-list editor dialogue should be used. This dialogue is available from the Text Processing Settings dialogue by pressing the Edit stoplist button. New stop-list languages may be added from there.
There is a special form of data which will cause cross-lingual concepts to be discovered automatically, which can be very interesting. The data required is called an interlinear translation - each sentence in one language has the translation of that sentence into another language immediately after it. To process this sort of data, you need to create a special multi-language stoplist where all the stop words share the same literal language code in the list.
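As an illustration of the interlinear layout described above, here is a minimal Python sketch that interleaves two aligned sentence lists so each sentence is immediately followed by its translation. The function name `build_interlinear` and the one-sentence-per-line layout are assumptions made for this example; the exact file layout Leximancer expects is not specified here.

```python
def build_interlinear(source_sentences, target_sentences):
    """Return text in which each source sentence is followed by its translation.

    Assumption: the two lists are already aligned one-to-one, and one
    sentence per line is an acceptable layout for the data file.
    """
    if len(source_sentences) != len(target_sentences):
        raise ValueError("sentence lists must be aligned one-to-one")
    lines = []
    for src, tgt in zip(source_sentences, target_sentences):
        lines.append(src.strip())  # sentence in the first language
        lines.append(tgt.strip())  # its translation, immediately after
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration only.
english = ["The cat sleeps.", "The dog barks."]
french = ["Le chat dort.", "Le chien aboie."]
interlinear_text = build_interlinear(english, french)
```

The resulting text could then be saved as a single data file and added to a project in the usual way.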
Leximancer has support for multiple languages built in.
You must select the language code for each data file when you drag it into the text selection panel. If you hover the mouse over the codes in the list, you can see the full name of each language. You may need to change the Charset (character encoding) from the default utf-8. This character encoding is an attribute of your data.
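If you are unsure which character encoding a data file uses, one option is to convert it to UTF-8 before adding it to a project, so the default Charset setting applies. The sketch below uses only Python's built-in codecs; the helper name `reencode_to_utf8` and the sample bytes are assumptions for this example and are not part of Leximancer.

```python
def reencode_to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Decode bytes using the stated charset and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_charset,
    which is a useful hint that the charset guess was wrong.
    """
    text = raw.decode(source_charset)
    return text.encode("utf-8")


# Hypothetical sample: text saved by a Windows editor as WINDOWS-1252.
sample = "Müller café".encode("windows-1252")
converted = reencode_to_utf8(sample, "windows-1252")
```

Note that some single-byte encodings will decode almost any byte sequence without error, so a successful decode does not prove the charset is correct; it is still worth checking the converted text by eye.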
There are several considerations for using languages other than English. These are summarized below.
Stop Word Removal
Selecting the language when you drag across each data file or folder will activate a stop list for those languages. You may also need to change the Charset (character encoding) setting next to the language setting for some data sets. Character encoding is an attribute of your data.
Upper Case Words
In Leximancer, name-like concepts are identified by upper case. In Leximancer maps and analysis, name-like concepts are not treated very differently from word-like concepts. However, identifying names by capitalization may be undesirable in some languages (e.g., German), where all nouns are capitalized. If you wish to disable the identification of proper names, deselect the pre-processor stage setting called Identify Name-like Concepts.
Leximancer also includes optional language stemmers for many languages. Just enable Merge Word Variants in the Preprocessor.
You can edit the stop word list in the settings for the Pre-Process stage of project control. There are most likely other stop words you will want to add to our default list. After saving the edited stop word list, you can open it again and use the download button to save your modified stop list to your local disk. You can upload this to other projects using the Upload button in the stop word editor.
The current list of supported languages can be found here. Languages that do not have readily identifiable word spacing cannot be used with Leximancer at this time (e.g., Mandarin).
There is a workaround for this situation. If you manually insert word breaks into the Mandarin text, you can then upload your own stop list into the project if desired. Again, this is not something we provide support for at this time.
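To give a rough idea of what inserting word breaks involves, the following Python sketch applies greedy longest-match segmentation against a small word list and joins the results with spaces. The function `insert_word_breaks` and the four-word lexicon are hypothetical and purely illustrative; real text would need a much larger lexicon or a dedicated segmentation tool, and this approach is not part of Leximancer.

```python
def insert_word_breaks(text, lexicon, max_len=4):
    """Insert spaces between words using greedy forward longest match.

    At each position the longest lexicon entry (up to max_len characters)
    is taken; if nothing matches, a single character is emitted as a word.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return " ".join(words)


# Hypothetical toy lexicon for illustration only.
lexicon = {"我们", "喜欢", "文本", "分析"}
segmented = insert_word_breaks("我们喜欢文本分析", lexicon)
```

Greedy matching is simple but can split ambiguous character sequences incorrectly, so the segmented text should be reviewed before it is used as project data.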
- ISO-8859-1, ISO Latin Alphabet No. 1,
- US-ASCII, American Standard Code for Information Interchange,
- UTF-8, Eight-bit UCS Transformation Format,
- WINDOWS-1252, Windows Western Alphabet,
- MacRoman, Apple Standard Roman,
- UTF-16, Sixteen-bit UCS Transformation Format, byte order identified by an optional byte-order mark,
- UTF-16BE, Sixteen-bit UCS Transformation Format, big-endian byte order,
- UTF-16LE, Sixteen-bit UCS Transformation Format, little-endian byte order,
- WINDOWS-1250, Windows Eastern European,
- WINDOWS-1251, Windows Cyrillic,
- WINDOWS-1253, Windows Greek,
- WINDOWS-1254, Windows Turkish,
- WINDOWS-1257, Windows Baltic,
- ISO-8859-2, ISO Latin Alphabet No. 2,
- ISO-8859-4, ISO Latin Alphabet No. 4,
- ISO-8859-5, Latin/Cyrillic Alphabet,
- ISO-8859-7, Latin/Greek Alphabet,
- ISO-8859-9, ISO Latin Alphabet No. 5,
- ISO-8859-13, ISO Latin Alphabet No. 7,
- ISO-8859-15, ISO Latin Alphabet No. 9,
- KOI8-R, KOI8-R Russian