How to tag and remove sections of research publications

Andrew E Smith

I recently wrote a blog post showing how to quickly analyse and review your EndNote PDF collection with Leximancer. I thought it might be useful to describe some additional tips on how to tag different sections of your articles for analysis, and how to filter out so-called boilerplate based on its distinctive attributes.

How do I add tags to my text documents so that Leximancer automatically knows which section each text sentence belongs to? I need to analyse different sections of the articles separately.

You can do this by using the Dialog Tag functionality, which is also documented in the section on analysing transcripts (pages 122 to 128) of the manual. Later in this article, I will also outline a technique which can allow you to automatically filter much of the bibliography sections without needing to edit every article.

Valid Leximancer Dialog Tags must appear at the start of a paragraph, after a blank line, and must consist of at most three words, each of which must start with an upper-case letter. Don’t use a stop word (e.g. I, Me, You), a single letter, or a number as a tag. The tag must end with a colon followed by white space, followed by one or more associated paragraphs of text.

For example (the bold font is just to show the tags you need to add to your article files - there is absolutely no need to adjust the font in practice):

My Abstract Tag:   Text here....

Text here. Text here...

More text here...

My Background Tag:  Text here. Text here...

Text here....

My Bibliography Tag: Text here.

Text here…

The dialog markers function in Leximancer will automatically code all text segments following such a marker with that tag, until it encounters a new tag. You could use a miscellaneous tag to close out other tagged sections. Please note that for data that includes dialog tags, it is best to use the .doc or .txt text file formats. Dialog Tags in the .docx file format are not parsed as reliably.
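To check your files before loading them, the tag rules and the carry-forward behaviour described above can be approximated in a short Python sketch. This is not Leximancer's own parser: the function names, the regular expression and the three-entry stop word list are assumptions made for illustration, and the real stop word list is certainly longer.

```python
import re

# Assumed stop word list; the article only names I, Me and You as examples.
STOP_WORDS = {"I", "Me", "You"}

# One to three words, each starting with an upper-case letter, followed by
# a colon and then white space (or the end of the paragraph).
TAG_PATTERN = re.compile(r"^([A-Z]\w*(?: [A-Z]\w*){0,2}):(?:\s+|$)")


def find_tag(paragraph):
    """Return the dialog tag that opens this paragraph, or None."""
    match = TAG_PATTERN.match(paragraph)
    if match is None:
        return None
    tag = match.group(1)
    # Reject stop words and single letters; a number never matches the
    # pattern because every word must start with an upper-case letter.
    if tag in STOP_WORDS or len(tag) == 1:
        return None
    return tag


def split_by_tag(text):
    """Pair each paragraph with the most recent tag seen above it."""
    sections = []
    current = None
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = paragraph.strip()
        tag = find_tag(paragraph)
        if tag is not None:
            current = tag
            paragraph = TAG_PATTERN.sub("", paragraph, count=1)
        sections.append((current, paragraph))
    return sections
```

Running split_by_tag over the text of an article file returns a list of (tag, paragraph) pairs, so you can see at a glance whether a tag was missed, for instance because it has four words or a lower-case word in it.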

To enable dialog tag identification in Leximancer, this image shows where to turn on the feature:

Dialog Tagging.png

These tags appear in Leximancer as SPEAKER: tag variables, but they don't have to be used just for dialog. Here are some article section tags extracted from a paper that I modified as described above:

Article tags list.png

On the other hand, if you only want to tag individual sentences, you can simply insert a unique proper name as a code at the start of each sentence.

Is it possible to filter out parts of an article that have some distinctive differences, such as bibliography entries with the usual bibliographic terms in them, WITHOUT editing any of the text data files?

Yes, this is certainly possible with Kill Concepts. As an aside, please don't use Kill Concepts just to remove a concept from the map. It is much too strong a filter for that. Use the Mapping Concepts list to select which concepts and tags appear on the map.

To use Kill Concepts to filter out content, I normally run an automatic map first and may increase the number of automatic concepts to get a richer map. Boilerplate will normally produce a small, tight cluster of concepts towards the edge of the map, like an outlying island. Take note of the concepts that are central to this outlying island - they are distinctive markers for the underlying boilerplate text. Now you can add some or all of these distinctive concepts to the Kill Concepts list. This causes all the text segments that match these concepts to be completely ignored by the coding and indexing engine. You can normally quarantine the majority of bibliography entries by doing this - the concepts that usually work here are such things as Vol, Pp, et, al, eds.
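To make the effect of a Kill Concepts list concrete, here is a minimal Python sketch of the idea. It is only an analogy and not how Leximancer is implemented: plain whole-word matching stands in for concept matching, and the function names and the sample segments are invented for illustration. The marker terms are the ones listed above.

```python
import re

# Marker terms taken from the bibliography example above, lower-cased.
KILL_TERMS = {"vol", "pp", "et", "al", "eds"}


def matches_kill_term(segment, kill_terms=KILL_TERMS):
    """True if the segment contains any kill term as a whole word."""
    words = set(re.findall(r"[a-z]+", segment.lower()))
    return not words.isdisjoint(kill_terms)


def drop_killed_segments(segments, kill_terms=KILL_TERMS):
    """Keep only the segments that match none of the kill terms."""
    return [s for s in segments if not matches_kill_term(s, kill_terms)]


# Invented sample segments: one bibliography-style entry, one body sentence.
segments = [
    "Author, A. and Other, B. (eds) Example Title, Vol 1, pp 10-20.",
    "The interview data showed three recurring themes.",
]
kept = drop_killed_segments(segments)
```

In this sketch only the second segment is kept. It also shows why such a filter is too strong for merely tidying the map: a whole segment is discarded as soon as it matches one marker.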

Importantly, you can also use Kill Concepts, or their reverse, called Required Concepts, to ignore or exclusively analyse the sectional SPEAKER tags that I described in the first part of this article.


Andrew Smith.

Leximancer Pty Ltd, Brisbane, Australia, ACN: 116 218 109