Why isn’t the biggest theme circle on the Concept Map the top one in the associated Thematic Summary report tab?
A Leximancer ‘Theme’ is a group or cluster of Concepts that have some commonality or connectedness, as seen from their close proximity on the Concept Map. The size of a Theme circle has no bearing on the Theme’s prevalence or importance in the text; the circles are merely boundaries. Prevalence is determined by the number of Concepts present in the Theme, and this is indicated in the Thematic Report. The histogram bars in the Thematic Report are color-coded (hot to cold) to further signify the prevalence of the Theme, and this color is carried through to the Theme circle boundary color. The number, or granularity, of Theme circles on the Concept Map can be varied using the ‘slider’ beneath the Concept Map.
The dashboard lets you divide concepts and tags into dependent variables (categories) and independent variables (attributes). The report lists both relative frequencies for each combination and calculates a score we call prominence. Prominence is defined as the joint probability divided by the product of the marginal probabilities. A prominence score greater than 1.0 indicates that the co-occurrence happens more often than chance (i.e., the items are not independent).
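The prominence definition above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not Leximancer's own code: the function name, the parameter names, and the choice of text segments as the counting unit are all assumptions made for the example.

```python
def prominence(joint_count, category_count, attribute_count, total_segments):
    """Joint probability divided by the product of the marginal probabilities.

    All arguments are hypothetical counts of text segments:
      joint_count      segments coded with both the category and the attribute
      category_count   segments coded with the category
      attribute_count  segments coded with the attribute
      total_segments   all segments in the data set
    """
    # (joint / N) / ((cat / N) * (attr / N)) simplifies to joint * N / (cat * attr)
    return (joint_count * total_segments) / (category_count * attribute_count)


# Made-up counts: 20 joint segments, 100 and 50 marginal, 1000 in total.
score = prominence(20, 100, 50, 1000)
print(score)  # 4.0, i.e. the pair co-occurs four times more often than chance
```

If the two items were independent, the joint probability would equal the product of the marginals and the score would be exactly 1.0.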
There is a View Log button at the bottom right of the screen when the Project Control Panel is displayed. Click on this button and everything Leximancer has done will be displayed.
You can see options set and many statistics such as terms processed. These statistics are often important to include in a results section of a research paper.
In the Themes tab, in the thematic summary, what do the Connectivity Percent and Relevance represent?
Themes on a Leximancer Concept Map are heat-mapped, meaning that hot colors (red, orange) denote the most important themes, and cool colors (blue, green) denote those that are less important. The Thematic Summary includes a connectivity score to indicate the relative importance of the themes (the most important is the top Theme at 100%). This score is calculated from the connectedness of the concepts within that theme, giving us a way to measure the importance of a theme within the dataset.
Don’t forget you can choose the % theme size by using the middle slider on the bottom of the Map.
For instance, if a concept called technology is included in the list and has a relevance of 19%, how do I interpret that?
Relevance is just the percentage frequency of text segments which are coded with that concept, relative to the frequency of the most frequent concept in the list. Thus, the most frequent concept will always be 100%. This does not mean all text segments contain that concept. The Relevance percentages of the other concepts are calculated by dividing each concept’s count by the count of the top-occurring (100% Relevance) concept.
This measure is an indicator of the relative strength of a concept’s frequency of occurrence.
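As a small worked example of the Relevance calculation described above, the following Python sketch divides each concept's count by the count of the most frequent concept. The function name and the counts are invented for illustration; only the arithmetic comes from the description.

```python
def relevance(counts):
    """Return each concept's count as a percentage of the top concept's count."""
    top = max(counts.values())
    return {concept: 100.0 * count / top for concept, count in counts.items()}


# Hypothetical counts of coded text segments per concept.
counts = {"service": 200, "staff": 100, "technology": 38}
print(relevance(counts))
# {'service': 100.0, 'staff': 50.0, 'technology': 19.0}
```

Here a Relevance of 19% for technology means it was coded in 19% as many text segments as the most frequent concept, not that it appears in 19% of all text segments.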
If I select a concept in the Concept Tab, I see a list of related concepts and counts. What does the likelihood percentage mean?
For instance, suppose the concept physics was clicked and the related concept list contained the concept Smith, which occurs 27 times and has a Likelihood score of 36%. What does the 36% mean?
It means that 36% of the text segments with Smith also contain physics. This statistic complements the count statistic, to give both directions of conditional probability.
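The direction of this conditional probability is easy to get backwards, so here is a minimal Python sketch using a toy data set. The segments, the function name, and the representation of a coded segment as a set of concept names are all assumptions for illustration.

```python
def likelihood(segments, selected, related):
    """Percentage of segments containing `related` that also contain `selected`."""
    with_related = [s for s in segments if related in s]
    with_both = [s for s in with_related if selected in s]
    return 100.0 * len(with_both) / len(with_related)


# Each hypothetical text segment is the set of concepts coded in it.
segments = [
    {"physics", "Smith"},
    {"Smith"},
    {"physics"},
    {"physics", "Smith"},
    {"Smith", "chemistry"},
]

# physics is the selected concept, Smith is a row in its related concept list.
print(likelihood(segments, selected="physics", related="Smith"))  # 50.0
```

In this toy data Smith appears in four segments, two of which also contain physics, so the Likelihood shown next to Smith would be 50%. Swapping the two arguments gives the other direction of the conditional probability.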
In the Thesaurus tab, what does the score next to the Thesaurus terms stand for and how is it calculated? Why would information like this be significant?
As a concrete example, suppose the thesaurus concept protester has an associated word, aspire, with a score of 2.86.
That would mean that the word aspire appears much more often in the same context block as other highly ranked terms from the concept protester than it does in other context blocks. It is a measure of how tightly the word aspire is bound to the family of other terms that makes up protester. The default context block is 2 sentences, but not crossing paragraph boundaries.
A high relevancy word may not be frequent, but if it appears, it usually appears with other strong words for its concept.
In the thesaurus browser, you can click on the drill down button to see where the word is used in the context of the concept.
Another way of looking at it is as a measure of how strongly the presence of the word aspire predicts the concept protester. This statistic is derived from word co-occurrence information in the actual text data you are analyzing. If you want to read more about it, try finding the weighted Naive Bayes classifier algorithm as discussed in Salton: Automatic Text Processing. The Leximancer algorithm is largely based on that, but with improvements for our seeded classifier approach.
Leximancer uses its patented algorithm to rank the concepts by connectedness (summed co-occurrence with all other concepts). The algorithm then starts at the top of the ranking and creates a theme group centered on the top concept. It then moves to the next-ranked concept and does one of two things:
If the concept is near enough to an existing theme group centroid on the map, it joins the nearest theme and the centroid of that theme is adjusted.
Otherwise, it starts a new theme group centered on that concept.
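The grouping steps above can be sketched as a simple greedy procedure. The Python below is a rough illustration only and is not Leximancer's patented algorithm: the Euclidean distance, the fixed `radius` threshold for "near enough", the running-mean centroid update, and all names and data are assumptions.

```python
import math


def group_themes(positions, connectedness, radius):
    """Greedy theme grouping over 2D concept positions (illustrative sketch)."""
    # Rank concepts from most to least connected.
    ranked = sorted(positions, key=lambda c: connectedness[c], reverse=True)
    themes = []
    for concept in ranked:
        x, y = positions[concept]
        # Find the nearest theme centroid that lies within the radius, if any.
        nearest, best = None, None
        for theme in themes:
            d = math.dist((x, y), theme["centroid"])
            if d <= radius and (best is None or d < best):
                nearest, best = theme, d
        if nearest is None:
            # No theme is near enough: start a new one centered on this concept.
            themes.append({"name": concept, "members": [concept], "centroid": (x, y)})
        else:
            # Join the nearest theme and move its centroid to the new mean.
            nearest["members"].append(concept)
            n = len(nearest["members"])
            cx, cy = nearest["centroid"]
            nearest["centroid"] = (cx + (x - cx) / n, cy + (y - cy) / n)
    return themes


# Hypothetical map positions and connectedness scores.
positions = {"a": (0, 0), "b": (1, 0), "c": (10, 10), "d": (10, 11)}
connectedness = {"a": 10, "b": 5, "c": 8, "d": 2}
for theme in group_themes(positions, connectedness, radius=3):
    print(theme["name"], theme["members"], theme["centroid"])
```

In this sketch a larger `radius` produces fewer, broader themes, which is loosely analogous to changing theme granularity with the slider beneath the Concept Map.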
One of the principal aims of Leximancer is to quantify the relationships between concepts (i.e. the co-occurrence of concepts), and to represent this information in a useful manner (in a concept map) that can be used for exploring the content of the documents. The concept map can be thought of as a bird’s eye view of the data, illustrating the main features (i.e. concepts) and how they interrelate.
The mapping phase generates a two dimensional projection of the original high dimensional co-occurrence matrix between the concepts. This is a difficult problem, and one which does not necessarily have a unique solution. The resulting cluster map is similar to Multi-Dimensional Scaling, or MDS, but actually uses the relative co-occurrence frequencies as relationship strengths which leads to asymmetric relationships between entities. An asymmetric matrix of relationships cannot be dealt with by standard MDS. It must be emphasised that the process of generating this map is stochastic. Concepts on the map may settle in different positions with each generation of a new map. In understanding this, consider that concepts are initially scattered randomly throughout the map space. If you imagine the space of possible map arrangements as a hilly table top, and you throw a marble from a random place on the edge, the marble could settle in different valleys depending on where it starts. There may be multiple ‘shallow valleys’ (local minima) in the map terrain if words are used ambiguously and the data is semantically confused. In this case the data should not form a stable pattern anyway.
Another possibility is that some concepts in the data should in fact be stop words, but aren’t in the list. An example of this is the emergence of the concept ‘think’ in interview transcripts. This concept is often bleached of semantic meaning and used by convention only. The technical result of the presence of highly connected and indiscriminate concept nodes is that the map loses differentiation and stability. The over-connected concept resembles a mountain which negates the existence of all the valleys in the terrain. To fix this, remove the over-connected concept.
The practical implication is that for a strict interpretation of the cluster map, the clustering should be run several times from scratch and the map inspected on each occasion. If the relative positioning of the concepts is similar between runs, then the cluster map is likely to be representative. Note that rotations and reflections are permitted variations. If the map changes in gross structure, then revision of some of the parameters is required.
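One way to make the "similar between runs" comparison concrete is to compare pairwise distances between concepts, since these do not change when a map is rotated or reflected. The Python sketch below assumes you have noted down (x, y) positions for the same concepts from two runs; the coordinates, function names, and the scale normalization are assumptions, and this is not a Leximancer feature.

```python
import math
from itertools import combinations


def pairwise_distances(layout):
    """Distances between every pair of concepts, in a fixed (sorted) order."""
    names = sorted(layout)
    return [math.dist(layout[a], layout[b]) for a, b in combinations(names, 2)]


def layout_difference(run_a, run_b):
    """Mean absolute difference between scale-normalized pairwise distances.

    Both layouts must contain the same concept names. A value near 0 means
    the relative positioning is the same up to rotation, reflection, and scale.
    """
    da, db = pairwise_distances(run_a), pairwise_distances(run_b)
    mean_a, mean_b = sum(da) / len(da), sum(db) / len(db)
    return sum(abs(x / mean_a - y / mean_b) for x, y in zip(da, db)) / len(da)


# Hypothetical positions from three runs of the same project.
run_1 = {"a": (0, 0), "b": (1, 0), "c": (0, 2)}
run_2 = {"a": (0, 0), "b": (0, 1), "c": (-2, 0)}  # run_1 rotated by 90 degrees
run_3 = {"a": (0, 0), "b": (5, 0), "c": (0, 1)}   # a different gross structure

print(layout_difference(run_1, run_2))  # close to 0
print(layout_difference(run_1, run_3))  # clearly larger
```

Visual inspection remains the primary check; a number like this is only a rough aid when comparing many runs.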
In concept mapping, there is one setting to choose, namely whether to use a Topical or Social map. The Social map has a more circular symmetry and emphasizes the similarity between the conceptual context in which the words appear. A Social map is best when entities tend to be related to fewer other entities, such as a map made up of many name concepts.
The Topical map, by comparison, is more spread out, emphasizing the co-occurrence between items. It tends to emphasize differences and direct relationships, and is best for discriminant analysis. The Topical map is also much more stable for highly connected entities, such as topics. The most common reason for cluster instability is that the concepts on the map are too highly connected, and no strong pattern can be found. The Topical variant of the clustering algorithm produces more stability in maps of this kind, so switching to this setting will often stabilize the map. However, the most important settings which govern the connectedness of the map are the classification thresholds and the size of the coded context block, which are located in the Classification Settings in the Locate Concepts node. If the coded context block is too large, or the classification threshold is too low, then each concept will tend to be related to every other concept. If you have some highly connected concepts which are effectively bleached of meaning in your data, removing them from the concept lists in the Concept Seeds Editor will often stabilize the map. Words such as ‘sort’, ‘think’, and ‘kind’ often appear in spoken transcripts and may be used as filler words which are essentially stop words. Inspect the actual text locations to check the way words like these are being used before removing them.
In summary, the Topical clustering algorithm is more stable than the Social, but will discover fewer indirect relationships. The cluster map should be considered as indicative and should be used for generating hypotheses for confirmation in the text data. It is not a quantitative statement of fact.
In several places in Leximancer I see document names appended with something like ~1.html 7_1. What do these numbers mean?
The number after the ~ indicates the text surrogate in which Leximancer has placed this particular piece of the document’s plain text. Leximancer breaks larger documents into a collection of plain text surrogates for display purposes.
The number before the underscore is the document section number. For text documents such as Microsoft Word or PDF documents, it is the document section, which will be 1.
For comma- or tab-delimited (CSV/TSV) spreadsheet data, however, a section is a single free text cell, so the total number of sections will be the total number of free text cells in the spreadsheet.
The number after the underscore is the sentence number within the section.
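If you need to pull these numbers out of exported results, a regular expression along the following lines may help. This Python sketch assumes labels always follow the pattern seen in the example above (~N.html, a space, then section_sentence); the file name `survey.csv` and the function name are hypothetical.

```python
import re

# Assumed pattern, based only on the example "~1.html 7_1".
LOCATION_PATTERN = re.compile(r"~(\d+)\.html\s+(\d+)_(\d+)")


def parse_location(label):
    """Split a location label into surrogate, section, and sentence numbers."""
    match = LOCATION_PATTERN.search(label)
    if match is None:
        raise ValueError(f"unrecognized location label: {label!r}")
    surrogate, section, sentence = (int(group) for group in match.groups())
    return {"surrogate": surrogate, "section": section, "sentence": sentence}


print(parse_location("survey.csv~1.html 7_1"))
# {'surrogate': 1, 'section': 7, 'sentence': 1}
```

For spreadsheet data, the section number in this example would identify the seventh free text cell, and the sentence number the first sentence within that cell.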