
Methods for Sequence Analysis

Sequence analysis treats a whole ordered trajectory (a career, a patient's clinical course, or a firm's string of corporate actions) as the unit of analysis, and asks which orderings and timings cluster together.

In short

Sequence analysis encodes each case as an ordered string of states or events, measures pairwise dissimilarity between cases (optimal matching and its variants), then clusters or MDS-maps the resulting distance matrix into a typology of trajectories. For finance it complements the event study by capturing the order and timing of moves, not just isolated announcements: an event study prices one move in isolation, while sequence analysis asks whether the pattern of moves itself carries information.

State sequences vs event sequences

The first modelling choice is what a "position" in the sequence means. A state lasts: it occupies an interval and has a duration. An event occurs: it happens at a point in time and does not last. The distinction decides which methods apply, because conventional sequence analysis, including almost all of the TraMineR toolkit, is built for state sequences, and its dominant goal is a typology of observed trajectories. Event-sequence methods (event-sequence mining, transition analysis) exist but form a smaller literature. Andrew Abbott, whose review "Sequence Analysis: New Methods for Old Ideas" (Abbott 1995) consolidated the approach, imported optimal matching from biological sequence alignment into the social sciences precisely to treat whole ordered careers as the unit of analysis rather than decomposing them into isolated transitions.

State sequences
An ordered list of the states occupied at successive, usually evenly spaced, positions: monthly employment status over a career, marital status, courses completed. States last and have a duration, the grid is regular, and the goal is a trajectory typology. This is the sequence-analysis mainstream and what nearly all of TraMineR targets.

Event sequences
An ordered list of events tied to specific time points, with no duration: a firm's acquisition, then a layoff, then a dividend cut. Timing is typically irregular and the methods (event mining, transition analysis) are a smaller, distinct literature. Any state sequence can be re-expressed as the event sequence of its state changes, and back again under assumptions, but the metrics differ.
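
The round trip between the two views can be sketched in a few lines of Python. This is an illustration, not TraMineR code: the helper names to_events and to_states are hypothetical, positions are 0-indexed, and the reverse step rests on the assumption that a state persists until the next recorded change.

```python
def to_events(states):
    """List each change of state as (position, old_state, new_state)."""
    return [
        (pos, states[pos - 1], states[pos])
        for pos in range(1, len(states))
        if states[pos] != states[pos - 1]
    ]


def to_states(initial_state, events, length):
    """Rebuild the state sequence, assuming a state persists until the next event."""
    states = [initial_state] * length
    for pos, _old, new in events:
        # Everything from the event onward takes the new state (later events overwrite it).
        for later in range(pos, length):
            states[later] = new
    return states


sequence = list("FFEEEDD")
events = to_events(sequence)
rebuilt = to_states(sequence[0], events, len(sequence))

print(events)               # [(2, 'F', 'E'), (5, 'E', 'D')]
print(rebuilt == sequence)  # True
```

The event view keeps only the changes and their positions. The initial state and the observation length have to be stored separately, which is the "under assumptions" part of the conversion.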

Alphabet. The alphabet is the finite set of distinct states a sequence may take. Short labels are used for printing and plotting, while longer descriptive labels are carried as a separate legend attribute. In the distance-learning example below, the alphabet is the six states {A, B, C, D, E, F}.

Encoding: alphabet, STS and SPS

A worked example fixes the formats. Track six students in a distance-learning programme over 20 time periods, where the state records how many courses each has completed: F, E, D, C, B, A correspond to 0, 1, 2, 3, 4, 5 courses completed. One student who steadily progresses produces a row that can be written two ways. The STS (State-Sequence) format is the most intuitive layout: one row per case, one column per time position in chronological order, each cell holding the state at that position. This is what an index plot draws directly. The SPS (State-Permanence-Sequence) format writes each distinct successive spell once together with its duration, compressing runs and making durations explicit, which is the natural input to spell-based distances such as OMspell and OMslen.

STS  F-F-F-E-E-E-D-D-D-D-C-C-C-B-B-B-A-A-A-A

SPS  (F,3)-(E,3)-(D,4)-(C,3)-(B,3)-(A,4)
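
The two layouts are related by run-length encoding. The following Python sketch reproduces the SPS line from the STS line and back; it stands in for the idea only (sts_to_sps and sps_to_sts are hypothetical helper names, and missing values and censoring are ignored).

```python
from itertools import groupby


def sts_to_sps(states):
    """Compress runs of identical states into (state, duration) spells."""
    return [(state, sum(1 for _ in run)) for state, run in groupby(states)]


def sps_to_sts(spells):
    """Expand (state, duration) spells back into one state per position."""
    return [state for state, duration in spells for _ in range(duration)]


sts = "F-F-F-E-E-E-D-D-D-D-C-C-C-B-B-B-A-A-A-A".split("-")
sps = sts_to_sps(sts)

print("-".join(f"({state},{duration})" for state, duration in sps))
# (F,3)-(E,3)-(D,4)-(C,3)-(B,3)-(A,4)
print(sps_to_sts(sps) == sts)  # True
```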

TraMineR's seqformat() converts among STS, SPS and other layouts (SPELL or episode data, with one row per spell and its start and end, and person-period, with one row per case-time). Encoding decisions (the granularity of the alphabet, the time unit, and how missing positions and right-censoring are treated) materially affect every downstream result, so they belong in the methods writeup rather than an appendix.

Index plot of six learner trajectories: one stacked bar per student, segment colour = state, segment length = duration in that state. Sorting the bars (for example by time to completion) turns visual noise into a readable gradient.

Measuring dissimilarity

Everything downstream hinges on a pairwise dissimilarity matrix \( D \), where \( D_{ij} \) is the distance between sequence \( i \) and sequence \( j \). The measure decides what kind of difference the analysis can see. Studer and Ritschard (2016) organise the choices around three sequence properties: sequencing (the order in which states appear), timing (where in time a state occurs), and duration (how long is spent in each state). Different measures weight these differently, and there is no universally best measure.

Key point

The dissimilarity measure, not the data alone, decides whether the analysis is sensitive to sequencing, timing, or duration. Choose it from the research question and report it explicitly: there is no measure that is correct for every study (Studer and Ritschard 2016).

The workhorse is Optimal Matching (OM), a generalisation of the Levenshtein edit distance to arbitrary operation costs. It is the minimum total cost of the operations needed to turn one sequence into the other, taken over all edit scripts \( E(x,y) \) and computed by dynamic programming:

\[ d_{OM}(x,y) = \min_{e \in E(x,y)} \sum_{k} c(e_k) \]

Two operation types are allowed: substitution, replacing one state with another, and insertion or deletion (indel), which shifts states along the timeline. The result depends entirely on the cost scheme: the substitution-cost matrix sm(p,q), giving the cost of replacing state p with state q (how dissimilar the two states are), and the indel cost of one insert or delete.
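
The dynamic programme behind the formula can be sketched in Python. This is an illustrative implementation under simple assumptions (a single indel cost for every state and a caller-supplied substitution function); om_distance and constant_cost are hypothetical names, and TraMineR's seqdist() remains the tool for real analyses.

```python
def om_distance(x, y, sub_cost, indel):
    """Minimum total cost of turning sequence x into sequence y."""
    n, m = len(x), len(y)
    # dist[i][j] = cheapest way to turn the first i states of x
    # into the first j states of y.
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel          # delete everything
    for j in range(1, m + 1):
        dist[0][j] = j * indel          # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + indel,                             # delete x[i-1]
                dist[i][j - 1] + indel,                             # insert y[j-1]
                dist[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),  # substitute or match
            )
    return dist[n][m]


def constant_cost(p, q, cost=2.0):
    """Uniform substitution cost: 0 for identical states, `cost` otherwise."""
    return 0.0 if p == q else cost


early = "FFEEEDDD"   # moves on one period sooner
late = "FFFEEEDD"
print(om_distance(early, late, constant_cost, indel=1.0))  # 2.0
```

With these costs the cheapest script deletes one F and inserts one D (cost 2) instead of substituting at the two positions where the sequences disagree (cost 4).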

One OM edit. Substituting B for E keeps the position fixed; inserting a state shifts later states in time. The indel-to-substitution ratio is the dial that trades timing against ordering.

The cost scheme is not a detail: it is where the analyst's assumptions live. The ratio of indel to substitution cost governs whether OM aligns by position and timing or by order and sequencing. When indels are cheap relative to substitution, the algorithm prefers to shift states in time (warping), so OM emphasises order; when indels are expensive, it prefers in-place substitution, so OM emphasises contemporaneity and timing. As a rule of thumb, if the indel cost is no more than half the smallest substitution cost, a deletion followed by an insertion never costs more than a substitution, so substitution is never needed and OM reduces to a function of the Longest Common Subsequence. TraMineR's seqcost() generates the schemes: CONSTANT (uniform off-diagonal), TRATE (from observed transition rates, so frequently interchanged states are cheaper to substitute), FUTURE (chi-squared distance between conditional future state distributions), FEATURES (Gower distance between externally coded state features), and INDELS / INDELSLOG (data-driven indel costs from state frequencies); indel = "auto" sets the indel from the substitution matrix.
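
To make the transition-rate idea concrete, here is a Python sketch of costs in the spirit of TRATE. It assumes the commonly documented form, a constant of 2 minus the two observed transition rates between the pair of states, and should be checked against the seqcost() documentation before it is relied on; transition_rates and trate_costs are hypothetical names.

```python
from collections import Counter


def transition_rates(sequences):
    """Estimate P(next = q | current = p) from consecutive positions."""
    pair_counts = Counter()
    origin_counts = Counter()
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            pair_counts[(current, following)] += 1
            origin_counts[current] += 1
    return {
        pair: count / origin_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


def trate_costs(sequences, alphabet):
    """Substitution cost 2 - P(q|p) - P(p|q); 0 on the diagonal (assumed form)."""
    rates = transition_rates(sequences)
    costs = {}
    for p in alphabet:
        for q in alphabet:
            if p == q:
                costs[(p, q)] = 0.0
            else:
                costs[(p, q)] = 2.0 - rates.get((p, q), 0.0) - rates.get((q, p), 0.0)
    return costs


toy_sequences = ["AAB", "ABB", "CCC"]   # made-up data for illustration
costs = trate_costs(toy_sequences, "ABC")
print(round(costs[("A", "B")], 3))  # A and B are exchanged in the data, so cost < 2
print(costs[("A", "C")])            # never exchanged, so the cost stays at 2.0
```

States that follow one another often end up cheap to substitute, and states that never do keep the maximum cost. Whether that is the right notion of similarity is a substantive choice.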

Pitfall. Sequence-analysis results are not robust to the cost scheme. Cheap indels emphasise the order of states; expensive indels emphasise their calendar timing; a constant substitution cost treats every pair of states as equally different, which is rarely true. Always report the substitution matrix sm and the indel cost alongside the typology, because two defensible cost choices can produce two different clusterings of the same data.
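
A small Python experiment illustrates the point. The three toy sequences are made up, the letters are abstract states, and om is a compact memoised variant written for this sketch rather than any library routine. The substitution cost is held at 2 and only the indel cost changes.

```python
from functools import lru_cache


def om(x, y, sub, indel):
    """OM distance with a constant substitution cost `sub` and indel cost `indel`."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j * indel
        if j == 0:
            return i * indel
        step = 0 if x[i - 1] == y[j - 1] else sub
        return min(d(i - 1, j) + indel, d(i, j - 1) + indel, d(i - 1, j - 1) + step)

    return d(len(x), len(y))


reference = "ABCDEF"
shifted = "BCDEFA"   # same order of states, each one period earlier
edited = "ABCCCF"    # same timing, two states replaced in place

for indel in (1, 4):
    print(indel, om(reference, shifted, 2, indel), om(reference, edited, 2, indel))
# 1 2 4   -> cheap indels: the shifted sequence is the nearer neighbour
# 4 8 4   -> expensive indels: the edited sequence is the nearer neighbour
```

The nearest neighbour of the reference flips with the indel cost alone, which is exactly the kind of change that propagates into a different clustering.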

OM is one of a family. The catalogue below is the spine of the method: each measure foregrounds a different one of sequencing, timing and duration, and the right column points to when you would reach for it.

Measure	Family	Most sensitive to	Note
OM	Edit distance	Tunable (order or timing)	Min-cost indel + substitution; the ratio sets the emphasis.
OMloc	Edit distance	Sequencing	Localized OM: indel cost depends on local context, weighting order.
OMslen	Edit distance	Duration	Spell-length-sensitive: costs scaled by spell length.
OMspell	Edit distance (spells)	Duration + sequencing	OM on the SPS spell representation; accounts for spell durations.
OMstran	Edit distance (transitions)	Sequencing	OM on the sequence of transitions, emphasising order of changes.
HAM	Position-wise	Timing	Hamming: substitution only, no indels; equal length, no warping.
DHD	Position-wise	Timing	Dynamic Hamming: position-specific substitution-cost matrices.
TWED	Time-warp edit	Shape (warping-tolerant)	Controlled warping via stiffness + gap-penalty lambda.
LCS / LCP / RLCP	Subsequence	Sequencing	Length of the longest common subsequence, common prefix, or common suffix (RLCP).
NMS / SVRspell	Subsequence	Sequencing	Number of matching subsequences (optionally time-weighted); vectorial representation.
CHI2 / EUCLID	Distribution	Duration / composition	Compare cross-sectional state distributions; ignore exact order.

Synthesising Studer and Ritschard (2016): if the order of events is the hypothesis, use OM with cheap indels, OMstran, or LCS; if exact calendar timing matters, use Hamming or DHD; if duration in states matters, use OMspell, OMslen, CHI2 or EUCLID. Whatever you pick, report sm and the indel.
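
The contrast between the three properties shows up even in toy implementations. The Python sketch below compares a position-wise Hamming count, an LCS-based distance (sequence lengths minus twice the LCS length), and a deliberately simplified composition distance (Euclidean distance between total time spent in each state). The last is a stand-in written for this sketch, not TraMineR's CHI2 or EUCLID, and the sequences are made up.

```python
import math
from collections import Counter


def hamming(x, y):
    """Number of positions at which the two sequences differ (timing)."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


def lcs_length(x, y):
    """Length of the longest common subsequence, row by row."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(x, y):
    """States left unmatched by the longest common subsequence (sequencing)."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def composition_distance(x, y):
    """Euclidean distance between total time per state (duration only)."""
    cx, cy = Counter(x), Counter(y)
    return math.sqrt(sum((cx[s] - cy[s]) ** 2 for s in set(cx) | set(cy)))


base = "AAABBB"
late = "AAAABB"       # same order, the change happens one period later
reversed_ = "BBBAAA"  # same time in each state, opposite order

for name, other in (("late", late), ("reversed", reversed_)):
    print(name, hamming(base, other), lcs_distance(base, other),
          round(composition_distance(base, other), 3))
```

The reversed sequence is as far from the base as possible under Hamming and LCS, yet identical under the composition measure, because time spent in each state is all that measure sees.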

From distances to a typology

The \( n \times n \) dissimilarity matrix is the bridge from sequences to standard multivariate tools. Two complementary routes turn \( D \) into something interpretable: clustering produces a small set of ideal-typical trajectories, while multidimensional scaling produces continuous coordinates. They are often used together: the MDS axes describe the gradients, the clusters name the corners.

Clustering
Apply a distance-based algorithm directly to \( D \): most often Ward agglomerative hierarchical clustering or PAM (partitioning around medoids). Choose the number of clusters with quality indices (average silhouette width ASW, Hubert's C, Point-Biserial correlation) computed by the WeightedCluster package (Studer 2013); Studer (2021) adds parametric-bootstrap validation to guard against reading structure into noise.
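
As an illustration of what a quality index measures, the following Python sketch computes the unweighted average silhouette width for a given partition of a distance matrix. It is not the WeightedCluster implementation (which also handles case weights); the function name and the toy matrix are hypothetical.

```python
def average_silhouette_width(dist, labels):
    """Mean silhouette over all cases for a partition of a distance matrix."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)   # convention: a singleton scores 0
            continue
        # a: mean distance to the other members of the case's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the closest other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)


# Toy distance matrix: cases 0-1 are close, cases 2-3 are close.
D = [
    [0, 1, 6, 7],
    [1, 0, 5, 6],
    [6, 5, 0, 1],
    [7, 6, 1, 0],
]
print(round(average_silhouette_width(D, [0, 0, 1, 1]), 3))  # well separated, near 1
print(round(average_silhouette_width(D, [0, 1, 0, 1]), 3))  # mismatched, negative
```

Running the same function over partitions with different numbers of clusters, and keeping the one with the highest value, is the basic logic of choosing k with ASW.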

Multidimensional scaling
Embed \( D \) into a low-dimensional (often 2-D) space so the cloud of sequences can be plotted. The principal MDS axes frequently carry substantive meaning, for example an early-versus-late timing axis, and can be used as continuous covariates in downstream regressions, instead of or alongside the discrete cluster typology.
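
A rough sketch of classical (metric) MDS, restricted to the first axis and written in plain Python: square the distances, double-centre them, and take the leading eigenvector by power iteration. It assumes the leading eigenvalue is positive and dominant, which can fail for strongly non-Euclidean dissimilarities; a library routine such as R's cmdscale() is the practical choice. The function name is hypothetical.

```python
import math


def first_mds_axis(dist, iterations=200):
    """Coordinates of each case on the first classical-MDS axis."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D^2 * J (D is assumed symmetric).
    b = [
        [-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]
    # Power iteration for the dominant eigenvector of B.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0:
            break
        vec = [v / norm for v in nxt]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    eigenvalue = sum(
        vec[i] * sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    # Scale the unit eigenvector by the square root of its eigenvalue.
    return [math.sqrt(max(eigenvalue, 0.0)) * v for v in vec]


# Three cases that sit at 0, 1 and 3 on a line.
D = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
coords = first_mds_axis(D)
print([round(c, 3) for c in coords])
```

The recovered coordinates reproduce the original spacing up to a shift and a possible sign flip, which is all an MDS axis can identify.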

Visualizing sequences

Sequence analysis is unusually visual: each plot answers a specific question about the cohort, and reading them in sequence is how analysts sanity-check a typology before trusting it. The six standard plots, with their TraMineR function names, each address one question.

Index plot (seqIplot)
One stacked bar per case, coloured by state, length by duration. Answers: what do the individual trajectories look like, and what gradient appears once they are sorted?

Sequence-frequency plot (seqfplot)
The most frequent whole sequences, each a stacked bar with height proportional to frequency. Answers: what are the common end-to-end patterns?

State-distribution plot / chronogram (seqdplot)
The cross-sectional state composition at each position over time. Answers: how does the aggregate mix of states evolve?

Modal-state plot (seqmsplot)
The single most common state at each position. Answers: what is the typical state at each point in time?

Mean-time plot (seqmtplot)
The mean total time spent in each state across the sample. Answers: where does the cohort spend its time overall?

Transversal-entropy plot (seqHtplot)
Shannon entropy of the cross-sectional distribution per position. Answers: when are trajectories diverse (high entropy) versus concentrated (low entropy)?
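
The quantities behind the last few plots are simple cross-sectional summaries. A minimal Python sketch, assuming equal-length sequences without missing values and an entropy normalised by the logarithm of the alphabet size (an assumption about the scaling; the function names are hypothetical and the cohort is made up):

```python
import math
from collections import Counter


def state_distribution(sequences, position):
    """Share of cases in each state at one position (one chronogram column)."""
    counts = Counter(seq[position] for seq in sequences)
    total = sum(counts.values())
    return {state: count / total for state, count in counts.items()}


def modal_state(sequences, position):
    """Most frequent state at one position (ties resolved arbitrarily)."""
    distribution = state_distribution(sequences, position)
    return max(distribution, key=distribution.get)


def transversal_entropy(sequences, alphabet_size, normalise=True):
    """Shannon entropy of the state distribution at every position."""
    entropies = []
    for position in range(len(sequences[0])):
        shares = state_distribution(sequences, position).values()
        h = sum(-p * math.log(p) for p in shares)
        if normalise and alphabet_size > 1:
            h /= math.log(alphabet_size)   # 1.0 means maximal diversity
        entropies.append(h)
    return entropies


cohort = ["AAB", "ABB", "AAB", "ABB"]
print(state_distribution(cohort, 1))
print(modal_state(cohort, 0), modal_state(cohort, 2))
print([round(h, 3) for h in transversal_entropy(cohort, alphabet_size=2)])
```

Everyone starts in A and ends in B, so entropy is zero at both ends and peaks in the middle, where the cohort is split evenly.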

Chronogram (state-distribution plot): the cross-sectional mix of states shifts from "few courses completed" (dark, period 1) to "all completed" (light, period 20). It shows aggregate composition over time, not individual order.

The TraMineR ecosystem

The R toolchain and where each piece fits

TraMineR (Gabadinho, Ritschard, Müller and Studer 2011, Journal of Statistical Software 40(4)) is the core package: define sequence objects (seqdef), convert formats (seqformat), compute distances (seqdist) and costs (seqcost / seqsubm), and render every plot above (the seqplot family).

TraMineRextras is the companion package with additional and experimental functions: extra dissimilarities, event-sequence helpers, and dyadic or polyadic sequence tools.

WeightedCluster (Studer 2013) handles cluster construction and validation on the distance matrix: weighted data, cluster quality indices, and the PAM and Ward workflow tailored to sequences.

For background, the standard texts are Cornwell (2015), Social Sequence Analysis; Aisenbrey and Fasang (2010) on the "second wave"; Abbott (1995) and Abbott and Tsay (2000) on the origins and the optimal-matching debate; and Studer and Ritschard (2016) for the definitive comparison of dissimilarity measures.

Sequences of news: studying corporate-action sequences

A firm's strategic life is a sequence of moves: announcements, acquisitions, divestitures, buybacks, dividend changes, restructurings, leadership changes, guidance revisions. Treating each move in isolation, the classic single-event-study frame, discards the order and timing in which the moves arrive. Sequence analysis instead treats the whole ordered string of moves as the unit of analysis. There are two natural encodings. In the event-sequence view, each corporate action is an event at a date (acquire, then lay off, then cut dividend). In the state-sequence view, the firm's posture is encoded per period (expanding, consolidating, distressed, stable) and the analysis studies trajectories of strategic posture. Both are legitimate, and the choice follows the question.

The payoff is in what sequence analysis sees that an event study cannot. An event study estimates the abnormal return to one move, holding the rest of the world fixed via an expected-return model. It is silent on whether a buyback that follows a dividend cut reads differently from one that precedes it, or whether firms that sequence divestiture-then-acquisition outperform those that acquire-then-divest. Sequence analysis builds a typology of move orderings and lets you ask whether the pattern, not the isolated event, predicts performance, survival, or subsequent abnormal returns.

Key point

The cleanest design treats the two methods as a complementary two-stage pipeline. First, use event studies to measure the market's response to each move, the cumulative abnormal return (CAR), for each action in each firm's history. Second, use sequence analysis to cluster firms by the order and timing of their moves, then test whether the resulting strategy typology explains differences in the event-study CARs: do serial-acquirer sequences, for example, earn lower announcement returns than focused-divestiture sequences? The measure follows the hypothesis: if the order of moves is the question, use sequencing-sensitive OM, OMstran or LCS; if how long a firm holds a strategic posture matters, use spell-based OMspell or OMslen.
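
The second stage can be as simple as comparing CARs across cluster labels. The Python sketch below uses made-up numbers purely for illustration (the CAR values, the cluster names and both function names are hypothetical) and a basic permutation test on the difference in group means; a real study would use regression with controls and an inference method suited to event-study data.

```python
import random
from statistics import mean


def mean_by_cluster(values, labels):
    """Average value within each cluster label."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(members) for label, members in groups.items()}


def permutation_p_value(values, labels, group_a, group_b, draws=5000, seed=0):
    """Share of label shuffles whose mean gap is at least the observed gap."""
    rng = random.Random(seed)

    def gap(current_labels):
        means = mean_by_cluster(values, current_labels)
        return abs(means[group_a] - means[group_b])

    observed = gap(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(draws):
        rng.shuffle(shuffled)
        if gap(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (draws + 1)


# Hypothetical per-firm mean CARs and sequence-cluster labels.
cars = [-0.021, -0.015, -0.030, -0.012, 0.008, 0.014, 0.011, 0.019]
clusters = ["serial_acquirer"] * 4 + ["focused_divestiture"] * 4

group_means = mean_by_cluster(cars, clusters)
p_value = permutation_p_value(cars, clusters, "serial_acquirer", "focused_divestiture")
print(group_means)
print(round(p_value, 3))
```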

Caveats for firm data. Corporate-action histories are irregularly spaced (events do not arrive on a monthly grid), right-censored (the firm is still alive at the end of the window), and built on small, heterogeneous alphabets. These properties favour event-sequence or spell-based encodings and warping-tolerant measures (TWED, OMspell) over rigid position-wise Hamming. State the encoding and the cost scheme explicitly: the typology is only as defensible as those two choices.

References

Abbott, A. (1995). Sequence Analysis: New Methods for Old Ideas. Annual Review of Sociology, 21, 93-113. annualreviews.org
Abbott, A., & Tsay, A. (2000). Sequence Analysis and Optimal Matching Methods in Sociology: Review and Prospect. Sociological Methods & Research, 29(1), 3-33. doi.org/10.1177/0049124100029001001
Aisenbrey, S., & Fasang, A. E. (2010). New Life for Old Ideas: The "Second Wave" of Sequence Analysis Bringing the "Course" Back Into the Life Course. Sociological Methods & Research, 38(3), 420-462. doi.org/10.1177/0049124109357532
Cornwell, B. (2015). Social Sequence Analysis: Methods and Applications. Cambridge University Press. cambridge.org
Gabadinho, A., Ritschard, G., Müller, N. S., & Studer, M. (2011). Analyzing and Visualizing State Sequences in R with TraMineR. Journal of Statistical Software, 40(4), 1-37. doi.org/10.18637/jss.v040.i04
Studer, M. (2013). WeightedCluster Library Manual: A practical guide to creating typologies of trajectories in the social sciences with R. LIVES Working Paper 24. cran.r-project.org
Studer, M., & Ritschard, G. (2016). What matters in differences between life trajectories: a comparative review of sequence dissimilarity measures. Journal of the Royal Statistical Society: Series A, 179(2), 481-511. doi.org/10.1111/rssa.12125
TraMineR seqdist and seqcost reference documentation, University of Geneva. traminer.unige.ch/doc/seqdist.html