1. INTRODUCTION

Amongst cinephiles and physical media collectors, the Criterion Collection has become the definitive standard in world cinema, gaining a reputation for curatorial authority with every title inducted into its ranks. While new titles are inducted every month to maintain catalog vitality, the collection’s growth to over 1,300 titles means many of the most obvious, high-profile titles have already been exhausted. To identify high-value acquisition targets that balance popularity, audience acclaim, and scarcity, I built a unified, filtered dataset and combined it with nearly a decade of experience in the film industry to provide a data-driven roadmap for future releases.

Key Strategic Questions

  • Finding The “Blind Spot”: Which directors and genres show high audience scores, but low collection volume?

  • Assessing the “Market Demand” Gap: Which directors and genres show high popularity scores, but low collection volume?

  • Priority Roadmap: Based on the “Expansion Potential” quadrant, as well as current licensing and availability, which specific titles should be prioritized for induction?

2. PROCURING AND CLEANING THE DATA

Initial Cleaning

Since up-to-date datasets for the Criterion Collection were unavailable on public data repositories such as Kaggle, I had to be more resourceful to find my data. I was able to locate a list on the film social media site Letterboxd that was thorough and completely up-to-date. While the site restricts public downloads of user-created lists, I was able to use my “pro” account member access on Letterboxd to clone the list to my account and download it as a .csv file.

This initial file provided a comprehensive list of titles, though the formatting was non-uniform and the critical metadata (Spine Number, Director, and Boxset, if applicable) was nested within a single “Description” string. Furthermore, the dataset would need additional data points, such as audience metrics, a popularity value, and geographical data, for meaningful analysis.

My primary cleaning phase focused on structural standardization and data extraction. Using Regex, I parsed the description strings to extract the distinct variables, transforming the unstructured text into standardized data values. Additionally, I generated a unique identification code for each title to assist with tracking, as the Spine Numbers are occasionally shared amongst multiple titles.

# Initial Cleaning and Extraction
library(tidyverse)  # dplyr, stringr, tidyr, ggplot2
library(janitor)    # clean_names()
criterion_cleaned <- criterion_raw %>% 
  select(-"Position", -"URL") %>% 
  clean_names() %>% 
  rename(title = name) %>% 
  mutate(
    description = str_remove_all(description, "[•,#]|<i>|</i>"), 
    description = str_replace_all(description, "Spine Number:", "Spine"), 
    across(c(description, title), str_squish)) %>% 
  mutate(
    director = 
      str_extract(description, "(?<=Directors?:\\s).*?(?=\\sSpine)"),
    spine = 
      as.integer(str_extract(description, "(?<=Spine:?\\s)\\d+")),
    boxset = 
      str_extract(description, "(?<=Collection:\\s).*?$")) %>% 
  select(-"description") %>% 
  mutate(boxset = replace_na(boxset, "Standalone")) %>% 
  mutate(movie_id = title %>% 
        str_to_lower() %>% 
        str_replace_all("[[:punct:]]", "") %>% 
        str_replace_all("\\s+", "-") %>% 
        paste(year, sep = "-")) %>% 
  mutate(year = as.integer(year)) %>% 
  relocate("spine")
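To make the extraction step concrete, the sketch below applies the same three patterns, and the same slug logic used for movie_id, to two invented description strings. The layout of these strings is an assumption about what the cleaned Letterboxd text looks like, and make_movie_id is a hypothetical helper name; only the regular expressions themselves come from the pipeline above.

```r
library(stringr)

# Hypothetical description strings, shaped like the text after str_squish()
desc <- c(
  "Director: Jafar Panahi Spine 1317",
  "Directors: Jane Doe & John Roe Spine 42 Collection: Example Boxset"
)

# Lazy match between the "Director(s):" label and the "Spine" label
director <- str_extract(desc, "(?<=Directors?:\\s).*?(?=\\sSpine)")

# First run of digits after "Spine" (with or without a colon)
spine <- as.integer(str_extract(desc, "(?<=Spine:?\\s)\\d+"))

# Everything after "Collection:", or NA when no boxset label is present
boxset <- str_extract(desc, "(?<=Collection:\\s).*?$")

# Hypothetical helper mirroring the movie_id step: lowercase, strip
# punctuation, hyphenate whitespace, append the year
make_movie_id <- function(title, year) {
  slug <- str_replace_all(str_to_lower(title), "[[:punct:]]", "")
  paste(str_replace_all(slug, "\\s+", "-"), year, sep = "-")
}

movie_id <- make_movie_id(
  c("It Was Just an Accident", "4 Months, 3 Weeks and 2 Days"),
  c(2025, 2007)
)
```

The second string shows why the boxset column needs replace_na(): titles without a "Collection:" label come back as NA and are then labeled "Standalone".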

My secondary phase involved augmenting the base dataset with additional metrics from the TMDb and OMDb APIs. I implemented an iterative scraping script to retrieve data from OMDb and TMDb, match that resulting data up to my cleaned master dataset, and save a copy of those results. These additional metrics would include metascore, average audience ratings, number of votes, and language.
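The retrieval script itself is not reproduced in this report. As a rough sketch of what such a loop might look like, the code below queries the TMDb v3 movie-search endpoint with httr and parses the first hit with jsonlite. The function names, the choice of the search endpoint, the use of original_language as the language field, and the pause length are all assumptions for illustration; a fuller version would likely also call the movie-details endpoint for runtime, genres, and country.

```r
library(httr)
library(jsonlite)

# Request one title from the TMDb search endpoint; `api_key` is a placeholder
fetch_tmdb_json <- function(title, year, api_key) {
  resp <- GET(
    "https://api.themoviedb.org/3/search/movie",
    query = list(api_key = api_key, query = title, year = year)
  )
  stop_for_status(resp)
  content(resp, as = "text", encoding = "UTF-8")
}

# Keep only the first search hit; return NA fields when nothing matches
parse_tmdb_hit <- function(json_text) {
  results <- fromJSON(json_text)$results
  if (length(results) == 0) {
    return(data.frame(tmdb_id = NA_integer_, popularity = NA_real_,
                      vote_average = NA_real_, vote_count = NA_integer_,
                      language = NA_character_))
  }
  hit <- results[1, ]
  data.frame(tmdb_id = as.integer(hit$id), popularity = hit$popularity,
             vote_average = hit$vote_average,
             vote_count = as.integer(hit$vote_count),
             language = hit$original_language)
}

# Loop over titles, pausing briefly between requests
fetch_all <- function(titles, years, api_key, pause = 0.25) {
  rows <- lapply(seq_along(titles), function(i) {
    Sys.sleep(pause)
    parse_tmdb_hit(fetch_tmdb_json(titles[i], years[i], api_key))
  })
  do.call(rbind, rows)
}
```

Separating the network call from the parsing keeps the parser testable on saved responses, and the NA row for an empty result makes unmatched titles easy to find afterwards.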

While validating the TMDb data, I detected three key areas requiring amending:

Relational Language Mapping: The API returned language values as two-letter ISO 639-1 codes (e.g., “en”, “fr”); to expand these into more easily comprehensible values, I imported an external language look-up table and manually amended it with three language values present in the TMDb dataset but not in the look-up table: Cantonese, Serbo-Croatian, and None Spoken.

Film History Logic Application: The API inconsistently tagged the language values of silent films; I used my knowledge of film history to implement a conditional logic rule that standardizes the language to “None Spoken” for every film released during the Silent Era (pre-1930), and then manually tagged the few silent titles released in 1930 or later.

Edge Case Resolution: The API failed to locate TMDb data for twelve titles, primarily the television titles present in the collection; as this number was relatively small, I addressed this by manually constructing a tibble with the missing values, ensuring a 100% record retention rate.

Following these adjustments, I then proceeded to update the TMDb data frame for those twelve titles, along with the silent films, and used a left join to merge the language look-up table. Finally, I performed a standard schema cleanup to prepare the TMDb dataset for the master integration.

lang_lookup <- ISO_639_2 %>% 
  select(Id = Alpha_2, language_full = Name) %>% 
  filter(!is.na(Id)) %>% 
  add_row(
    Id = c("cn", "sh", "xx"), 
    language_full = c("Cantonese", "Serbo-Croatian", "None Spoken")) %>% 
  mutate(language_full = str_replace_all(language_full, "Chinese", "Mandarin"))
    
tmdb_silent_patch <- tmdb_raw %>% 
  filter(year < 1930 | title %in% c(
    "À propos de Nice", "People on Sunday", "Borderline", "Limite", 
    "City Lights", "A Story of Floating Weeds", "Modern Times")) %>% 
  mutate(language = "xx")
    
tmdb_missing_patch <- tribble(
  ~title, ~year, ~tmdb_id, ~runtime_min, 
  ~genre, ~country, ~language, ~popularity, ~vote_average, ~vote_count,
  "The Underground Railroad", 2021, 80039, 585,
  "Drama, Science Fiction", "United States of America", "en", 7.8445, 7.2, 145,
  "Mangrove", 2020, 90705, 127,
  "Drama", "United Kingdom", "en", 8.2423, 7.4, 48,
  # ... [Truncated for readability]
)

tmdb_cleaned <- tmdb_raw %>% 
  rows_update(tmdb_silent_patch, by = c("title", "year")) %>% 
  rows_update(tmdb_missing_patch, by = c("title", "year")) %>% 
  mutate(
    spine = as.integer(spine),
    year = as.integer(year),
    tmdb_id = as.integer(tmdb_id),
    runtime_min = as.integer(runtime_min), 
    vote_count = as.integer(vote_count)) %>% 
  left_join(lang_lookup, by = c("language" = "Id")) %>% 
  select(-language) %>% 
  rename(
    tmdb_vote_count = vote_count,
    tmdb_vote_avg = vote_average,
    tmdb_popularity = popularity,
    language = language_full) %>% 
  relocate(language, .after = "country")

A parallel process was applied to the OMDb dataset to validate and ensure complete, consistent metrics. To streamline the process, I utilized the previously cleaned TMDb dataset to resolve records where the OMDb API initially retrieved null values. For these null values, I extracted their TMDb id, converted that to its corresponding IMDb id value, and then utilized a targeted iterative scraping script using that value to recover the missing metrics.
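The recovery step described above can be sketched as two small requests: one to TMDb’s external_ids endpoint to obtain the IMDb ID, and one to OMDb using its `i` parameter. The helper names and API-key arguments below are hypothetical, and the television titles in the collection would presumably need TMDb’s /tv/ path rather than /movie/; treat this as an illustration under those assumptions, not the script actually used.

```r
library(httr)
library(jsonlite)

# Look up the IMDb id for a TMDb movie id via the external_ids endpoint
tmdb_to_imdb <- function(tmdb_id, tmdb_key) {
  url <- sprintf("https://api.themoviedb.org/3/movie/%d/external_ids",
                 as.integer(tmdb_id))
  resp <- GET(url, query = list(api_key = tmdb_key))
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))$imdb_id
}

# Convert an OMDb JSON payload into numeric metrics.
# OMDb returns ratings as strings and reports gaps as "N/A".
parse_omdb <- function(json_text) {
  rec <- fromJSON(json_text)
  to_num <- function(x) {
    if (is.null(x) || x == "N/A") NA_real_ else as.numeric(x)
  }
  data.frame(imdb_rating = to_num(rec$imdbRating),
             metascore = to_num(rec$Metascore))
}

# Retry OMDb by IMDb id instead of by title
fetch_omdb_by_imdb <- function(imdb_id, omdb_key) {
  resp <- GET("https://www.omdbapi.com/",
              query = list(i = imdb_id, apikey = omdb_key))
  stop_for_status(resp)
  parse_omdb(content(resp, as = "text", encoding = "UTF-8"))
}
```

Looking titles up by IMDb ID sidesteps the title-matching ambiguity (alternate titles, remakes, diacritics) that most plausibly caused the initial null results.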

With both the TMDb and OMDb datasets complete and validated, I performed a final join between the two datasets. Because the TMDb and OMDb datasets were built directly from the earlier Criterion dataset, a complex three-way join was not necessary; instead, a single right join on the shared movie_id key was enough to maintain relational integrity. This resulted in a final, high-fidelity dataset of every title currently present in the Criterion Collection, along with additional audience and industry metrics for each title, that was now ready for exploratory analysis.

criterion <- omdb_cleaned %>% 
  select("movie_id", "imdb_rating", "metascore") %>% 
  right_join(tmdb_cleaned, by = c("movie_id")) %>% 
  mutate(director = str_replace_all(director, " &", ",")) %>% 
  relocate(imdb_rating, metascore, tmdb_id, movie_id, .after = last_col())
criterion %>% 
  glimpse()
## Rows: 1,689
## Columns: 16
## $ spine           <int> 1317, 1316, 1315, 1314, 1313, 1312, 1311, 1310, 1309, …
## $ title           <chr> "It Was Just an Accident", "Desperate Living", "Hairsp…
## $ year            <int> 2025, 1977, 1988, 1998, 1979, 1974, 2025, 1994, 1996, …
## $ director        <chr> "Jafar Panahi", "John Waters", "John Waters", "Lisa Ch…
## $ boxset          <chr> "Standalone", "Standalone", "Standalone", "Standalone"…
## $ runtime_min     <int> 103, 91, 92, 101, 116, 111, 133, 79, 85, 113, 130, 109…
## $ genre           <chr> "Drama, Thriller, Crime, Mystery", "Comedy, Crime", "C…
## $ country         <chr> "Iran, France, Luxembourg", "United States of America"…
## $ language        <chr> "French", "English", "English", "English", "French", "…
## $ tmdb_popularity <dbl> 10.1678, 31.7700, 4.3912, 6.6095, 0.4708, 2.0550, 15.8…
## $ tmdb_vote_avg   <dbl> 7.200, 6.413, 6.807, 6.200, 6.400, 7.305, 7.500, 5.300…
## $ tmdb_vote_count <int> 473, 115, 443, 142, 13, 298, 738, 15, 26, 669, 346, 27…
## $ imdb_rating     <dbl> 7.5, 7.0, 7.0, 6.6, 7.1, 7.5, 7.8, 6.0, 5.5, 7.4, 6.5,…
## $ metascore       <int> 91, NA, 77, 73, NA, 61, 86, NA, 60, 77, 55, 51, 76, 86…
## $ tmdb_id         <int> 1456349, 14262, 11054, 37636, 271858, 27094, 1124566, …
## $ movie_id        <chr> "it-was-just-an-accident-2025", "desperate-living-1977…

Reliability Filter

Before proceeding with the exploratory analysis, I implemented a data reliability filter to ensure that the audience rating metrics were statistically sound. By plotting the distribution of TMDb vote counts for titles in the Criterion Collection, I identified a long tail of data with minimal votes that could introduce bias in our analysis.

# Plot Vote Count
ggplot(criterion, aes(x = tmdb_vote_count)) +
  geom_histogram(bins = 50, fill = "steelblue", color = "white") +
  scale_x_log10() + 
  geom_vline(xintercept = 50, linetype = "dashed", color = "#e74c3c") +
  labs(title = "Distribution of TMDb Vote Counts",
       x = "Vote Count (Log Scale)", y = "Count of Films")

To mitigate this potential bias, I established a minimum threshold of 50 votes. This filter helps ensure that each title’s rating and popularity metrics rest on a reasonable number of votes. Furthermore, I averaged the IMDb and TMDb audience scores into a single Aggregated Audience Rating metric, simplifying subsequent calculations and providing a more balanced audience representation.

# Apply A 50-Vote Threshold and Aggregate The Two Ratings
criterion_filtered <- criterion %>% 
  filter(tmdb_vote_count >= 50) %>%
  mutate(combined_rating = (imdb_rating + tmdb_vote_avg) / 2)

Following the application of this filter, the final dataset for analysis consists of:

Data Reliability: Post-Filter Title Count
Dataset Stage | Title Count
Original Collection Titles | 1689
Titles Meeting 50-Vote Threshold | 1295
Titles Removed (Low Consensus) | 394
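As a small illustration of how the threshold and the averaged rating behave, the sketch below runs the same two steps on an invented four-row data frame and tallies the before and after counts. The toy values and object names are made up for the example; only the filter condition and the averaging formula mirror the code above.

```r
library(dplyr)

# Toy stand-in for the merged dataset; all values are invented
toy <- tibble(
  title = c("A", "B", "C", "D"),
  tmdb_vote_count = c(12L, 50L, 473L, 1200L),
  imdb_rating = c(6.1, 7.0, NA, 8.2),
  tmdb_vote_avg = c(5.9, 6.8, 7.2, 8.0)
)

toy_filtered <- toy %>%
  filter(tmdb_vote_count >= 50) %>%   # threshold is inclusive: exactly 50 is kept
  mutate(combined_rating = (imdb_rating + tmdb_vote_avg) / 2)

# Before/after counts, analogous to the table above
counts <- c(
  original = nrow(toy),
  kept = nrow(toy_filtered),
  removed = nrow(toy) - nrow(toy_filtered)
)
```

Note that title C keeps its row but receives an NA combined rating because its IMDb score is missing, which is why the later plotting code passes na.rm = TRUE when computing the mean.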

With the dataset cleaned, validated, and filtered for statistical reliability, I shift from data preparation to a foundational overview of the collection’s primary attributes.

3. FOUNDATIONAL OVERVIEW OF THE COLLECTION

Distribution of Audience Ratings

With the dataset fully prepared for analysis, I generate three exploratory views to get a general overview of the titles in the Criterion Collection before homing in on specifics. I begin by creating a histogram of the distribution of the aggregated audience rating metric across the titles:

# Plot Histogram of Audience Rating Distribution
ggplot(criterion_filtered, aes(x = combined_rating)) +
  geom_histogram(binwidth = 0.2, fill = "steelblue", color = "white") +
  geom_vline(aes(xintercept = mean(combined_rating, na.rm = TRUE)), 
             color = "#e74c3c", linetype = "dashed", linewidth = 1) +
  labs(title = "Criterion Standard: Overall Distribution of Audience Ratings",
       subtitle = "Dashed red line indicates the collection mean",
       x = "Average Audience Rating",
       y = "Frequency (Number of Titles)")

The distribution of audience ratings is healthy, with very few titles falling below 6.5. At the same time, the mean is not inordinately skewed towards the high end of the rating scale; while this may be initially surprising given the collection’s reputation, it makes sense given the breadth of titles selected by Criterion, which often include titles that are historically important but not particularly mainstream or audience-friendly.

Geographical Representation

Following this, I create a treemap showcasing the countries of origin amongst the titles in the collection:

# Calculate Country Counts
country_counts <- criterion_filtered %>%
  separate_rows(country, sep = ",\\s*") %>%
  group_by(country) %>%
  summarise(count = n()) %>%
  filter(count > 2)

# Plot Country Count
ggplot(country_counts, aes(area = count, fill = count, label = country)) +
  geom_treemap() +
  geom_treemap_text(colour = "white", place = "centre", grow = TRUE) +
  scale_fill_gradient(low = "#34495e", high = "#e74c3c") +
  labs(title = "Criterion Distribution by Country",
       subtitle = "Size indicates number of titles in the filtered collection")

The resulting visualization confirms the assumption that the majority of titles are sourced from the countries with the longest and most storied cinematic histories: USA, France, Japan, Italy, UK, and Germany.
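One detail of the country counts is worth illustrating: because the country field is split on commas, a co-production contributes one count to each of its countries. The toy example below, with invented titles, shows the effect of separate_rows() using the same separator pattern as the code above.

```r
library(dplyr)
library(tidyr)

# Invented titles; the first is a three-country co-production
toy <- tibble(
  title = c("Film A", "Film B", "Film C"),
  country = c("Iran, France, Luxembourg", "France", "Japan")
)

country_counts_toy <- toy %>%
  separate_rows(country, sep = ",\\s*") %>%   # one row per (title, country) pair
  count(country, name = "count", sort = TRUE)
```

Here three titles yield five country rows, so the treemap areas describe country appearances rather than unique titles.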

Sorting Modern Titles by Metascore

Lastly, to complement the audience metrics, I investigate the Metacritic data. Metacritic is a website that aggregates film critic reviews into a single score on a scale from 0 to 100. However, while it is a useful metric for assessing the critical consensus around modern films, Metacritic is often missing data for titles released before the year 2000. Since the majority of the titles in the collection are classics released before 2000, this metric will not be useful for calculations and analysis focused on the collection as a whole. Instead, I generate a table of the 10 titles released in the 21st century with the highest Metascore:

Top Reviewed 21st Century Criterion Titles
Title | Director | Year | Metascore | Avg Audience Rating
Boyhood | Richard Linklater | 2014 | 100 | 7.6960
Pan’s Labyrinth | Guillermo del Toro | 2006 | 98 | 7.9800
Parasite | Bong Joon-ho | 2019 | 97 | 8.4970
4 Months, 3 Weeks and 2 Days | Cristian Mungiu | 2007 | 97 | 7.7230
Roma | Alfonso Cuarón | 2018 | 96 | 7.6115
WALL·E | Andrew Stanton | 2008 | 95 | 8.2545
Portrait of a Lady on Fire | Céline Sciamma | 2019 | 95 | 8.0515
The Irishman | Martin Scorsese | 2019 | 94 | 7.7000
Marriage Story | Noah Baumbach | 2019 | 94 | 7.8170
45 Years | Andrew Haigh | 2015 | 94 | 6.8525
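The code behind this table is not shown, but a table of this shape could be produced along the following lines. The year cutoff (2000 or later), the choice to break ties arbitrarily so that exactly ten rows are returned, and the toy data are all assumptions made for the sketch.

```r
library(dplyr)

# Invented rows standing in for the filtered dataset
toy <- tibble(
  title = c("Old Classic", "Film A", "Film B", "Film C", "Film D"),
  year = c(1955L, 2006L, 2014L, 2019L, 2019L),
  metascore = c(99L, 98L, 100L, NA, 94L),
  combined_rating = c(8.1, 7.98, 7.70, 8.0, 7.8)
)

top_modern <- toy %>%
  filter(year >= 2000, !is.na(metascore)) %>%          # assumed cutoff; drop unscored titles
  slice_max(metascore, n = 10, with_ties = FALSE) %>%  # at most 10 rows, highest first
  select(title, year, metascore, combined_rating)
```

With with_ties left at its default of TRUE, slice_max() would return every title tied at the tenth-place score, which matters here because several titles share a Metascore of 94.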

Drawing on my professional background in film post-production, the composition of this top 10 aligns with industry consensus for critically acclaimed 21st-century cinema. However, a divergence is visible between the audience ratings and critic ratings for these 10 titles, with the three titles that seem to hold significant cross-over appeal being Portrait of a Lady on Fire (2019), WALL·E (2008), and Parasite (2019).

With a firm grasp of the dataset’s foundational distributions, I proceed to analyze two areas of interest: the ratings, title counts, and popularity of the genres and of the directors in the collection.

6. CONCLUSIONS AND RECOMMENDATIONS

Strategic Acquisitions by Genre

Utilizing the analysis on genre-based expansion to the Criterion Collection, and weighing our Aggregated Audience Rating, market scarcity and technical necessity, I recommend the following acquisitions:

Acquisition Strategy
Genre | Justification | Recommended Titles
War | Average popularity (780 avg. votes), highest audience rating (7.52), low current collection volume (75 titles) | Incendies (2010), Das Boot (1981), Underground (1995)
History | Average popularity (688 avg. votes), second highest audience rating (7.47), low current collection volume (88 titles) | To Live (1994), Judgment at Nuremberg (1961), Quo Vadis, Aida? (2020)
Mystery | High popularity (1070 avg. votes), above average audience rating (7.28), relatively low current collection volume (94 titles) | Memento (2000), Caché (2005), Laura (1944)
Genre: War

For the War genre, I selected three titles that are highly rated by audiences, all of which currently reside in Letterboxd’s Top 150 films of all time. Incendies (2010) is the highest-rated film by Denis Villeneuve, one of the most well-known and acclaimed filmmakers working today, and yet it is relegated to a legacy 2K transfer on a standard-issue Blu-ray. Das Boot (1981) is a popular and highly rated classic, yet it is similarly relegated to an antiquated standard Blu-ray release that also fails to include the definitive mini-series cut of the film. Underground (1995), while a less immediately recognizable title, is just as highly rated as the other two; it was released a few years ago on Blu-ray by market peer Kino Lorber, but has since gone out of print, suggesting their rights to the title have likely expired.

Genre: History

With the History genre, I applied a similar rationale. Judgment at Nuremberg (1961) is an iconic classic that had a prior Blu-ray release by Kino Lorber, but that release has been out of print for some time. To Live (1994) is a beloved and devastating film from a director already present in the collection, but it has never been released on Blu-ray or 4K UHD in the United States. Quo Vadis, Aida? (2020) is a more recent title and one of the highest-rated titles of the decade on IMDb and Letterboxd, yet it has never been given a physical media release of any kind in the United States.

Genre: Mystery

For the Mystery genre, I selected three equally iconic titles, all from directors with a title already present in the Criterion Collection. Memento (2000) was the breakout hit of Christopher Nolan, whose filmography has defined the modern prestige blockbuster, but it is one of his only titles never to have been released on 4K UHD. Caché (2005) is a chilling modern masterpiece from Michael Haneke, highly rated by fans, yet it has never received a Blu-ray or 4K UHD release in the United States. Laura (1944), by Otto Preminger, is one of the most famous film noirs of all time, but it has only been released on an out-of-print Blu-ray that sells for high prices on the secondhand market.

Strategic Acquisitions by Director

While genre-based expansion addresses thematic gaps in the collection, a director-centric approach identifies the auteurs whose individual bodies of work drive the highest Aggregated Audience Rating. Following a similar logic of market scarcity and technical necessity, I recommend the following acquisitions:

Acquisition Strategy
Director | Justification | Recommended Titles
Bong Joon-ho | Highest popularity (9.8K avg. votes), high audience rating (7.96), low current collection volume (3 titles) | Mother (2009), The Host (2006), Snowpiercer (2013)
Billy Wilder | High popularity (2.2K avg. votes), second highest audience rating (8.11), low current collection volume (3 titles) | Witness for the Prosecution (1957), The Lost Weekend (1945), The Seven Year Itch (1955)
Fritz Lang | Above average popularity (862 avg. votes), high audience rating (7.69), relatively low current collection volume (4 titles) | Dr. Mabuse Trilogy (1922-1960), The Woman in the Window (1944), Fury (1936)

Director: Bong Joon-ho

Bong Joon-ho is a modern master, effectively bridging the gap between arthouse world cinema and mainstream American genre filmmaking. Despite his high level of popularity, quite possibly being the most famous international director working today, he has a relatively small filmography. Yet within that filmography, two of his most iconic and beloved titles, Mother (2009) and The Host (2006), have never been released on 4K UHD or by a boutique Blu-ray label, and another iconic title, Snowpiercer (2013), received a 4K UHD release that is no longer in print. All three seem to be a natural fit for the Criterion Collection.

Director: Billy Wilder

Billy Wilder is one of the auteur titans of classic Hollywood cinema. The majority of his titles that are not already licensed by Criterion are currently in print from Kino Lorber; however, most of these releases are decades-old 2K transfers in need of a new 4K restoration and release. Of these, I recommend Witness for the Prosecution (1957), one of his most popular and highly rated titles, and The Lost Weekend (1945), one of Wilder’s two titles to win the Academy Award for Best Picture. There is precedent for Criterion releasing a Billy Wilder title that Kino Lorber had released prior: Some Like It Hot (1959) was released by Kino on 4K UHD in 2022, only for their rights to expire and for Criterion to release their own 4K of the title in 2025. I would also recommend Criterion look into acquiring The Seven Year Itch (1955), given that it is fairly popular, contains one of Hollywood’s most iconic images in Marilyn Monroe in her white skirt, and has no current boutique release.

Director: Fritz Lang

While Fritz Lang may have a lower popularity metric than Wilder or Bong, it is still fairly robust, outranking multiple modern directors and ranking third among directors who originated in the silent era, trailing only Charlie Chaplin and Alfred Hitchcock. Lang has a deep filmography, spanning from silent German epics to American noir, and while many of his iconic titles have already been licensed by Criterion and Kino Lorber, quite a few are still currently unavailable or unreleased. Dr. Mabuse, the Gambler (1922), a fusion of silent epic filmmaking and noir, remains one of Lang’s most well-known silent titles. It was released on Blu-ray a decade ago by Kino, but that release has long been out of print. As Criterion themselves have previously released its sequel, The Testament of Dr. Mabuse (1933), on DVD, there is a great opportunity to release a Blu-ray boxset of the two titles together, along with the third Mabuse Trilogy entry, The Thousand Eyes of Dr. Mabuse (1960). Alongside a potential Mabuse boxset, two of his best noirs, The Woman in the Window (1944) and Fury (1936), remain out of print from Kino and Warner Archive respectively, and would be ideal additions to the collection as 4K UHD upgrades.

By synthesizing algorithmic unnesting, statistical reliability thresholds, and deep domain expertise in film history, this analysis provides a data-driven roadmap for future Criterion Collection acquisitions. These recommendations ensure that as the collection expands, it continues to balance its legacy of global prestige with the evolving demands of the modern home cinema audience and cinephiles everywhere.

7. DATA SOURCES AND REFERENCES

  • Initial Master List: The Complete Criterion Collection (via Letterboxd, Curated by Josh, Updated March 16, 2026)

  • Supplementary Metadata: Aggregated via the TMDb and OMDb APIs, using the ‘httr’ and ‘jsonlite’ R packages.

  • Market Intelligence: Release history, transfer specifications, and rights-status verified via Blu-ray.com.

  • Criterion Collection: Official spine data and catalog verification via Criterion.com.