Evals for Diversity in Synthetic Data

An overview of evaluation metrics for measuring linguistic diversity in LLM-generated synthetic data
Author

Amit Chaudhary

Published

February 9, 2025

Generating synthetic data from LLMs has become a popular approach for bootstrapping an initial dataset when building LLM-based applications.

We can find practical examples of this, such as generating synthetic user queries from existing documents to evaluate RAG systems [1], producing fake meeting transcripts for video call summarization [2], or even bootstrapping lots of texts (emails, inquiries, multi-turn chats, etc.) for good old classification tasks (customer service routing, intent classification, sentiment analysis, etc.).

A common starting point is to write a prompt defining the data we need, provide a few seed examples either within the prompt or as few-shot exemplars, and sample multiple times from the LLM to bootstrap a dataset, as sketched below.
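As a minimal sketch, assuming the OpenAI Python client and a hypothetical customer-support use case (the prompt and model name are placeholders; any LLM API works the same way):

from openai import OpenAI

client = OpenAI()

# Hypothetical prompt with a few seed examples baked in
prompt = """Generate one short customer support query for a food delivery app.

Examples:
- Where is my order?
- I want to cancel my subscription."""

# Sample multiple times to bootstrap a small synthetic dataset
synthetic_texts = []
for _ in range(10):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    synthetic_texts.append(response.choices[0].message.content)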

However, we will find the outputs generated by LLMs out of the box to be repetitive. We can turn to the usual techniques to increase diversity:

- Raising the sampling temperature or using diverse decoding strategies (Ippolito et al. (2019), Meister et al. (2023))
- Varying the prompt, for example by conditioning on personas (Ge et al. (2024)) or on attributes such as topic, length, and style (Yu et al. (2023))
- Deduplicating near-identical generations, e.g. with semantic deduplication tools like SemHash (van Dongen and Tulkens (2025))

But this raises the question:

How do we systematically test the impact of various techniques above on diversity without relying on just vibe checks?

I was curious about this and read up on the academic literature to see if there are any existing evals for diversity. It turns out there is a large body of prior work on evaluating diversity, dating back to classic sequence-to-sequence models and dialogue generation (Shaib et al. (2024a), Guo et al. (2024)).

In this post, I will discuss the various diversity metrics from the literature and explain how they work. These automatic metrics are fast to compute and can be a useful tool to have as a proxy for evaluating linguistic diversity in applied use cases.

Lexical Diversity Metrics

These metrics capture the surface-level repetition of words, phrases, topics, and n-grams in the generations.

Distinct n-grams (Distinct-k)

This simple metric was proposed in Li et al. (2016) to evaluate their loss function that increases diversity in sequence-to-sequence models. It is based on the idea of type-token ratio from linguistics.

It calculates diversity as the ratio of the number of unique n-grams to the total number of n-grams in the entire generated dataset. For example, the two texts "As an AI language model" and "As an AI model" contain only 5 unique unigrams (As, an, AI, language, model) out of a total of 9 unigrams, so the diversity score is only 5/9 ≈ 0.556.

However, if all the synthetic texts were unique, we would get a diversity score of 100% (1.0).

This same concept can be extended from unigrams to bigrams, trigrams, and any higher-order n-grams.

One approach is to calculate and report the diversity score separately for different n-gram orders. Li et al. (2016) do this for unigrams and bigrams as distinct-1 and distinct-2, while Padmakumar and He (2023) report diversity scores up to 4-grams separately in their paper showing that instruction-tuned models have lower diversity than base models.

An alternate approach is to combine the scores for different n-grams into a single number: Li et al. (2022) take the product of the diversity scores for unigrams, bigrams, trigrams, and 4-grams as a single final score, while Meister et al. (2023) take the sum of the diversities, as sketched below.
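Here is a minimal from-scratch sketch of distinct-n and the two aggregation styles (the function name is mine, not from the papers):

import math


def distinct_n(texts: list[str], n: int) -> float:
    # Collect all n-grams across the generated texts
    ngrams = []
    for text in texts:
        words = text.split()
        ngrams.extend(tuple(words[i : i + n]) for i in range(len(words) - n + 1))
    # Distinct-n: number of unique n-grams / total number of n-grams
    return len(set(ngrams)) / len(ngrams)


texts = ["As an AI language model", "As an AI model"]

scores = [distinct_n(texts, n) for n in range(1, 5)]
print(math.prod(scores))  # product of distinct-1 to distinct-4, as in Li et al. (2022)
print(sum(scores))        # sum of the diversities, as in Meister et al. (2023)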

We can also compute the distinct-k metric via the diversity library by Shaib et al. (2024a), as shown below:

pip install diversity
from diversity import ngram_diversity_score

texts = ['As an AI language model', 'As an AI model']

ngram_diversity_score(texts, 1)
0.556

N-gram Entropy (Ent-n)

This metric was introduced in Zhang et al. (2018) and has also been used by Jagfeld et al. (2018) to evaluate output diversity in data-to-text natural language generation.

The intuition behind it is that in an ideal case, all the texts generated from an LLM would be unique and no n-gram would occur more than once.

We can measure this by collecting all the bigrams across the texts, counting them, and computing their relative frequencies. This gives us a probability distribution over the bigrams.

For the highest diversity, all the texts would be unique, the probability distribution over the bigrams would be uniform, and the entropy would be at its maximum. Therefore, the entropy of the n-gram distribution can be used as a metric for diversity.

Given a distribution of bigrams, we can calculate the entropy easily as shown below.

import math

# Uniform distribution over 4 unique bigrams
probs = [0.25, 0.25, 0.25, 0.25]

# Entropy: -Σ p·ln(p) = ln(4)
-sum(p * math.log(p) for p in probs)
1.3862

However, let’s take another case where there is lots of repetition, e.g. “Play the music” generated 100 times. Here, the bigrams “Play the” and “the music” would have the highest frequency, so the entropy would be lower, and thus the diversity would be lower.

We can also extend this idea to higher-order n-grams, similar to the distinct n-grams metric. Tevet and Berant (2020) calculate and report entropy separately for unigrams, bigrams, and trigrams, while Oraby et al. (2018) pool all unique unigrams, bigrams, and trigrams together and use the entropy of the resulting distribution as the diversity.

This metric can be implemented in code as shown below.

import math
from collections import Counter


def generate_ngrams(words: list[str], n: int) -> list[str]:
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


def ngram_entropy(texts: list[str], n: int = 2) -> float:
    # Step 1: Generate n-grams from the input texts
    ngrams = []
    for text in texts:
        words = text.split()
        ngrams.extend(generate_ngrams(words, n))

    # Step 2: Count the frequency of each n-gram
    ngram_counts = Counter(ngrams)
    total_ngrams = sum(ngram_counts.values())

    # Step 3: Compute the relative frequency of each n-gram
    ngram_frequencies = [count / total_ngrams for count in ngram_counts.values()]

    # Step 4: Calculate the entropy of the n-gram distribution
    entropy = -sum(freq * math.log(freq) for freq in ngram_frequencies)

    return entropy
texts = ["Call an Uber", "Play the music"]

print("Unigram entropy:", ngram_entropy(texts, n=1))
print("Bigram entropy:", ngram_entropy(texts, n=2))
print("Trigram entropy:", ngram_entropy(texts, n=3))
Unigram entropy: 1.7917594692280547
Bigram entropy: 1.3862943611198906
Trigram entropy: 0.6931471805599453
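To see the entropy drop on the repetitive case from earlier, we can reuse the same function (the expected value in the comment is my own calculation):

repetitive_texts = ["Call an Uber"] + ["Play the music"] * 100

# "Play the" and "the music" dominate the bigram distribution,
# so entropy falls to ≈ 0.75 versus 1.386 for the two unique texts
print("Bigram entropy:", ngram_entropy(repetitive_texts, n=2))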
Normalized N-gram Entropy

The original n-gram entropy metric doesn’t have a fixed range for the score.

To get a score between a range of 0 to 1, I thought of a normalized version inspired by the NDCG metric from Information Retrieval.

For any generated set of texts, the maximum possible diversity occurs when all the n-grams occur with the same frequency. Thus, the entropy of a uniform distribution over those n-grams gives us an upper bound on diversity.

We can calculate the n-gram entropy as before and then divide it by the entropy of the ideal uniform distribution over the n-grams to get a normalized diversity score between 0 and 1.

import math
from collections import Counter


def generate_ngrams(words: list[str], n: int) -> list[str]:
    return [" ".join(words[i : i + n]) for i in range(len(words) - n + 1)]


def normalized_ngram_entropy(texts: list[str], n: int = 2) -> float:
    # Collect all n-grams across the texts
    ngrams = []
    for text in texts:
        words = text.split()
        ngrams.extend(generate_ngrams(words, n))

    # Entropy of the observed n-gram distribution
    ngram_counts = Counter(ngrams)
    total_ngrams = sum(ngram_counts.values())
    ngram_frequencies = [count / total_ngrams for count in ngram_counts.values()]
    entropy = -sum(freq * math.log(freq) for freq in ngram_frequencies)

    # Entropy of the ideal uniform distribution, where every n-gram occurs once
    ideal_entropy = math.log(len(ngrams))

    return entropy / ideal_entropy

We can use it just like before.

texts = ["Call an Uber", "Play the music"]

print("Unigram diversity:", normalized_ngram_entropy(texts, n=1))
print("Bigram diversity:", normalized_ngram_entropy(texts, n=2))
print("Trigram diversity:", normalized_ngram_entropy(texts, n=3))
Unigram diversity: 1.0
Bigram diversity: 1.0
Trigram diversity: 1.0

Compression Ratio

This metric was proposed by Shaib et al. (2024a), adapting the concept of the compression ratio, originally used to evaluate compression algorithms, as a diversity metric.

Compression ratio is the ratio of a file's original size to its compressed size. If the compression ratio is high, it indicates the file was highly compressible and thus had high redundancy, which in turn indicates lower diversity in the file's contents.

To apply this concept to texts, we can concatenate them, compress the result using an algorithm like gzip, and then calculate the compression ratio. A higher ratio means the texts were more redundant: the greater the compression ratio, the less diverse the generated texts.

Diversity can thus be calculated as the reciprocal of the compression ratio, giving a score between 0 and 1. For the repetitive example used in the code below:

\[ \text{Diversity} = \frac{1}{\text{Compression Ratio}} = \frac{1}{16.258} = 0.06 \]

If all the texts were unique, the compressed file size would be close to the original file size, so the compression ratio and the diversity would both be close to 1.
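A from-scratch sketch of the idea (the function name is mine; exact values depend on how texts are joined and on gzip settings):

import gzip


def gzip_compression_ratio(texts: list[str]) -> float:
    # Concatenate all texts and measure size before and after compression
    original = "\n".join(texts).encode("utf-8")
    compressed = gzip.compress(original)
    return len(original) / len(compressed)


texts = ["Call an Uber"] + ["Play the music"] * 100
cr = gzip_compression_ratio(texts)
print("Compression ratio:", cr)
print("Diversity:", 1 / cr)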

Alternatively, we can use the diversity library directly.

pip install diversity
from diversity import compression_ratio

texts = ['Call an Uber'] + ['Play the music'] * 100

compression_ratio(texts)
16.258

Semantic Diversity Metrics

These metrics capture diversity in terms of meaning and rely on embeddings. They are useful for handling cases where the generated texts are all similar in meaning but have zero n-gram overlap.

For example, "Play the music" and "Start a song" have zero word overlap and would thus be incorrectly assigned 100% diversity by lexical metrics. However, they are repetitive in meaning and should be assigned a lower diversity score. Semantic diversity metrics can handle this.

Embedding Diversity

This metric was proposed in Tevet and Berant (2020) and considers diversity as the dissimilarity of text embeddings.

The idea is to calculate the sentence embeddings of all the generated texts using some encoder (e.g. sentence-transformers).

Then, we calculate the cosine similarity between all the unique pairs and take the average to get a similarity score.

To convert the similarity into diversity, we can either take the negation of the average cosine similarity (Tevet and Berant, 2020) or take the cosine distance, i.e. \(1 - \text{cosine similarity}\) (Young et al. (2024); Hayati et al. (2024)):

| Approach | Mean Cosine Similarity | Diversity | Range |
|---|---|---|---|
| Young et al. (2024) / Hayati et al. (2024) | 0.39 | 1 - 0.39 = 0.61 | 0 to 1 |
| Tevet and Berant (2020) | 0.39 | -0.39 | -1 to 0 |
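A minimal sketch of the cosine-distance variant, assuming the all-MiniLM-L6-v2 encoder from sentence-transformers (any sentence encoder works, and the function name is mine):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def embedding_diversity(texts: list[str]) -> float:
    # Embed and L2-normalize so dot products equal cosine similarities
    embeddings = model.encode(texts, normalize_embeddings=True)
    similarities = embeddings @ embeddings.T

    # Average cosine similarity over all unique pairs (upper triangle, no diagonal)
    pair_similarities = similarities[np.triu_indices(len(texts), k=1)]

    # Diversity as cosine distance: 1 - mean pairwise similarity
    return 1 - pair_similarities.mean()


print(embedding_diversity(["Play the music", "Start a song", "Call an Uber"]))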

DCScore

This metric was proposed in a paper currently under review for ICLR 2025 (Anonymous, 2024).

Like embedding diversity, this metric starts by calculating the pairwise similarities between all the text embeddings, but it has a unique take on formulating diversity.

To understand the intuition, suppose the first row of the pairwise similarity matrix is [1.0, 0.75, 0.2]: the text is 100% similar to itself, 75% similar to the second text, and 20% similar to the third. Ideally, for maximum diversity, the text would be 100% similar to itself and 0% similar to everything else.

Thus, we want a relative measure of how similar the text is to itself compared to the others. The authors use softmax for this: it converts the cosine similarities in each row into relative probabilities of the text "belonging" to itself and to the others. Applying softmax to the row above, the first text belongs to itself with a probability of only about 45%. This softmax probability of a text belonging to itself can therefore serve as a measure of diversity.
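We can verify this with a quick calculation:

from scipy.special import softmax

print(softmax([1.0, 0.75, 0.2]))  # ≈ [0.449, 0.350, 0.202]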

To calculate the diversity of the dataset overall, we simply take the mean of the diagonal of the pairwise matrix after applying softmax. Thus, we get a diversity score of 0.47 in the example above.

The idea is very simple and can be implemented in a few lines of code. We can swap in an embedding model of our choice as needed.

pip install sentence-transformers scipy numpy
import numpy as np
from scipy.special import softmax
from sentence_transformers import SentenceTransformer

# Load the MiniLM Sentence-BERT model
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def dcscore(texts: list[str]) -> float:
    # Generate embeddings for the sentences
    text_embeddings = model.encode(texts, normalize_embeddings=True)
    # Calculate the pairwise cosine similarity
    pairwise_matrix = text_embeddings @ text_embeddings.T
    # Apply softmax at the row level for each text
    softmax_matrix = softmax(pairwise_matrix, axis=1)
    # Take the mean of the scores on the diagonal
    score = np.mean(np.diag(softmax_matrix))
    return score


score = dcscore(['Play the music', 'Start the music', 'Call an Uber'])
print(score)
0.47264108

Cluster Inertia

This metric was proposed in Du and Black (2019) and reuses inertia, a standard measure of clustering quality, as the diversity metric.

The idea is to cluster the embeddings of the LLM-generated texts into 10 clusters and measure the inertia, i.e. the sum of squared distances between each point and the centroid of its cluster.

We can treat inertia as a proxy for diversity: if the texts are diverse, their embeddings will lie far from their cluster centroids, so the summed squared distances will be larger.

In code, this can be accomplished as shown below:

import numpy as np
from sklearn.cluster import KMeans

# Text embeddings for 1024 synthetic texts (random placeholders here)
text_embeddings = np.random.rand(1024, 768)

# Run clustering
kmeans = KMeans(n_clusters=10, random_state=42)
kmeans.fit(text_embeddings)

# Get the inertia
kmeans.inertia_
64556.00644871439

Syntactic Diversity Metrics

These metrics capture diversity in terms of the underlying grammatical structure.

Compression Ratio - Part of Speech (CR-POS)

This metric was proposed in Shaib et al. (2024b) to detect the repetition of syntactic templates in LLM-generated texts.

It reuses the idea of the Compression Ratio but applies it to a syntactic representation instead of the raw text. This works by running a part-of-speech (POS) tagger over the text to get the POS tag for each token.

We apply the POS tagger to all the synthetically generated texts to get their syntactic representations as strings (e.g. "Play the music" becomes "VERB DET NOUN").

Then, the process is the same as the regular compression ratio: we concatenate the POS strings of all the texts, compress them using gzip, and compute the ratio of the original size to the compressed size.

If the compression ratio is high, it indicates a large repetition of syntactic templates in the generated texts. Thus, the diversity will be low.

We can compute diversity directly by taking the reciprocal of the compression ratio and get a score between 0 and 1.

\[ \text{Diversity(POS)} = \frac{1}{\text{Compression Ratio(POS)}} = \frac{1}{22.22} = 0.04 \]
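A minimal sketch, assuming spaCy's en_core_web_sm model for the POS tagging (the function name and setup are mine):

import gzip

import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")


def pos_compression_ratio(texts: list[str]) -> float:
    # Replace each text with its sequence of POS tags,
    # e.g. "Play the music" -> "VERB DET NOUN"
    pos_strings = [" ".join(token.pos_ for token in nlp(text)) for text in texts]

    # Concatenate and compress the syntactic representations
    original = "\n".join(pos_strings).encode("utf-8")
    compressed = gzip.compress(original)
    return len(original) / len(compressed)


texts = ["Call an Uber"] + ["Play the music"] * 100
cr_pos = pos_compression_ratio(texts)
print("Diversity(POS):", 1 / cr_pos)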

Conclusion

In this post, we covered three categories of linguistic diversity metrics: lexical, semantic, and syntactic.

We skipped a category of diversity metrics called homogenization scores, as they can be computationally expensive for practical use cases. These work by applying evaluation metrics from machine translation and summarization, such as BLEU and ROUGE, to each text while treating all the other texts as references (Zhu et al. (2018)); Self-BLEU is sketched below.
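For completeness, here is a rough sketch of Self-BLEU using NLTK (the function name and smoothing choice are mine); note the O(n²) cost that makes it slow on large datasets:

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def self_bleu(texts: list[str]) -> float:
    tokenized = [text.split() for text in texts]
    smoothing = SmoothingFunction().method1

    # Score each text against all the others as references
    scores = [
        sentence_bleu(tokenized[:i] + tokenized[i + 1 :], hypothesis, smoothing_function=smoothing)
        for i, hypothesis in enumerate(tokenized)
    ]

    # Higher Self-BLEU = more homogeneous = less diverse
    return sum(scores) / len(scores)


print(self_bleu(["Play the music", "Start the music", "Call an Uber"]))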

For a deeper dive into diversity metrics, you can read Shaib et al. (2024a) for a comparative analysis of these metrics on various datasets and Guo et al. (2024) for an application of the metrics to evaluate popular LLMs.

References

Anonymous. 2024. Evaluating diversity of LLM-generated datasets: A classification perspective. In Submitted to the thirteenth international conference on learning representations. under review.
Thomas van Dongen and Stephan Tulkens. 2025. SemHash: Fast semantic text deduplication.
Wenchao Du and Alan W Black. 2019. Boosting dialog response generation. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th annual meeting of the association for computational linguistics, pages 38–43, Florence, Italy. Association for Computational Linguistics.
Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2024. Scaling synthetic data creation with 1,000,000,000 personas.
Yanzhu Guo, Guokan Shang, and Chloé Clavel. 2024. Benchmarking linguistic diversity of large language models.
Shirley Anugrah Hayati, Minhwa Lee, Dheeraj Rajagopal, and Dongyeop Kang. 2024. How far can we extract diverse perspectives from large language models? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 conference on empirical methods in natural language processing, pages 5336–5366, Miami, Florida, USA. Association for Computational Linguistics.
Daphne Ippolito, Reno Kriz, João Sedoc, Maria Kustikova, and Chris Callison-Burch. 2019. Comparison of diverse decoding methods from conditional language models. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th annual meeting of the association for computational linguistics, pages 3752–3762, Florence, Italy. Association for Computational Linguistics.
Glorianna Jagfeld, Sabrina Jenne, and Ngoc Thang Vu. 2018. Sequence-to-sequence models for data-to-text natural language generation: Word- vs. Character-based processing and output diversity. In Emiel Krahmer, Albert Gatt, and Martijn Goudbeek, editors, Proceedings of the 11th international conference on natural language generation, pages 221–232, Tilburg University, The Netherlands. Association for Computational Linguistics.
Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models.
Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2022. Contrastive decoding: Open-ended text generation as optimization. Annual Meeting of the Association for Computational Linguistics.
Clara Meister, Tiago Pimentel, Gian Wiher, and Ryan Cotterell. 2023. Locally typical sampling.
Rodrigo Nogueira and Jimmy Lin. 2019. From doc2query to docTTTTTquery.
Shereen Oraby, Lena Reed, Shubhangi Tandon, Sharath T.S., Stephanie Lukin, and Marilyn Walker. 2018. Controlling personality-based stylistic variation with neural natural language generators. SIGDIAL Conference.
Vishakh Padmakumar and He He. 2023. Does writing with language models reduce content diversity? International Conference on Learning Representations.
Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, and Ani Nenkova. 2024a. Standardizing the measurement of text diversity: A tool and a comparative analysis of scores.
Chantal Shaib, Yanai Elazar, Junyi Jessy Li, and Byron C. Wallace. 2024b. Detection and measurement of syntactic templates in generated text. Conference on Empirical Methods in Natural Language Processing.
Guy Tevet and Jonathan Berant. 2020. Evaluating the evaluation of diversity in natural language generation. Conference of the European Chapter of the Association for Computational Linguistics.
Halley Young, Yimeng Zeng, Jacob Gardner, and Osbert Bastani. 2024. Improving structural diversity of blackbox LLMs via chain-of-specification prompting.
Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander J. Ratner, Ranjay Krishna, Jiaming Shen, and Chao Zhang. 2023. Large language model as attributed training data generator: A tale of diversity and bias. Neural Information Processing Systems.
Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, and Bill Dolan. 2018. Generating informative and diverse conversational responses via adversarial information maximization. Neural Information Processing Systems.
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Footnotes

  1. Jason Liu has a great conceptual example of using synthetic data for RAG evaluation. Nogueira and Lin (2019) is another classic paper.

  2. OpenAI has an example walkthrough on generating synthetic transcripts for a daily standup meeting summarization use case in their build hour on evals.

Citation

BibTeX citation:
@online{chaudhary2025,
  author = {Chaudhary, Amit},
  title = {Evals for {Diversity} in {Synthetic} {Data}},
  date = {2025-02-09},
  url = {https://amitness.com/posts/diversity-evals/},
  langid = {en}
}
For attribution, please cite this work as:
Amit Chaudhary. 2025. Evals for Diversity in Synthetic Data.