Interpolated Text Embeddings

A Tiny Project With Text Embeddings

I use NaNoGenMo (National Novel Generation Month, in November, distinct from NaNoWriMo) as an excuse to try computational narrative projects, increasingly with AI/ML help (but more interesting than just “generate me a novel”). I’ve been wanting to play harder with semantic embeddings of text, using public domain fiction. The project for 2024 was a silly one, but it featured one fun highlight: text interpolation (without generating anything new). The idea of latent-space interpolation has been around for a while, but I thought I’d show some examples for text, along with some how-to.

The structure of my project was:

  1. Download the top 100 books from Project Gutenberg (filtering out non-English titles, Victorian porn (it exists), and dupes).
  2. Clean the text in various ways and split it into sentences (sketched below).
  3. Embed with a fast, small, local model.
  4. Put the embeddings in a database.
  5. Query the database, then use the closest sentence result to query again…
  6. Stitch all the sentences together into a “story” (a bad one).
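
For step 2, here’s a minimal sketch of the splitting, assuming nltk’s sent_tokenize (my actual cleaning pass was messier, and the filename is a hypothetical cleaned Gutenberg file):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # one-time tokenizer download; newer nltk may want "punkt_tab"

text = open("pg2701_cleaned.txt", encoding="utf-8").read()
# drop very short fragments; they make lousy search targets
sentences = [s.strip() for s in sent_tokenize(text) if len(s.split()) > 3]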

For the NaNoGenMo quickie, I queried with “Once upon a time” and chained the results: the closest sentence became the input for the next search, the result of that search became the input for the one after, and so on (preventing duplicate hits).
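
In code, the chain looks roughly like this. embed_text, search_chroma, and combine_res are my helpers (they show up in the code near the end of this post), and the 'id'/'content' keys are from my result dicts:

def chain_from(seed, steps=1000):
    """Greedy chain: repeatedly hop to the nearest not-yet-used sentence."""
    seen_ids = []
    chain = []
    query = seed
    for _ in range(steps):
        results = combine_res(search_chroma(query, n=10))  # extras, so we can skip seen ids
        nxt = next((r for r in results if r['id'] not in seen_ids), None)
        if nxt is None:
            break  # everything nearby has been used already
        seen_ids.append(nxt['id'])
        chain.append(nxt)
        query = nxt['content']  # this hit becomes the next query
    return chain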

To get the ending, I started from “They lived happily ever after.” and chained from that… then reversed the output list, so it would end with “They lived happily ever after” and start with the last sentence I got. I did this up to the word count needed (50K total, so 25K for each half). Then, to connect the two sections, I did some interpolation between the last sentence of the first half (the “once upon a time” chain) and the first sentence of the ending segment (the “happily ever after” section). This was the “glue.”
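
Putting the halves together is roughly this (step counts here are made up; merger is the glue function shown further down):

opening = chain_from("Once upon a time", steps=2500)
ending = list(reversed(chain_from("They lived happily ever after.", steps=2500)))
glue = merger(opening[-1]['content'], ending[0]['content'], n=10)
story = opening + glue + ending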

The results were not… good. But the interpolation part was the most fun. I’ll show some example results, and then give some technical tips and links below.

Interpolation… Sort of works!

For a tiny demo example (which appeared in my talk slides from the Creative Narrative Workshop intro I gave in Dec 2024), I tried these two sentences, mixing time of day and temperature, and made a little graphic showing the progression of the concepts.

That example worked better than many others I tried, and I was in fact a little bemused by the repeated authors (a property of their writing? the embedding model?).

Some more examples…

Between “They went by the ocean” and “They went over the mountain tops” (distance score in parens):

After they had washed them and got them quite clean, they laid them out by the sea side, where the waves had raised a high beach of shingle, and set about washing themselves and anointing themselves with olive oil.

The Odyssey: Rendered into English prose for the use of those who cannot read the original, Homer (0.355)

They moored the vessel a little way out from land, and then came on shore and went to the house of King Alcinous.

The Odyssey: Rendered into English prose for the use of those who cannot read the original, Homer (0.379)

So they went to the mountains; and as it was a lovely day, they stayed there till the evening.

Grimms’ Fairy Tales, Jacob and Wilhelm Grimm (0.367)

At length they entered a path which, going out from the road, skirted the mountainside.

Twenty Years After, Alexandre Dumas and Auguste Maquet (0.370)

Thus talking they reached the foot of a high mountain which stood like an isolated peak among the others that surrounded it.

Don Quixote, Miguel de Cervantes Saavedra (0.381)

Between “In the sky” and “In a hole”:

I might have said, "Where is it?" for it did not seem in the room—nor in the house—nor in the garden; it did not come out of the air—nor from under the earth—nor from overhead.

Jane Eyre: An Autobiography, Charlotte Brontë

There’s nothing situate under heaven’s eye But hath his bound in earth, in sea, in sky.

The Complete Works of William Shakespeare, William Shakespeare

Within the heaven, Where peace divine inhabits, circles round A body, in whose virtue dies the being Of all that it contains.

The Divine Comedy, Dante Alighieri

He had found a wonderful spot, a sort of natural hollow in a rock, with an entrance like a doorway between two boulders.

Dracula, Bram Stoker

What a paltry place!’ said she; ‘to whom does that little dirty hole belong?'

Grimms’ Fairy Tales, Jacob and Wilhelm Grimm

Between “Safe inside by the hearth” and “out in the wild scary forest”:

Within is a fireplace, black and smoky, and usually unsteady with age.

The Souls of Black Folk, W. E. B. Du Bois (0.514)

At the end of this he threw open a heavy door, and I rejoiced to see within a well-lit room in which a table was spread for supper, and on whose mighty hearth a great fire of logs, freshly replenished, flamed and flared.

Dracula, Bram Stoker (0.531)

When it was bed-time, and the others went to bed, the mother said to the bear: ‘You can lie there by the hearth, and then you will be safe from the cold and the bad weather.

Grimms’ Fairy Tales, Jacob and Wilhelm Grimm (0.535)

And from this time forth everyone could again go into the forest with safety.

Grimms’ Fairy Tales, Jacob and Wilhelm Grimm (0.422)

There was once a forester who went into the forest to hunt, and as he entered it he heard a sound of screaming as if a little child were there.

Grimms’ Fairy Tales, Jacob and Wilhelm Grimm (0.476)

And “hovels and huts” vs. “castles and palaces and opulence” (note that the second result is a scene description):

Betwixt the hut and the fence, on the back side, was a lean-to that joined the hut at the eaves, and was made out of plank.

Adventures of Huckleberry Finn, Mark Twain (0.458)

A part of the Heath with a Hovel Scene V. A Room in Gloucester’s Castle Scene VI.

The Complete Works of William Shakespeare, William Shakespeare (0.458)

Gardens, convents, timber-yards, marshes; occasional lowly dwellings and great walls as high as the houses.

Les Misérables, Victor Hugo (0.462)

Ruined castles hanging on the precipices of piny mountains, the impetuous Arve, and cottages every here and there peeping forth from among the trees formed a scene of singular beauty.

Frankenstein; Or, The Modern Prometheus, Mary Wollstonecraft Shelley (0.469)

If you have built castles in the air, your work need not be lost; that is where they should be.

Walden, and On The Duty Of Civil Disobedience, Henry David Thoreau (0.429)

Not for her were mediaeval castles, even those that are specially described as small.

The Enchanted April, Elizabeth von Arnim (0.452)

Some Observations

The embedding model was a tiny one (“gguf/mxbai-embed-xsmall-v1-q8_0”; I didn’t test others as a comparison), and the texts available were not superb… I would redo this project with a deeper set of fiction I like, rather than the random top 100 books. I’m working on that.

Things we humans might think of as “midpoints” between concepts aren’t always so clear in the embedding “logic.” Take palaces vs. huts, or oceans vs. mountains: fields don’t really show up in between, for instance. I was just lucky in the first illustrated example with the time of day and temperatures.

Also, sentence embeddings get heavily biased by certain vocabulary words, especially proper names. This was a rat-trap in the “Once Upon a Time” chaining: a large number of the “next most similar” sentences came from the same novel with the same character names, since I didn’t put in a penalty for reusing the same book. You want to clean proper names out of the input text if you don’t want that kind of semantic search hit (or add a repeat penalty, as sketched below).
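
A repeat penalty is easy enough to bolt on. Here’s a sketch of one crude version, assuming each result dict carries the book title in a 'title' key (my combine_res helper merges in book metadata, but this exact key is an assumption):

def pick_next(results, seen_ids, book_counts, max_per_book=3):
    """Take the closest unseen hit, skipping books we've already overused."""
    for res in results:
        book = res['title']  # assumes book title was merged into the result dict
        if res['id'] in seen_ids or book_counts.get(book, 0) >= max_per_book:
            continue
        book_counts[book] = book_counts.get(book, 0) + 1
        return res
    return None  # everything nearby was seen or over-quota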

Tech How-To Tips

The embeddings were done with Simon Willison’s llm library, because I wanted to try it and thought I could work from sqlite (which I did not end up doing). I also wanted to try a gguf model, based on a post of his about a teeny embedding model.
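
For reference, embedding a sentence that way is minimal, assuming the llm-gguf plugin is installed alongside llm:

import llm  # pip install llm llm-gguf

# the tiny gguf embedding model named above, loaded via the llm-gguf plugin
model = llm.get_embedding_model("gguf/mxbai-embed-xsmall-v1-q8_0")
vector = model.embed("Once upon a time")  # a plain list of floats
print(len(vector))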

However: I hit an error with the llama wrapper on my Mac, and switched to a remote Linux VM, which didn’t produce the same error. Just FYI. (None of the tips online for this error worked for me.)

After embedding and dumping into sqlite, I decided to switch databases for various ease-of-use reasons… and after some flailing, ChromaDB was a good fit, with a nice dev experience. It offers good flexibility in querying, although I did nothing fancy. I may test out LanceDB next, but didn’t need it for this experiment. I moved the embeddings from sqlite via Pandas to parquet, merged in metadata from the book info, and loaded it all into Chroma.
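
That hand-off looks roughly like this. The table and column names here are hypothetical (llm’s own embeddings.db stores vectors as binary blobs, so you may need a decoding step); the Chroma calls are the standard client API:

import json
import sqlite3

import chromadb
import pandas as pd

# Pull sentences + embeddings out of sqlite; schema is hypothetical.
con = sqlite3.connect("embeddings.db")
df = pd.read_sql("SELECT id, content, embedding, book_title FROM sentences", con)
df["embedding"] = df["embedding"].apply(json.loads)  # if stored as JSON text

df.to_parquet("sentences.parquet")  # snapshot for easy reloading (needs pyarrow)

# Load into a persistent Chroma collection, with book info as metadata.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("gutenberg_sentences")
collection.add(
    ids=df["id"].astype(str).tolist(),
    documents=df["content"].tolist(),
    embeddings=df["embedding"].tolist(),
    metadatas=[{"title": t} for t in df["book_title"]],
)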

A couple of simple functions for the interpolation:

def interpolate_vectors(v1, v2, weight):
    """
    Interpolate between two vectors with given weight.
    weight = 0 returns v1, weight = 1 returns v2
    """
    return [a + (b - a) * weight for a, b in zip(v1, v2)]

def generate_interpolations(v1, v2, n):
    """
    Generate n interpolations between v1 and v2 (n must be >= 2).
    Returns list of vectors, starting at v1 and ending at v2.
    """
    weights = [i / (n - 1) for i in range(n)]
    return [interpolate_vectors(v1, v2, w) for w in weights]
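
For example, with two toy 2-d vectors and n=5, the weights land evenly at 0, 0.25, 0.5, 0.75, 1:

generate_interpolations([0.0, 0.0], [1.0, 2.0], n=5)
# -> [[0.0, 0.0], [0.25, 0.5], [0.5, 1.0], [0.75, 1.5], [1.0, 2.0]]

One design note: this is plain linear interpolation, so midpoints between unit-length embeddings come out slightly shorter than unit length. Cosine distance ignores magnitude, and each latent just gets snapped to its nearest real sentence anyway, so spherical interpolation (slerp) probably wouldn’t change much here.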

And then the logic to interpolate between two sentences… We get the embeddings of the two anchor sentences, interpolate between them mathematically (in n steps, for n sentences wanted), and use those intermediate latents to “look up” the closest real sentences in the DB. In this simplified code, “combine_res” just adds the metadata about the book to the result. We prevent a sentence from being used twice by tracking seen ids.

def merger(sent1, sent2, n=10):
    # interpolate between 2 sentences to join the parts of the story -- n sentences
    seen_ids = []
    sentences = []

    vector1 = embed_text(sent1)
    vector2 = embed_text(sent2)
    interps = generate_interpolations(vector1, vector2, n=n)

    # don't use these sentences again:
    res1 = combine_res(search_chroma(sent1, n=1))
    seen_ids.append(res1[0]['id'])
    res2 = combine_res(search_chroma(sent2, n=1))
    seen_ids.append(res2[0]['id'])

    for interp in interps:
        results = combine_res(search_chroma_with_vector(interp, n=15))  # get extras due to avoiding seen ids
        got = False
        for res in results:
            if res['id'] not in seen_ids:  # closest not-yet-used sentence wins
                sentences.append(res)
                seen_ids.append(res['id'])
                got = True
                break
                break
        if not got:
            print("blocked, too few to choose from")
            break
    return sentences
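
Calling it looks like this (the two anchor sentences here are hypothetical stand-ins for the real chain ends):

glue = merger(
    "And so they rode on into the gathering dark.",  # end of the "once upon a time" half
    "At last a great house came into view.",         # start of the reversed ending half
    n=10,
)
for res in glue:
    print(res['content'])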

There is more code in the git repo I made for this project, including the Gutenberg querying and cleaning, plus the sqlite-to-pandas-to-ChromaDB loading; but it’s not super well documented or clean, since I was (as always) under the end-of-November deadline and doing it last minute. Maybe it’s useful for you, though!