Fine-tuned LLMs for story generation

Fine-tuned open-source large language models to generate genre-conditioned horror stories and fairy tales, then evaluated how well different models match the tone and patterns of the training data.

View project report

Introduction

This project explores how open-source large language models can be adapted for a focused creative task. We targeted two genres (horror and fairy tales) and two models (Falcon 7B Instruct and Llama 2 7B). The goal was to understand both how to train these models on a small, domain-specific dataset and how their behavior changes after fine-tuning.

We used Project Gutenberg, a public-domain digital library of books in plain-text format, as the data source. Books were cleaned and converted into instruction-style prompt-and-response pairs. A typical training row has a prompt such as "Write me a horror story" or "Write me a fairy tale" and a roughly 50-line excerpt from a matching story as the response. This structure lets us reuse the same dataset across different instruction-tuned models.
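
As a rough sketch, a single training row could be assembled like this in Python (the function and field names are illustrative, not taken from the project):

    def make_training_row(genre, story_text, max_lines=50):
        # Pair a genre-conditioned instruction with a ~50-line excerpt.
        prompts = {
            "horror": "Write me a horror story",
            "fairy tale": "Write me a fairy tale",
        }
        response = "\n".join(story_text.splitlines()[:max_lines])
        return {"prompt": prompts[genre], "response": response}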

Dataset and prompt formatting

The dataset started as raw text files from Project Gutenberg for both genres. We removed boilerplate, normalized the content, and split each book into self-contained story slices. Each slice was tagged with its genre and turned into an instruction-style training example.
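
One plausible way to implement the cleaning and slicing, assuming the standard Project Gutenberg *** START / *** END markers (the project's exact cleaning rules may differ):

    import re

    def strip_gutenberg_boilerplate(raw):
        # Keep only the text between Gutenberg's START/END marker lines.
        start = re.search(r"\*\*\* ?START OF.*?\*\*\*", raw)
        end = re.search(r"\*\*\* ?END OF.*?\*\*\*", raw)
        body = raw[start.end():end.start()] if start and end else raw
        return body.strip()

    def slice_stories(text, lines_per_slice=50):
        # Split a cleaned book into consecutive self-contained slices.
        lines = text.splitlines()
        for i in range(0, len(lines), lines_per_slice):
            chunk = "\n".join(lines[i:i + lines_per_slice]).strip()
            if chunk:
                yield chunk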

Sample generated story (Horror, Falcon 7B tuned)

"It is the story of the old man who lived alone in the woods,"
answered the boy. "He had no friends, and no one ever came to
visit him. He was very lonely, and one night he heard a voice
calling him. He opened the door, and there stood a tall,
gray-haired man, with a white beard. He said he had come to take
him away, and that he was to go to a better place than he had ever
been to before. He took the old man by the hand, and led him away
into the woods. They walked all night, and when they reached a
little lake, the man said to the old man, "Now, I am going to tell
you a story. You have heard of the devil, and you know that he lives
in hell."

This sample shows how the tuned Falcon model adopts a genre-appropriate setting, character choices, and mood for horror stories, reflecting the instruction-based training.

Evaluation and results

We compared the original data, the base models, a partially trained Falcon, Falcon tuned for three epochs, and Llama 2 tuned for one epoch. For each configuration, we generated 100 stories and analyzed them with three complementary methods.
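
A minimal sketch of the per-configuration generation loop; the model id, decoding settings, and prompt handling here are assumptions, apart from the 100-story budget:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "tiiuae/falcon-7b-instruct"  # swapped out per configuration
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    prompt = "Write me a horror story"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    stories = []
    for _ in range(100):  # 100 stories per configuration
        out = model.generate(**inputs, max_new_tokens=512,
                             do_sample=True, temperature=0.9)
        stories.append(tokenizer.decode(out[0], skip_special_tokens=True))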

Overall, Falcon tuned with parameter-efficient fine-tuning (PEFT) and LoRA moved closer to the training distribution: its horror stories were more likely to have negative endings and its fairy tales more likely to carry a positive tone, while Llama 2 often pushed both genres toward more positive outcomes.
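
For reference, a typical LoRA setup with the Hugging Face peft library looks roughly like this; the rank, alpha, and dropout values are illustrative defaults, not the project's hyperparameters:

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b-instruct")
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["query_key_value"],  # Falcon's fused attention projection
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()  # only the adapter weights are trained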

Key outcomes and learnings

This project combined modern LLM tooling, parameter-efficient fine-tuning, and lightweight evaluation methods into a single workflow for creative text generation.