Introduction

I believe reading books is one of the most important ways to learn from others. Books, primarily physical ones (in which I include both paper and e-books), are often a wealth of knowledge. One of the challenges with reading books is how to take notes. Beyond that, ask yourself: “What do I remember from a book I read a year ago?” It’s likely you won’t remember all that much. Certain books, like The 7 Habits of Highly Effective People [1] and Getting Things Done [2], are best revisited at least once a year.

I’ve been generating AI summaries of books for a while now, but today I decided to optimize my pipeline and write it up so others can accomplish something similar.

There’s one warning I have to give before continuing. Because of the insanity regarding DRM on books, especially eBooks, doing any of this requires you to get around the DRM in some fashion, whether that’s “sailing the high seas” (piracy) or breaking the DRM. Technically both are illegal [3]. I’m not going to specify how you approach this; I’m simply stating that it’s a requirement. I fully encourage buying the book(s) you intend to do this with. I have a library of over 1,400 physical books (far larger if I count eBooks) and over 100 audiobooks from Audible. That said, I absolutely hate the way DRM is handled and how much harder it makes this process.

How it works - high level

At a high level, we take a book, chunk it into discrete components (ideally by chapter, though depending on the size you may have to go smaller), and summarize each component using AI. Then we combine the parts to create an overall book summary and an overall terminology list.

Graphically, it looks like the following:

Each phase will be discussed below. The source code, like always, is referenced near the end.

For this walkthrough, I’ll use a book I purchased this weekend and started reading: Quit, by Annie Duke [4].

Definition Phase

In the “Definition” phase there are 3 things we need to do. The first, and most obvious, is to get a digital copy of the book that can be read without DRM. My eBook reader supports ePub, and I have it with me more often than a hardbound book, so I prefer that format anyway and find it the best one to start with.

Next, we need to create a PDF of said book. Personally, I’m a big fan of Calibre [5]. It can manage both physical books and eBooks, and options exist for web interfaces and the like. It’s fantastic software. Once the ePub version is loaded, you can convert it into a number of formats, including PDF.

Next, we need to define the book’s architecture (its sections and chapters). This is all encoded in a script, and a full example is in the source code, but the definition looks like the following:

sections = [
    Section(number=1, title="The Case for Quitting"),
    Section(number=2, title="In the Losses"),
    Section(number=3, title="Identity and Other Impediments"),
    Section(number=4, title="Opportunity Cost")
]

chapters = [
    Chapter(title="The Opposite of a Great Virtue Is Also a Great Virtue",
            number=1,
            section=sections[0],
            start=20,
            end=35),
    Chapter(title="Quitting On Time Usually Feels like Quitting Too Early",
            section=sections[0],
            number=2,
            start=36,
            end=55),
    Chapter(title="Should I Stay, or Should I Go?",
            section=sections[0],
            number=3,
            start=56,
            end=72),
    Chapter(title="Escalating Commitment",
            section=sections[1],
            number=4,
            start=74,
            end=83),
    Chapter(title="Sunk Costs and the Fear of Waste",
            section=sections[1],
            number=5,
            start=84,
            end=101),
    Chapter(title="Monkeys and Pedestals",
            section=sections[1],
            number=6,
            start=102,
            end=122),
    Chapter(title="You Own What You've Bought and What You've Thought: Endowment and Status Quo Bias",
            section=sections[2],
            number=7,
            start=124,
            end=141),
    Chapter(title="The Hardest Thing to Quit Is Who you Are: Identity and Dissonance",
            section=sections[2],
            number=8,
            start=142,
            end=159),
    Chapter(title="Find Someone Who Loves You but Doesn't Care about Hurt Feelings",
            section=sections[2],
            number=9,
            start=160,
            end=176),
    Chapter(title="Lessons from Forced Quitting",
            section=sections[3],
            number=10,
            start=178,
            end=196),
    Chapter(title="The Myopia of Goals",
            section=sections[3],
            number=11,
            start=197,
            end=212)
]

This process is quite manual. In Calibre, you can right-click the PDF link to open the file in your default application (in my case, Preview.app).

Once it’s open in your PDF application, you have to define each of the entries above. This is a bit time-consuming: each chapter has a title, a chapter number (number), and start and end pages. Optionally, it can belong to a section. This book has 4 discrete sections.
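The real Section and Chapter classes live in the source code; the following is only a minimal sketch of what such definitions could look like, assuming plain dataclasses. The validate helper is hypothetical (it is not part of the original script) and assumes start and end are inclusive page numbers. Since the ranges are typed in by hand, a quick sanity check like this can catch overlapping or reversed ranges before any AI calls are made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Section:
    number: int
    title: str


@dataclass(frozen=True)
class Chapter:
    title: str
    number: int
    start: int  # first PDF page of the chapter
    end: int    # last PDF page of the chapter (assumed inclusive)
    section: Optional[Section] = None  # sections are optional


def validate(chapters):
    """Hypothetical helper: return a list of problems found in the page ranges."""
    problems = []
    ordered = sorted(chapters, key=lambda c: c.number)
    # Each chapter should start after the previous chapter ends.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start <= prev.end:
            problems.append(
                f"chapter {cur.number} starts on page {cur.start}, "
                f"but chapter {prev.number} ends on page {prev.end}"
            )
    # A chapter cannot end before it starts.
    for ch in ordered:
        if ch.end < ch.start:
            problems.append(f"chapter {ch.number} ends before it starts")
    return problems
```

Running validate(chapters) on the list above and printing anything it returns is a cheap way to catch a mistyped page number.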

AI Phase

In the AI phase, we have three discrete steps. To avoid making this post too long, I’ll stay at a high level: LangChain is used, as in other projects I mentioned here. The script uses PyPDF2 to read the PDF file, extracts the text from each page within a chapter’s range, and then summarizes that single chapter. To keep it generic, I ask the model to provide a minimum of 3 sections and up to 5. These are the following:

A high-level summary: 1-3 paragraphs that can serve as an executive summary.
Topics Discussed: A bulleted list of items with description.
Takeaways: The most important items that a person should take away from the corresponding text.
Recommended Activities: Any activities the text recommends, if there are any.
Terminology: If there’s more advanced terminology (either specific to the text, or specialized), then include those definitions.
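The actual prompt is in lib/summarize.py, so treat the following as a rough sketch of how a prompt with these five sections might be assembled, not as the wording I use. The names SECTIONS and build_chapter_prompt are made up for illustration, and I am assuming the first three sections are the required ones while the last two are included only when they apply.

```python
# (section name, description, required?) -- wording here is illustrative only.
SECTIONS = [
    ("High-Level Summary",
     "1-3 paragraphs that can serve as an executive summary.", True),
    ("Topics Discussed",
     "A bulleted list of items with descriptions.", True),
    ("Takeaways",
     "The most important items a person should take away.", True),
    ("Recommended Activities",
     "Activities the text recommends.", False),
    ("Terminology",
     "Definitions of advanced or specialized terms.", False),
]


def build_chapter_prompt(chapter_title, chapter_text):
    """Assemble a summarization prompt for a single chapter."""
    lines = [
        f'Summarize the chapter "{chapter_title}" in Markdown.',
        "Use these sections:",
    ]
    for name, description, required in SECTIONS:
        tag = "required" if required else "only if applicable"
        lines.append(f"- {name} ({tag}): {description}")
    lines.append("")
    lines.append("Chapter text:")
    lines.append(chapter_text)
    return "\n".join(lines)
```

Keeping the section list as data makes it easy to add or drop a section without touching the prompt-building logic.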

It repeats the above process for each chapter individually. The main reason is context size, but there’s little reason to do the whole book at once even if it could be done.
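As a sketch of the per-chapter extraction, here is a small helper that joins the text of a page range. It is not the code from the repository: chapter_text is a hypothetical name, and I am assuming the start and end values are 1-based, inclusive PDF page numbers. It works on any sequence of objects with an extract_text() method, which is what PyPDF2's PdfReader exposes as reader.pages (in PyPDF2 2.x and later).

```python
def chapter_text(pages, start, end):
    """Join the text of pages start..end (assumed 1-based and inclusive)."""
    if start < 1 or end < start or end > len(pages):
        raise ValueError(f"bad page range {start}-{end} for {len(pages)} pages")
    parts = []
    for index in range(start - 1, end):
        # extract_text() can come back empty for image-only pages.
        parts.append(pages[index].extract_text() or "")
    return "\n".join(parts)


# Usage with PyPDF2 (file name is a placeholder):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("book.pdf")
#   text = chapter_text(reader.pages, 20, 35)
```

If the page numbers shown in your PDF viewer are offset from the file's actual page order (front matter often causes this), the ranges need to be adjusted accordingly.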

After this step is done, the system feeds all of that information back in once more to generate a general book summary. Check lib/summarize.py in the example code to see the details of how I handle the prompt.
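The two passes can be sketched roughly as follows. This is a simplified illustration, not the LangChain code from the repository: summarize_book is a hypothetical name, the prompt strings are placeholders, and summarize stands in for whatever function sends a prompt to the model and returns its text.

```python
def summarize_book(chapter_texts, summarize):
    """Summarize each chapter, then summarize the summaries.

    chapter_texts: dict mapping chapter title -> chapter text
    summarize:     callable taking a prompt string and returning a string
    """
    # Pass 1: one model call per chapter keeps each request within context size.
    chapter_summaries = {}
    for title, text in chapter_texts.items():
        chapter_summaries[title] = summarize(f"Summarize this chapter:\n{text}")

    # Pass 2: feed the (much shorter) chapter summaries back in together.
    combined = "\n\n".join(
        f"## {title}\n{summary}" for title, summary in chapter_summaries.items()
    )
    book_summary = summarize(
        f"Write an overall book summary from these chapter summaries:\n{combined}"
    )
    return chapter_summaries, book_summary
```

Passing the model call in as a function means the flow can be tried with a stub before spending anything on real API calls.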

The script writes intermediate files for each of these, and also keeps the results in memory.

Post Processing

After everything’s processed individually, there are a few post-processing steps. First, we output the entire Markdown file as full_book_summary.md in the output directory. Then, if we pass in the options to make a PDF or an ePub file, those are also created with the name of the book. Pandoc handles these conversions. In the end, we’re given files we can consume later, either by importing them into a destination system or by printing them out.
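For the conversion step, a minimal sketch of building the Pandoc call could look like this. The function name pandoc_command is hypothetical and the repository may invoke Pandoc differently; the only Pandoc feature relied on here is the -o option, with the output format inferred from the file extension. Note that Pandoc needs a PDF engine (LaTeX by default) installed to produce PDFs.

```python
from pathlib import Path


def pandoc_command(markdown_path, book_title, fmt):
    """Build the argument list that converts the summary to PDF or ePub."""
    if fmt not in ("pdf", "epub"):
        raise ValueError(f"unsupported format: {fmt}")
    source = Path(markdown_path)
    # Write the result next to the Markdown file, named after the book.
    target = source.with_name(f"{book_title}.{fmt}")
    return ["pandoc", str(source), "-o", str(target)]


# To actually run it (requires Pandoc on the PATH):
#   import subprocess
#   subprocess.run(pandoc_command("output/full_book_summary.md", "Quit", "epub"),
#                  check=True)
```

Building the command as a list rather than a shell string avoids quoting problems with book titles that contain spaces.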

Summary

I encourage you to look at the source code if you’re interested in this. You can check out the repository, and as long as you have an environment set up and the libraries installed, you should be able to run this without issues. Personally, I stick the Markdown into Obsidian and look at it periodically, potentially supplementing and/or cutting information.

Source code

You can view an example of the output from the time I ran this here.

You can view the GitHub repo here.

AI Generated Summaries

August 04, 2025

David Thole
