From birthday wish to hackathon: Speech to Excel
OR the story of how I got to feel like a 10x engineer
TLDR
Humans create myths - and in software engineering we have the mythical concept of a “10x Engineer”. It is simply defined as a “developer 10x more productive than most other competent developers”.
For this blog, the key takeaway is the feeling of having superpowers. When I started pair-programming with my team of 10 open Chrome tabs of GPT experts, I really felt like a 10x engineer.
Why this feeling? Because you feel like you can fly - which is even better than actually flying. You just know there is no integration abyss you cannot cross, that even the most mundane data translation will be a swift layover, and that there ain’t a stack trace so deep you can’t climb out of it (especially with the 100K-token models). Just knowing this for a fact gets you so much dopamine. No more getting stuck, and no more of that “I am gonna motivate myself in the microkitchen” stuff.
WOW
This blog post is the story of how that happened for me. It shares my usual prototyping workflow, along with the types of prompts I use to get insights.
Prosaic Intro which you can skip
This engineering project had an unlikely start - a long weekend coinciding with my spouse's birthday:
She: For my wish, you know Peter, you are a pretty good engineer …
Me: [#homerdisappear]
She: … and there is this thing I keep mentioning …
Me: [#intensifies]
She: … you remember when I was doing that annoying meeting note-taking work and you made this magic tool to transcribe it through GCP some years ago?
Me: Yep, it didn’t quite work.
She: Well, I heard the tools got better, while my problem got more painful. I go to many events and return with my brain buzzing from all the conversations. And you know how much I care about relationships - so I obviously want to follow up the next day. If only I could remember what we talked about.
Me: Hmm, maybe we can automate that. What would you like?
She: I would like to have something like the partners at my firm have - an assistant they can just call, who organizes the information for them. Maybe I record something and that ChatGPT summarizes it.
Let's get started
First, let's just try what’s out there for our task:
Input: A voice memo recording as mp4
Output: Excel row for each person with contact notes as columns
I like to get my hands dirty right away - to better understand the problem. Not having much expertise, I ask Google, which suggests OpenAI Whisper for transcription. I check the API, which looks modern, so I just write a short script and call it (Note: I tried to generate it, but ChatGPT from 2021 didn’t know the ChatGPT API from 2023).
Running the script yielded “fileformat not supported”. WTF OpenAI, your docs say you support .mp4. I guess you move so fast that your docs are already outdated! Luckily, I can just use ffmpeg to convert to mp3. Now it indeed works pretty well!
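For the curious, the conversion step can be as small as this. It is a minimal sketch and not the exact script I ran: it assumes ffmpeg is installed and on your PATH, and the function names are made up for illustration.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def ffmpeg_mp3_command(src: str) -> list[str]:
    """Build the ffmpeg command that converts `src` to an .mp3 next to it."""
    dst = str(Path(src).with_suffix(".mp3"))
    # -y: overwrite the output if it exists, -i: input file, -vn: drop any video stream
    return ["ffmpeg", "-y", "-i", src, "-vn", dst]


def convert_to_mp3(src: str) -> str:
    """Run ffmpeg and return the path of the resulting mp3 (assumes ffmpeg is on PATH)."""
    cmd = ffmpeg_mp3_command(src)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
    return cmd[-1]
```

Calling convert_to_mp3("memo.mp4") would leave a memo.mp3 in the same folder, ready for the transcription call.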
Next I take that output and prompt: “Given the transcript, generate these 10 columns defined as follows” - slowly the response reveals what I was looking for, and I feel like I am done:
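Roughly, the glue around that prompt looks like the sketch below. Treat it as an illustration under assumptions: the column names are placeholders (not my real 10 columns), and asking the model to answer in JSON is one possible convention that makes the reply easy to turn into CSV rows. The actual call to the model is left out.

```python
import csv
import io
import json

# Hypothetical columns, for illustration only.
COLUMNS = ["name", "company", "topics", "follow_up"]


def build_prompt(transcript, columns=COLUMNS):
    """Compose the instruction sent to the model together with the transcript."""
    column_list = ", ".join(columns)
    return (
        "Given the transcript, return a JSON list with one object per person, "
        f"using exactly these keys: {column_list}.\n\n"
        f"Transcript:\n{transcript}"
    )


def reply_to_csv(reply_json, columns=COLUMNS):
    """Turn the model's JSON reply into CSV text, one row per person."""
    people = json.loads(reply_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for person in people:
        # Missing keys become empty cells; unexpected keys are dropped.
        writer.writerow({column: person.get(column, "") for column in columns})
    return buffer.getvalue()
```

The CSV text can then be saved to a file and imported into Excel by hand, which is exactly the manual step that comes up next.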
Me: “Hey honey, look here is your present”.
She: Oh nice! How can I do it myself?
Me: Oh, you can just send it to your computer, run this Python script, you get a CSV, and in your Excel you just do import special → add row and you're done. Dead simple.
She: [not a happy face] - I mean this can be useful for other people too.
One face is worth a thousand words - so clearly I've got to automate this.
Design work [technical]
TBH it’s been a LONG time since I built something from the ground up. I was thinking of having a simple domain, and something in AWS to run ffmpeg and GPT queries. I felt stuck - but luckily, I had my local expert at hand to figure it out, alongside my good DevOps guru friend in town:
Me: I need to build a system where users send an MP4, and they get an Excel file back via email.
ChatGPT: Have you considered direct file uploads for a better user experience?
Me: Developer ease is key for me, I don't want to host any servers.
[cut for brevity, see appendix]
Design Review
They warned me that ChatGPT is like a high schooler - so let's spend some effort reviewing the resulting design.
I reviewed our requirements:
1. recording voice [up to 20 minutes]
2. run transcription [up to a minute]
3. run customized GPT prompt [up to a minute]
4. send the structured output
5. served in cloud
From these, only part 3 is our “core custom business logic”. Ideally, part 3 is the only part we need to develop ourselves. The other parts (1, 2, 4, 5) can hopefully be reused from the wild.
The design would then be:
record with Apple Voice Memos
get the recording to the cloud via email with attachment → Lambda → email back
the logic for parts 2, 3, 4 would be served out of that Lambda
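To make the email → Lambda leg a bit more concrete, here is a minimal sketch of the first thing the Lambda has to do: pull the recording out of the incoming email. It assumes the function somehow receives the raw MIME bytes of the message (how they get there, for example via SES and S3, is not shown), and the function name is made up. Only Python's standard library is used.

```python
from email import message_from_bytes, policy


def extract_audio_attachments(raw_email: bytes):
    """Return (filename, content_bytes) for each audio/video attachment in a raw email."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        # Voice memos may be labeled audio/* or video/* depending on the mail client.
        if part.get_content_maintype() in ("audio", "video"):
            found.append((part.get_filename(), part.get_payload(decode=True)))
    return found
```

From here, the bytes would go through the ffmpeg conversion, the transcription and the prompt, and the result would be emailed back to the sender.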
What I liked:
it was really minimalistic,
plus the entire thing can be processed asynchronously (with retries)
without any need of frontend or a server.
AND lastly easy to explain: it’s like sending your secretary a voicemail.
Implementation - The birth of a 10x Engineer
<history-time> Back to my elementary school times. I still remember the thrill when I was learning to code - it was kinda like discovering legos on steroids - as you could build out your ideas quite well.
After university, once I entered the workforce, there was always a tradeoff between my childish builder passion and our business needs. Slowly, coding became more of “work which needs to get done”. </history-time>
At present time - prototyping with AI agents is like coding on steroids again. Or as my friend put it: “it’s just much closer to speed of thought”.
For the prototype implementation itself, I would say about 85% of the coding job was automated. The job was split so that the AI agents wrote most of the logic - while I brought in the intent, the supervision and, maybe most importantly, the clicking of the right buttons. See Appendix for details.
Conclusion
Great job us - we got what we wanted!
But the true learning here comes from the process we went through, and not the solution we created. Here we became trailblazers and not mere hikers following a path. We got dropped at a random spot in the wild, and using magical tools we made our own path. Going through this Unsupervised Journey, we gained the confidence to blaze our own paths again and again.
So I encourage you too: go embark on a side project for your dear one, and learn to fly with those AI agents.
== THE END ==
Appendix 1: Rest of the design discussion
[repeated] Me: Developer ease is key for me, I don't want to host any servers.
ChatGPT: You could use SendGrid to process incoming emails and trigger a webhook. Let me break down the steps.
Me: Tell me more about the email parsing.
ChatGPT: SendGrid has an inbound parse feature. I can also suggest alternatives.
Me: Hm, that's too much novelty for me to implement. Can we do the entire system in AWS?
ChatGPT: You can set up services like S3, SES, Lambda, and SQS. I've outlined the procedure.
Me: Instead of Excel, how about a Google Sheet?
ChatGPT: Use the Google Sheets API from Lambda. It allows creation, populating, and sharing.
Me: Can I get a sample Python code for that?
ChatGPT: Sure, I've provided an outline for authenticating, creating sheets, populating from CSV, and sharing with users.
Me: Please summarize the discussed design in bullet points ordered by the user experience.
Appendix 2: List of AI agent Chat Threads
Set up a new domain (should have used Namecheap)
Emails to katka.ai redirected
Convert Audio to MP3
SES Email Setup on AWS
Email Attachment from SES
Parse SES Email Attachments
Lambda function which can handle ffmpeg and Python code
CSV to XLS Conversion [stretch, didn’t happen]
Excel with Multiple Sheets [stretch, didn’t happen]
Convert Python objects to JSON
HTML code for this table
Sassy Follow-up style
Parsing Email Timestamps
Why this error in my lambda script? [a LONG thread indeed]
Email Alert on Lambda Failure.
Extract persons in the transcript