How to name your baby using machine learning
The dataset
Although I wanted to create a name generator, what I really ended up building was a name predictor. I figured I would find a bunch of descriptions of people (biographies), block out their names, and build a model that would predict what those (blocked out) names should be.
Happily, I found just that kind of datasethere, in a Github repo calledwikipedia-biography-datasetby David Grangier. The dataset contains the first paragraph of 728,321 biographies from Wikipedia, as well as various metadata.
Naturally, there’s a selection bias when it comes to who gets a biography on Wikipedia (according toThe Lily, only 15% of bios on Wikipedia are of women, and I assume the same could be said for non-white people). Plus, the names of people with biographies on Wikipedia will tend to skew older, since many more famous people were born over the past 500 years than over the past 30 years.
To account for this, and because I wanted my name generator to yield names that are popular today, I downloaded the census’smost popular baby namesand cut down my Wikipedia dataset to only include people with census-popular names. I also only considered names for which I had at least 50 biographies. This left me with 764 names, majority male.
The most popular name in my dataset was “John,” which corresponded to 10092 Wikipedia bios (shocker!), followed by William, David, James, George, and the rest of the biblical-male-name docket. The least popular names (that I still had 50 examples of) were Clark, Logan, Cedric, and a couple more, with 50 counts each. To account for this massive skew, I downsampled my dataset one more time, randomly selecting 100 biographies for each name.
Training a model
Once I had my data sample, I decided to train a model that, given the text of the first paragraph of a Wikipedia biography, would predict the name of the person that bio was about.
If it’s been a while since you’ve read a Wikipedia biography, they usually start something like this:
Because I didn’t want my model to be able to “cheat,” I replaced all instances of the person’s first and last name with a blank line: “___”. So the bio above becomes:
Alvin is a fictional character in the Fox animated series…
This is the input data to my model, and its corresponding output label is “Dale.”
Once I prepared my dataset, I set out to build a deep learning language model. There were lots of different ways I could have done this (here’s one example inTensorflow), but I opted to useAutoML Natural Language, a code-free way to build deep neural networks that analyze text.
I uploaded my dataset into AutoML, which automatically split it into 36,497 training examples, 4,570 validation examples, a 4,570 test examples:
To train a model, I navigated to the “Train” tab and clicked “Start Training.” Around four hours later, training was done.
So how well did the name generator model do?
If you’ve built models before, you know the go-to metrics for evaluating quality are usually precision and recall (if you’re not familiar with these terms or need a refresher, check out this nice interactivedemomy colleagueZack Akilbuilt to explain them!). In this case, my model had a precision of 65.7% and a recall of 2%.
But in the case of our name generator model, these metrics aren’t really that telling. Because the data is very noisy — there is no “right answer” to what a person should be named based on his or her life story. Names are largely arbitrary, which means no model can make really excellent predictions.
My goal was not to build a model that with 100% accuracy could predict a person’s name. I just wanted to build a model that understoodsomethingabout names and how they work.
One way to dig deeper into what a model’s learned is to look at a table called aconfusion matrix, which indicates what types of errors a model makes. It’s a useful way to debug or do a quick sanity check.
In the “Evaluate” tab, AutoML provides a confusion matrix. Here’s a tiny corner of it (cut off because I hadsooomany names in the dataset):
So for example, take a look at therowlabeled “ahmad.” You’ll see a light blue box labeled “13%”. This means that, of all the bios of people named Ahmad in our dataset, 13% were labeled “ahmad” by the model. Meanwhile, looking one box over to the right, 25% of bios of peopled named “ahmad” were (incorrectly) labeled “ahmed.” Another 13% of people named Ahmad were incorrectly labeled “alec.”
Although these are technically incorrect labels, they tell me that the model has probably learnedsomethingabout naming, because “ahmed” is very close to “ahmad.” Same thing for people named Alec. The model labeled Alecs as “alexander” 25% of the time, but by my read, “alec” and “alexander” are awfully close names.
Running sanity checks
Next, I decided to see if my model understood basic statistical rules about naming. For example, if I described someone as a “she,” would the model predict a female name, versus a male name for “he”?
For the sentence “She likes to eat,” the top predicted names were “Frances,” “Dorothy,” and “Nina,” followed by a handful of other female names. Seems like a good sign.
For the sentence “He likes to eat,” the top names were “Gilbert,” “Eugene,” and “Elmer.” So it seems the model understands some concept of gender.
Next, I thought I’d test whether it was able to understand how geography played into names. Here are some sentences I tested and the model’s predictions:
“He was born in New Jersey” — Gilbert
“She was born in New Jersey” — Frances
“He was born in Mexico.” — Armando
“She was born in Mexico” — Irene
“He was born in France.” — Gilbert
“She was born in France.” — Edith
“He was born in Japan” — Gilbert
“She was born in Japan” — Frances
I was pretty unimpressed with the model’s ability to understand regionally popular names. The model seemed especially bad at understanding what names are popular in Asian countries, and tended in those cases just to return the same small set of names (i.e. Gilbert, Frances). This tells me I didn’t have enough global variety in my training dataset.
Model bias
Finally, I thought I’d test for one last thing. If you’ve read at all aboutModel Fairness, you might have heard that it’s easy to accidentally build a biased, racist, sexist, agest, etc. model, especially if your training dataset isn’t reflective of the population you’re building that model for. I mentioned before there’s a skew in who gets a biography on Wikipedia, so I already expected to have more men than women in my dataset.
I also expected that this model, reflecting the data it was trained on, would have learned gender bias — that computer programmers are male and nurses are female. Let’s see if I’m right:
“They will be a computer programmer.” — Joseph
“They will be a nurse.” — Frances
“They will be a doctor.” — Albert
“They will be an astronaut.” — Raymond
“They will be a novelist.” — Robert
“They will be a parent.” — Jose
“They will be a model.” — Betty
Well, it seems the modeldidlearn traditional gender roles when it comes to profession, the only surprise (to me, at least) that “parent” was predicted to have a male name (“Jose”) rather than a female one.
So evidently this model has learnedsomethingabout the way people are named, but not exactly what I’d hoped it would. Guess I’m back to square one when it comes to choosing a name for my future progeny…Dale Jr.?
Thisarticlewas written byDale Markowitz, an Applied AI Engineer at Google based in Austin, Texas, where she works on applying machine learning to new fields and industries. She also likes solving her own life problems with AI, and talks about it on YouTube.