And now…

…for something completely different. My wife and I are expecting our first child in a few months, which is wonderful and all, but it means that we are faced with the daunting task of coming up with a name. Being data analysis types (she much more than me), and subscribing to the philosophy that there is no problem that Python can’t solve, we decided to write competing scripts to select a good subset of names. This is my first crack at a script (which I’ve titled BAMBI, for BAby naMe BIas); I’ve also posted the code to GitHub. The GitHub version will stay up to date as I refine my method (in case you too would like Python to name your child).

My general approach was to take the list of baby names used in 2014 and published by the Social Security Administration here, bias against the very rare and very common names (personal preference), then somehow use a combination of our birth dates and a random number generator to create a list of names for further consideration. Okay, let’s give it a go…

First, define some variables. Their use will be apparent later.  Obviously replace 999999 with the real values.

get = 100 # how many names do you want returned?
wife_bday = 999999
my_bday = 999999
due_date = 999999
aatc = 999999 # address at time of conception
size = (wife_bday + my_bday) // (due_date // aatc) # size of the first name pool; // keeps it a whole number
start_letters = ['V','M'] # restrict names to those that start with these letters, can leave as empty list if no restriction desired
sex = 'F' # F or M
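
To make the arithmetic behind size concrete, here is a minimal sketch of the same formula wrapped in a function. The function name pool_size and the example values are made up for illustration (they are not real dates or addresses), and the guard against a zero divisor is an addition that the original one-liner does not have.

```python
def pool_size(wife_bday, my_bday, due_date, aatc):
    """Size of the first name pool, as a whole number."""
    divisor = due_date // aatc
    if divisor == 0:
        # happens whenever aatc is larger than due_date
        raise ValueError("due_date // aatc is zero, cannot divide by it")
    return (wife_bday + my_bday) // divisor

# Hypothetical placeholder values, purely to show the arithmetic.
example_size = pool_size(wife_bday=120383, my_bday=71582,
                         due_date=101515, aatc=1234)
```

The result has to be a whole number because it is later used as the number of names to draw.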

Then import the necessary modules.

import matplotlib
import numpy as np
import matplotlib.pyplot as py
import math
import scipy.stats as sps

Define a couple of variables to hold the names and abundance data, then read the file from the SSA.

p = [] # this will hold abundance
names = [] # this will hold the names
            
with open('yob2014.txt', 'r') as names_in:
    for line in names_in:
        # each line looks like name,sex,count
        name, name_sex, count = line.rstrip().split(',')
        # keep the name if the sex matches and, when start_letters
        # is not empty, the first letter matches too
        if name_sex == sex and (len(start_letters) == 0 or name[0] in start_letters):
            p.append(float(count))
            names.append(name)
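
If you want to try the filtering logic without downloading the SSA file, the following is a small sketch that does the same job on an in-memory sample. The function name read_names and the sample rows are assumptions for illustration; only the Emma count comes from the 2014 data mentioned in this post.

```python
import csv
import io

def read_names(handle, sex, start_letters=()):
    """Return (names, counts) for rows matching sex and, if given, start_letters."""
    names, counts = [], []
    for row in csv.reader(handle):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        name, row_sex, count = row
        if row_sex != sex:
            continue
        if start_letters and name[0] not in start_letters:
            continue
        names.append(name)
        counts.append(float(count))
    return names, counts

# Tiny stand-in for yob2014.txt. Only the Emma count comes from the post;
# the other rows are made-up illustrations.
sample = io.StringIO("Emma,F,20799\nMia,F,100\nVera,F,50\nMason,M,200\n")
names_demo, counts_demo = read_names(sample, sex='F', start_letters=['V', 'M'])
```

Passing an open file handle instead of the StringIO object should work the same way.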

Excellent. Now the key feature of my method is that it biases against both very rare and very common names. To take a look at the abundance distribution, run:

py.hist(p, bins = 100)

[Figure: figure_1, histogram of name abundance]

Ignore the ugly X-axis. Baby name abundance follows a logarithmic distribution; a few names are given to a large number of babies, with a long “tail” of rare baby names. In 2014 Emma led the pack with 20,799 new Emmas welcomed into the world. My approach (I have no idea if it’s at all valid, so use on your own baby with caution) was to fit a normal distribution to the sorted list of names. I got the parameters for the distribution from the geometric mean and standard deviation (as the arithmetic mean and SD are not very meaningful for a log distribution). The geometric mean can be calculated with the gmean function, but I could not find a ready-made function for the geometric standard deviation, so I wrote one:

geo_mean = sps.mstats.gmean(p)
print('mean name abundance is %s' % geo_mean)

def calc_geo_sd(geo_mean, p):
    # geometric SD: exp of the root mean squared log-ratio to the geometric mean
    p2 = []

    for i in p:
        p2.append(math.log(i / geo_mean) ** 2)

    sum_p2 = sum(p2)
    geo_sd = math.exp(math.sqrt(sum_p2 / len(p)))
    return geo_sd

geo_sd = calc_geo_sd(geo_mean, p)
print('the standard deviation of name abundance is %s' % geo_sd)
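
As a sanity check on the formula, here is a minimal standard-library sketch that computes both statistics from the logs of the values. The function name geometric_mean_sd and the three test values are made up for illustration; it divides by n, like calc_geo_sd above.

```python
import math

def geometric_mean_sd(values):
    """Return (geometric mean, geometric SD) of positive numbers."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    # population variance of the logs (divide by n, as calc_geo_sd does)
    var_log = sum((x - mean_log) ** 2 for x in logs) / len(logs)
    return math.exp(mean_log), math.exp(math.sqrt(var_log))

# 1, 10 and 100 are evenly spaced on a log scale, so the geometric mean is 10.
gm_demo, gsd_demo = geometric_mean_sd([1, 10, 100])
```

Keep in mind that the geometric SD is a multiplicative factor rather than an amount you add or subtract: for roughly log-normal data, about two thirds of the values fall between the geometric mean divided by it and the geometric mean multiplied by it.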

## get a gaussian distribution of mean = geo_mean and sd = geo_sd
## sampled once per baby (sum(p) draws); the histogram below
## reduces the draws to len(p) bins, one per name

dist_param = sps.norm(loc = geo_mean, scale = geo_sd)
dist = dist_param.rvs(size = int(sum(p))) # size must be a whole number

## now get the probability of these values

print('wait for it, generating name probabilities...')
temp_hist = py.hist(dist, bins = len(p))
probs = temp_hist[0]
probs = probs / sum(probs) # potentially max(probs)
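
For comparison, here is a sketch of a different way to get probabilities that favor names of middle abundance. To be clear, this is not what the code above does: instead of sampling and binning, it weights each name directly by a bell curve in log-abundance space, centered on the log of the geometric mean. The function name log_normal_weights and the example counts are made up for illustration.

```python
import math

def log_normal_weights(counts):
    """Weight each count by a Gaussian bump in log space, normalized to sum to 1."""
    logs = [math.log(c) for c in counts]
    mu = sum(logs) / len(logs)  # log of the geometric mean
    # log of the geometric SD (population form, divide by n)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    if sigma == 0:
        # all counts identical: fall back to uniform weights
        return [1.0 / len(counts)] * len(counts)
    raw = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in logs]
    total = sum(raw)
    return [w / total for w in raw]

# Made-up counts spanning rare to common names.
weights_demo = log_normal_weights([5, 50, 500, 5000, 50000])
```

With weights like these, a name's chance of being picked depends on its own count rather than on its position in the sorted list, which may or may not be what you want.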

At this point we have a list of probabilities that is the same length as our list of names and that favors names of middle abundance. The next and final step is to generate two pools of possible names. The first pool is derived from a biased-random selection that takes into account the probabilities, birth dates, due date, and address at time of conception. The second, truly random pool is a subset of the first with the desired size (here 100 names).

possible_names = np.random.choice(names, size = size, p = probs, replace = True)
final_names = np.random.choice(possible_names, size = get, replace = False)
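
Two things are worth knowing about this step. Because the first pool is drawn with replacement, it can contain the same name many times, so the final list can too; and np.random.choice with replace = False raises an error if the first pool is smaller than get. The following is a minimal standard-library sketch of the same two-stage idea that handles both cases. The function name pick_names, the demo names, and the demo probabilities are assumptions for illustration, and removing duplicates is a design change: it means the second stage no longer favors names that were drawn often in the first.

```python
import random

def pick_names(names, probs, pool_size, get, seed=None):
    """Two-stage pick: weighted pool first, then a uniform subset of it."""
    rng = random.Random(seed)
    # Stage 1: weighted draw with replacement, so names can repeat.
    pool = rng.choices(names, weights=probs, k=pool_size)
    # Collapse repeats (keeping first-seen order) so the result has no duplicates.
    unique_pool = list(dict.fromkeys(pool))
    # Stage 2: uniform draw without replacement, capped at what the pool can supply.
    return rng.sample(unique_pool, min(get, len(unique_pool)))

demo_names = ['Vera', 'Mia', 'Mabel', 'Violet', 'Maeve']  # illustrative only
demo_probs = [0.1, 0.2, 0.4, 0.2, 0.1]                    # illustrative only
picked = pick_names(demo_names, demo_probs, pool_size=50, get=3, seed=1)
```

Passing a seed makes the pick repeatable, which is handy if you need to settle an argument about what the script "really" chose.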

And finally, print your list of names! I recommend roulette or darts to narrow this list further.

with open('pick_your_kids_name.txt', 'w') as output:
    for name in final_names:
        print(name) # show the name on screen
        output.write(name + '\n') # and save it to the file
