Building a model from scratch: Predicting Diabetes

Last updated: 19 May 2025 · 14 min read

Over the past few weeks, I’ve been working through the fantastic Practical Deep Learning course with FastAI, and after completing the Titanic dataset example in the course, I wanted to try the same principles on a custom dataset: Pima Indians Diabetes. As we did in the course, I decided to do it twice:

  • Once with pure PyTorch from scratch, to really understand how gradients flow.
  • Then again with FastAI’s Tabular Learner, to see how much boilerplate I could avoid.

You can open and follow the actual code in the accompanying Kaggle notebooks.

1. Dataset & the missing-values rabbit hole

This part took longer than expected. The dataset represents some missing values as 0s in columns like Glucose, BloodPressure, and BMI, but I couldn’t find this mentioned anywhere in the dataset documentation on Kaggle, so the following lines of code took me days to figure out:

import numpy as np

# Zeros in these columns are physiologically impossible, so treat them as missing
missing_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[missing_cols] = df[missing_cols].replace(0, np.nan)
df = df.fillna(df.median())  # impute each column with its median

Without that, my first few models were stuck at only ~62% accuracy. Once I filled missing values correctly, things improved fast.
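In hindsight, a quick frequency check would have surfaced this immediately. A minimal sketch, assuming df is the freshly loaded DataFrame (the filename is the standard one from Kaggle):

import pandas as pd

df = pd.read_csv('diabetes.csv')
print((df == 0).sum())  # hundreds of zero Insulin/SkinThickness rows are a giveaway
print(df.describe())    # a min of 0 for Glucose or BMI is physiologically impossible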

2. Training a model from scratch in PyTorch

I began by loading the data manually and calculating predictions with a hand-built function. Here’s a peek:

# `bias` is a module-level tensor; predictions are a weighted sum of the inputs plus bias
def calc_preds(coeffs, indeps): return ((indeps * coeffs) + bias).sum(axis=1)
# loss is the mean absolute error between predictions and targets
def calc_loss(coeffs, indeps, deps): return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

The training loop was completely manual—just basic Python and tensor ops:

def train_model(epochs=20, lr=0.1):
    coeffs = init_coeffs()  # randomly initialized coefficients with requires_grad enabled
    for i in range(epochs): one_epoch(coeffs, lr)
    return coeffs

Here is what one_epoch looked like in my example:

def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)  # trn_indep/trn_dep are the training tensors
    loss.backward()
    with torch.no_grad(): update_coeffs(coeffs, lr)  # step without tracking gradients
    print(f"{loss:.3f}", end="; ")

I upgraded from a linear model to a single hidden-layer MLP, then to two layers. Eventually, I hit 79.7% accuracy on validation.
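The upgrade mostly meant swapping the single coefficient vector for weight matrices with a non-linearity in between. A sketch of the hidden-layer version of calc_preds, along the lines of the course’s (here coeffs is a tuple of two weight matrices and a constant):

import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs
    res = F.relu(indeps @ l1)  # hidden layer with ReLU activation
    res = res @ l2 + const     # linear output layer
    return torch.sigmoid(res)  # squash to a probability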

3. FastAI in ~four lines

Next, I tried FastAI’s TabularPandas API. Here’s how little code it took to preprocess, split, and train the model:

to = TabularPandas(
    df,
    procs=[FillMissing, Normalize, Categorify],  # impute, standardize, encode
    cat_names=[],              # every feature here is continuous
    cont_names=cont_names,
    y_names='Outcome',
    y_block=CategoryBlock(),   # treat the target as a classification label
    splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
learn.fit(12, lr=1e-2)
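For reference, cont_names and splits were defined earlier in the notebook; something along these lines (the 80/20 split and seed here are illustrative assumptions):

from fastai.tabular.all import *

cont_names = [c for c in df.columns if c != 'Outcome']  # all eight features are continuous
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))  # illustrative split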

I used learn.lr_find() to locate a good learning rate, then trained for 12 epochs. This gave me a clean 80.4% accuracy, slightly better than the from-scratch model, in a fraction of the code.
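The lr_find step is a one-liner: it sweeps a range of learning rates over a short mock run and plots loss against learning rate, and I picked a value near the steepest descent. Assuming a fresh learner:

learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
learn.lr_find()  # plots loss vs. learning rate and returns a suggestion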

4. Bonus: Ensembling

To push performance higher, I trained five learners with a simple function and averaged their predictions, a technique called ensembling:

from sklearn.metrics import accuracy_score

def ensemble():
    # train one learner silently and return its predictions on the test set
    learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
    with learn.no_bar(), learn.no_logging(): learn.fit(16, lr=0.03)
    return learn.get_preds(dl=tst_dl)[0]  # tst_dl is the held-out test DataLoader

preds = [ensemble() for _ in range(5)]  # five independently trained models
avg_preds = torch.stack(preds).mean(0)  # average the predicted probabilities
accuracy_score(tst_dl.items['Outcome'], avg_preds.argmax(1))

This bumped accuracy up to a solid 81.5% on my test set.

5. Lessons learned

  1. Read the dataset docs. Missing values in disguise can break your whole pipeline.
  2. Don’t skip df.describe() and df.hist(). Always visualize!
  3. Manual coding builds intuition. I now understand every piece of backprop better.
  4. FastAI makes exploration faster. It let me focus on ideas, not syntax.
  5. Try ensembling early. It’s a low-effort way to gain 1–2 % extra accuracy.

That’s it for now! If you use these notebooks or improve on them, I’d love to hear your results. Drop a link in the comments or message me!
