Building a model from scratch: Predicting Diabetes

Last updated: 19 May 2025 · 14 min read

Over the past few weeks, I’ve been working through the fantastic Practical Deep Learning course with FastAI, and after completing the Titanic dataset example in the course, I wanted to try the same principles on a custom dataset: Pima Indians Diabetes. As we did in the course, I decided to do it twice:

  • Once with pure PyTorch from scratch, to really understand how gradients flow.
  • Then again with FastAI’s Tabular Learner, to see how much boilerplate I could avoid.

You can open and follow the actual code in the accompanying Kaggle notebooks.

1. Dataset & the missing-values rabbit hole

This part took longer than expected. The dataset represents some missing values as 0s in columns like Glucose, BloodPressure, and BMI, but I couldn’t find this mentioned anywhere in the dataset documentation on Kaggle, so the following lines of code took me days to figure out:

import numpy as np

# Zeros in these columns are physiologically impossible, so treat them as missing
missing_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[missing_cols] = df[missing_cols].replace(0, np.nan)
df = df.fillna(df.median())  # impute each column with its median

Without that, my first few models were stuck at only ~62% accuracy. Once I filled missing values correctly, things improved fast.
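In hindsight, a quick frequency check would have surfaced this immediately. A minimal sketch, assuming df is the freshly loaded DataFrame (the filename is the standard one from Kaggle):

import pandas as pd

df = pd.read_csv('diabetes.csv')
print((df == 0).sum())  # hundreds of zero Insulin/SkinThickness rows are a giveaway
print(df.describe())    # a min of 0 for Glucose or BMI is physiologically impossible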

2. Training a model from scratch in PyTorch

I began by loading the data manually and calculating predictions with a hand-built function. Here’s a peek:

# `bias` is a module-level tensor; predictions are a weighted sum of the inputs plus bias
def calc_preds(coeffs, indeps): return ((indeps * coeffs) + bias).sum(axis=1)
# loss is the mean absolute error between predictions and targets
def calc_loss(coeffs, indeps, deps): return torch.abs(calc_preds(coeffs, indeps) - deps).mean()

The training loop was completely manual—just basic Python and tensor ops:

def train_model(epochs=20, lr=0.1):
    coeffs = init_coeffs()  # randomly initialized coefficients with requires_grad enabled
    for i in range(epochs): one_epoch(coeffs, lr)
    return coeffs

Here is what one_epoch looked like in my example:

def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)  # trn_indep/trn_dep are the training tensors
    loss.backward()
    with torch.no_grad(): update_coeffs(coeffs, lr)  # step without tracking gradients
    print(f"{loss:.3f}", end="; ")

I upgraded from a linear model to a single hidden-layer MLP, then to two layers. Eventually, I hit 79.7% accuracy on validation.
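The upgrade mostly meant swapping the single coefficient vector for weight matrices with a non-linearity in between. A sketch of the hidden-layer version of calc_preds, along the lines of the course’s (here coeffs is a tuple of two weight matrices and a constant):

import torch.nn.functional as F

def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs
    res = F.relu(indeps @ l1)  # hidden layer with ReLU activation
    res = res @ l2 + const     # linear output layer
    return torch.sigmoid(res)  # squash to a probability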

3. FastAI in ~four lines

Next, I tried FastAI’s TabularPandas API. Here’s how little code it took to preprocess, split, and train the model:

to = TabularPandas(
    df,
    procs=[FillMissing, Normalize, Categorify],  # impute, standardize, encode
    cat_names=[],              # every feature here is continuous
    cont_names=cont_names,
    y_names='Outcome',
    y_block=CategoryBlock(),   # treat the target as a classification label
    splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
learn.fit(12, lr=1e-2)
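For reference, cont_names and splits were defined earlier in the notebook; something along these lines (the 80/20 split and seed here are illustrative assumptions):

from fastai.tabular.all import *

cont_names = [c for c in df.columns if c != 'Outcome']  # all eight features are continuous
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))  # illustrative split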

I used learn.lr_find() to locate a good learning rate, then trained for 12 epochs. This gave me a clean 80.4% accuracy, slightly better than the from-scratch model, in a fraction of the code.
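The lr_find step is a one-liner: it sweeps a range of learning rates over a short mock run and plots loss against learning rate, and I picked a value near the steepest descent. Assuming a fresh learner:

learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
learn.lr_find()  # plots loss vs. learning rate and returns a suggestion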

4. Bonus: Ensembling

To push performance higher, I trained five learners with a simple function and averaged their predictions, a technique called ensembling:

from sklearn.metrics import accuracy_score

def ensemble():
    # train one learner silently and return its predictions on the test set
    learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
    with learn.no_bar(), learn.no_logging(): learn.fit(16, lr=0.03)
    return learn.get_preds(dl=tst_dl)[0]  # tst_dl is the held-out test DataLoader

preds = [ensemble() for _ in range(5)]  # five independently trained models
avg_preds = torch.stack(preds).mean(0)  # average the predicted probabilities
accuracy_score(tst_dl.items['Outcome'], avg_preds.argmax(1))

This bumped accuracy up to a solid 81.5% on my test set.

5. Lessons learned

  1. Read the dataset docs. Missing values in disguise can break your whole pipeline.
  2. Don’t skip df.describe() and df.hist(). Always visualize!
  3. Manual coding builds intuition. I now understand every piece of backprop better.
  4. FastAI makes exploration faster. It let me focus on ideas, not syntax.
  5. Try ensembling early. It’s a low-effort way to gain 1–2 % extra accuracy.

That’s it for now! If you use these notebooks or improve on them, I’d love to hear your results. Drop a link in the comments or message me!
