Last updated: 19 May 2025
Over the past few weeks, I’ve been working through the fantastic Practical Deep Learning course with FastAI, and after completing the Titanic dataset example in the course, I wanted to try the same principles on a custom dataset: Pima Indians Diabetes. As in the course, I decided to do it twice:
- Once with pure PyTorch from scratch, to really understand how gradients flow.
- Then again with FastAI’s Tabular Learner, to see how much boilerplate I could avoid.
You can open and follow the actual code in these Kaggle notebooks:
1. Dataset & the missing-values rabbit hole
This part took longer than expected. The dataset represents some missing values as 0s in columns like Glucose, BloodPressure, and BMI, but I couldn’t find that mentioned in the dataset documentation on Kaggle, so the following lines of code took me days to figure out:
import numpy as np

# These columns use 0 as a stand-in for missing values: convert to NaN, then impute with the median
missing_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df[missing_cols] = df[missing_cols].replace(0, np.nan)
df = df.fillna(df.median())
Without that, my first few models were stuck at only ~62% accuracy. Once I filled missing values correctly, things improved fast.
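In hindsight, a quick summary of the suspect columns would have surfaced the problem immediately, since a minimum of 0 for Glucose or BMI isn’t physiologically plausible. A small check along these lines (run on the raw dataframe, before the replacement above) would have done it:

# A minimum of 0 in these columns is the giveaway that 0 really means "missing".
print(df[missing_cols].describe().loc[['min','mean']])
print((df[missing_cols] == 0).sum())   # count of disguised missing values per column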
2. Training a model from scratch in PyTorch
I began by loading the data manually and calculating predictions with a hand-built function. Here’s a peek:
# Linear model: weighted sum of the inputs plus a bias term (bias is a global tensor)
def calc_preds(coeffs, indeps): return ((indeps*coeffs) + bias).sum(axis=1)
# Mean absolute error between predictions and targets
def calc_loss(coeffs, indeps, deps): return torch.abs(calc_preds(coeffs, indeps) - deps).mean()
The training loop was completely manual—just basic Python and tensor ops:
def train_model(epochs=20, lr=0.1):
    coeffs = init_coeffs()
    for i in range(epochs): one_epoch(coeffs, lr)
    return coeffs
Here is what one_epoch looked like in my deep learning example:
def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad(): update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")
I upgraded from a linear model to a single hidden-layer MLP, then to two layers. Eventually, I hit 79.7% accuracy on validation.
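I won’t reproduce the whole notebook here, but the core change is that calc_preds turns into a chain of matrix multiplies with a ReLU between layers. A rough sketch, assuming coeffs becomes a list of (weight, bias) pairs and the layer sizes are illustrative:

import torch
import torch.nn.functional as F

# Sketch of the deeper forward pass; coeffs is assumed to be a list of (weight, bias) pairs.
def calc_preds_deep(coeffs, indeps):
    res = indeps
    for i, (w, b) in enumerate(coeffs):
        res = res @ w + b
        if i < len(coeffs) - 1:          # ReLU on the hidden layers only
            res = F.relu(res)
    return torch.sigmoid(res.squeeze())  # squash the final output into (0, 1)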
3. FastAI in ~four lines
Next, I tried FastAI’s TabularPandas API. Here’s how little code it took to preprocess, split, and train the model:
to = TabularPandas(
    df,
    procs=[FillMissing, Normalize, Categorify],
    cat_names=[],
    cont_names=cont_names,
    y_names='Outcome',
    y_block=CategoryBlock(),
    splits=splits)
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
learn.fit(12, lr=1e-2)
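For reference, cont_names and splits aren’t defined in the snippet above. Assuming the standard Pima column names and a random 80/20 split, they could be built roughly like this:

from fastai.tabular.all import *

# Every predictor in this dataset is continuous, so cat_names stays empty.
cont_names = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
              'Insulin','BMI','DiabetesPedigreeFunction','Age']

# Random 80/20 train/validation split over the row indices.
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(df))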
I used learn.lr_find() to locate a good learning rate, then trained for 12 epochs. This gave me a clean 80.4% accuracy, slightly better than the from-scratch model and in a fraction of the code.
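The learning-rate search itself is a one-liner; the .valley attribute below assumes a recent fastai 2.x release (older versions return a different tuple of suggestions):

# Run the LR range test and reuse the suggested value for training.
suggestion = learn.lr_find()
learn.fit(12, lr=suggestion.valley)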
4. Bonus: Ensembling
To push performance higher, I trained five learners with a simple function and averaged their predictions, a technique known as ensembling:
from sklearn.metrics import accuracy_score

def ensemble():
    learn = tabular_learner(dls, metrics=accuracy, layers=[10,10])
    with learn.no_bar(), learn.no_logging(): learn.fit(16, lr=0.03)
    return learn.get_preds(dl=tst_dl)[0]

# Train five learners, average their predicted probabilities, then score the ensemble
learns = [ensemble() for _ in range(5)]
avg_preds = torch.stack(learns).mean(0)
accuracy_score(tst_dl.items['Outcome'], avg_preds.argmax(1))
This bumped accuracy up to a solid 81.5% on my test set, the best result of the three approaches.
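One note on tst_dl: it isn’t defined in the snippet above. I’m assuming a labelled test DataLoader built from a held-out frame (tst_df here is a hypothetical name), roughly like this:

# tst_df is assumed to be a held-out slice of the data that still contains 'Outcome'.
tst_dl = dls.test_dl(tst_df, with_labels=True)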
5. Lessons learned
- Read the dataset docs. Missing values in disguise can break your whole pipeline.
- Don’t skip df.describe() and df.hist(). Always visualize!
- Manual coding builds intuition. I now understand every piece of backprop better.
- FastAI makes exploration faster. It let me focus on ideas, not syntax.
- Try ensembling early. It’s a low-effort way to gain 1–2 % extra accuracy.
That’s it for now! If you use these notebooks or improve on them, I’d love to hear your results. Drop a link in the comments or message me!