AI Software Engineering Week 3 Assignment

Author: George Wanjohi

Objective: To implement and deploy machine learning models for digit classification and sentiment analysis, and to reflect on ethical AI considerations.

Part 1 — Theory

Q1: TensorFlow vs PyTorch

TensorFlow vs PyTorch — key differences

  • Execution model: PyTorch uses dynamic (eager) execution, which is Pythonic and easy to debug. TensorFlow historically used static graphs, but TensorFlow 2.x defaults to eager execution and behaves much like PyTorch (see the short sketch after this list).
  • Use cases: PyTorch is commonly used for research and fast prototyping due to its intuitive API. TensorFlow is well-suited for production and deployment (TF Serving, TF Lite, TF Hub).
  • Ecosystem & tooling: TensorFlow has a broad production ecosystem; PyTorch has strong research adoption and growing production tools (TorchServe).
  • Rule of thumb: choose PyTorch for experimentation/research; choose TensorFlow when you need mature production tooling or specific TF integrations.
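
To make the execution-model contrast concrete, here is a minimal sketch (not taken from the assignment code; it assumes both torch and tensorflow are installed) showing that operations run eagerly in both frameworks:

import torch
import tensorflow as tf

x_torch = torch.tensor([1.0, 2.0, 3.0])
print(x_torch * 2)   # tensor([2., 4., 6.]), evaluated immediately

x_tf = tf.constant([1.0, 2.0, 3.0])
print(x_tf * 2)      # tf.Tensor([2. 4. 6.], ...), evaluated immediately under TF 2.x eager mode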

Q2: Two use cases for Jupyter Notebooks

  1. Interactive prototyping and experiments: run small code blocks iteratively, inspect outputs and tweak models without running a full script.
  2. Reproducible reports and visualizations: combine narrative, code, plots, and results in one document for sharing and teaching.

Q3: How spaCy improves NLP vs basic string ops

  • Tokenization & linguistics: spaCy provides robust, language-aware tokenization, POS tagging, and dependency parsing; basic string ops cannot reliably split or normalize text.
  • Pretrained models & NER: spaCy includes pretrained pipelines for Named Entity Recognition (PRODUCT, ORG, PERSON), which work across varied text; string searches are brittle and miss variations.
  • Pipeline & extensibility: spaCy offers matchers and rule-based add-ons (EntityRuler) to incorporate custom patterns without reinventing low-level parsing.
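
As a brief illustration of these points, the sketch below runs the pretrained pipeline and adds a custom EntityRuler pattern. It assumes the en_core_web_sm model has been downloaded; the product name is a hypothetical example, not from the assignment data.

import spacy

# Assumes: python -m spacy download en_core_web_sm has been run
nlp = spacy.load("en_core_web_sm")

# Add a rule-based pattern before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "PRODUCT", "pattern": "Echo Dot"}])  # hypothetical product name

doc = nlp("I bought an Echo Dot from Amazon and Sarah loved it.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Echo Dot PRODUCT, Amazon ORG, Sarah PERSON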

Comparative Table: Scikit-learn vs TensorFlow

  • Target applications: Scikit-learn covers classical ML (SVMs, decision trees, clustering, preprocessing); TensorFlow targets deep learning (neural networks, CNNs, RNNs) and production ML pipelines.
  • Ease for beginners: Scikit-learn is very beginner-friendly with a consistent API; TensorFlow has a steeper learning curve (gentler in TF 2.x) and more concepts to learn.
  • Community & ecosystem: Scikit-learn is mature for classical ML with many utilities; TensorFlow has a massive, Google-backed ecosystem with strong production tooling and model-serving options.
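
To illustrate the API contrast summarized above, here is a hedged side-by-side sketch (not from the assignment notebooks) of fitting a small classifier in each library:

# Scikit-learn: one consistent fit/predict API for classical models
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200).fit(X, y)
print(clf.predict(X[:3]))

# TensorFlow/Keras: the model, loss, and training loop are configured explicitly
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)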

Part 2 — Practical

Task 1: Iris Flower Classification

Description: The Iris dataset contains 150 samples of iris flowers with sepal and petal measurements. Preprocessing involves standardizing features for better model performance.

Critical Code Snippets

# Load data (the report does not show this step; load_iris is the standard source)
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

# Preprocessing: standardize features to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Training: hold out 20% of the data for testing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression  # stand-in estimator; the notebook's exact model may differ
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
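
For reference, the classification report shown in the screenshot below is typically produced with code along these lines (continuing from the training snippet above; the notebook's exact evaluation cell may differ):

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 on the held-out test split
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=load_iris().target_names))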

Classification Report Screenshot

Iris classification report

Task 2: MNIST Digit Recognition

Description: The CNN architecture consists of Conv2D layers with ReLU activations, MaxPooling layers, a Flatten layer, and Dense layers with Dropout.

Training details: 15 epochs, batch size 128, final test accuracy 99.42% with data augmentation.
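
A minimal Keras sketch of an architecture matching this description is shown below; the filter counts and layer sizes are illustrative assumptions, and the exact configuration is in the notebook:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training used 15 epochs and batch size 128 (per the details above):
# model.fit(x_train, y_train, epochs=15, batch_size=128, validation_data=(x_test, y_test))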

Training Accuracy/Loss Graph

MNIST training accuracy graph

5 Sample Predictions

MNIST sample predictions

Saved model: practical/tensorflow/mnist_cnn_improved_model.h5
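
The saved model can be reloaded for inference, for example:

import tensorflow as tf

# Load the trained model from the path given above
model = tf.keras.models.load_model("practical/tensorflow/mnist_cnn_improved_model.h5")
model.summary()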

Task 3: spaCy Sentiment and NER

Description: Sample reviews from Amazon dataset processed with spaCy for named entity recognition and sentiment analysis.
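
The report does not show the exact sentiment logic, so the sketch below is a minimal rule-based illustration in the spirit of the task; the word lists and review text are assumptions, not assignment data:

import spacy

nlp = spacy.load("en_core_web_sm")

POSITIVE = {"great", "love", "excellent", "amazing"}   # illustrative word lists
NEGATIVE = {"poor", "broken", "terrible", "waste"}

review = "I love my new Kindle from Amazon, the battery life is excellent."
doc = nlp(review)

# Named entities (brands, products, people) detected by the pretrained pipeline
print([(ent.text, ent.label_) for ent in doc.ents])

# Simple rule-based sentiment: compare counts of positive vs negative lemmas
tokens = {t.lemma_.lower() for t in doc}
score = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
print("positive" if score > 0 else "negative" if score < 0 else "neutral")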

Entity Extraction Table

spaCy entities table

displaCy NER Visualization

spaCy NER visualization

Part 3 — Ethics & Optimization

Reflection on Bias & Debugging

In MNIST, digit recognition may be biased toward Western-style handwriting, potentially misclassifying digits written in other styles or scripts. Amazon reviews can carry skewed sentiment because of demographic imbalances among review authors. Ethical considerations include ensuring fairness and transparency and mitigating harm from biased predictions.

Mitigations: Use TensorFlow Fairness Indicators for bias auditing, apply dataset augmentation for underrepresented data, implement better annotation rules, and conduct regular audits.

Debugging Section: Code Fix

This section addresses a technical bug encountered during the Gradio app development.

Buggy Code Snippet (from `index.py`):

image = tf.image.resize(image, (28, 28)) # Wrong: passes dict directly

Fixed Code Snippet (from `index.py`):

composite_image = image['composite'] # Properly extract image array
# ... further processing ...

Explanation: The original code attempted to directly resize the dictionary output from Gradio's Sketchpad component, leading to a `ValueError`. The fix involves correctly extracting the `composite` image array from the dictionary before performing TensorFlow's resize operation. This ensures the model receives a valid image tensor.
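
A hedged sketch of the corrected preprocessing path is shown below; the exact steps in `index.py` may differ, and the RGBA-to-grayscale conversion and normalization are assumptions:

import tensorflow as tf

def preprocess(image):
    composite = image["composite"]                       # Sketchpad returns a dict; take the drawn image
    tensor = tf.convert_to_tensor(composite, dtype=tf.float32)
    tensor = tf.image.rgb_to_grayscale(tensor[..., :3])  # drop the alpha channel, keep one channel
    tensor = tf.image.resize(tensor, (28, 28)) / 255.0   # resize to MNIST input size and normalize
    return tf.expand_dims(tensor, 0)                     # add a batch dimension: (1, 28, 28, 1)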

Screenshot showing corrected app behavior after code fix

Debugging for Bias: MNIST Model

The MNIST dataset, while foundational, primarily represents a limited range of handwriting styles. This can lead to representation bias, where the model performs poorly on digits that deviate from its training distribution.

For instance, an earlier version of our model (without augmentation) might misclassify a thinly drawn '1' as a '7', or a '4' with a closed top as a '9'. This is because it hadn't seen enough variations during training.

Example of Potential Bias (Challenging Input):

Here, we illustrate a digit that our model, even with initial improvements, might struggle with due to subtle variations in handwriting. For instance, a **sketchy '9' was misclassified as a '7' with 98% confidence**. This highlights how specific drawing styles can still fall outside the model's learned distribution.

Hand-drawn sketchy '9' misclassified as '7'

(This sketchy '9' was predicted as '7' with 98% confidence.)

Mitigation through Data Augmentation:

To address such issues and improve robustness, we implemented extensive data augmentation in our `mnist_cnn.ipynb` notebook. Techniques like random rotations, shifts, and zooms artificially expand the training dataset, exposing the model to a wider variety of digit appearances, including more "sketchy" or unusual styles.

This makes the model more robust and less prone to misclassifications based on minor stylistic differences, by effectively teaching it to generalize better across diverse inputs.
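
A short sketch of the kind of augmentation described above (the exact parameters used in `mnist_cnn.ipynb` may differ):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, and zooms, as described above
datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)
# Train on batches of augmented images instead of the raw training set
# model.fit(datagen.flow(x_train, y_train, batch_size=128), epochs=15)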

Result with Further Improved Model (Conceptual):

While our current model already includes augmentation, continuous improvement would involve refining these techniques or adding more diverse real-world examples. A further improved model would correctly classify this sketchy '9' as '9'.

Hand-drawn sketchy '9' with correct prediction from a hypothetically further improved model

This demonstrates how proactive data strategies can debug and mitigate biases, leading to a more fair and reliable AI system.

Bonus — Deployment

To run mnist_app_gradio.py: Navigate to bonus/ and execute python index.py. The Gradio interface launches at http://127.0.0.1:7860 for live digit classification.
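
For context, a minimal sketch of the Gradio wiring is shown below; component choices and parameters are illustrative assumptions, and the actual index.py may differ. The `preprocess` and `model` names refer to the sketches in the sections above.

import gradio as gr

def classify(image):
    # Run the preprocessed sketch through the trained CNN and return class probabilities
    probs = model.predict(preprocess(image))[0]
    return {str(i): float(p) for i, p in enumerate(probs)}

demo = gr.Interface(
    fn=classify,
    inputs=gr.Sketchpad(),
    outputs=gr.Label(num_top_classes=3),
    title="MNIST Digit Recognition",
)
demo.launch()   # serves at http://127.0.0.1:7860 by default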

Gradio app demo

Conclusion