10 Easy Steps to Build a Data Science App with Python


Creating a data science application is an excellent hands-on way to deepen your understanding of the field. This guide will walk you through every step, from gathering and preprocessing data to training a model and deploying predictions via an API.

In this tutorial, we'll use Python, scikit-learn, and FastAPI to develop a straightforward data science app. The app will leverage the built-in wine dataset from scikit-learn to keep the process simple and accessible. By the end, you'll have a functional API serving machine learning predictions.

Ready to dive in? Let’s begin the journey to building your first data science application!

 

Step 1: Setting Up the Environment

Make sure you have a recent version of Python installed on your system. Next, install the required libraries to build the machine learning model and set up the API for serving predictions. Here's how:

$ pip3 install fastapi uvicorn scikit-learn pandas


This command will install the necessary libraries:

  • FastAPI: To create and manage the API.
  • Uvicorn: A server to run the API.
  • scikit-learn: For building and training the machine learning model.
  • Pandas: To handle and preprocess data efficiently.
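
If you prefer to keep these dependencies isolated, you can create and activate a virtual environment before running the pip command above (commands shown for macOS/Linux; on Windows, activate with venv\Scripts\activate):

$ python3 -m venv venv
$ source venv/bin/activate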


Step 2: Loading the Dataset

We'll use the wine dataset from scikit-learn. To make data manipulation straightforward, we'll load it and convert it into a pandas DataFrame. Here's how:

# model_training.py
import pandas as pd
from sklearn.datasets import load_wine

def get_wine_data():
    # Load the wine dataset
    wine_data = load_wine()
    # Create a DataFrame with feature names
    data_frame = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
    # Add the target column to the DataFrame
    data_frame['target'] = wine_data.target
    return data_frame

 

Step 3: Exploring the Dataset

 
Before moving forward, it's a good idea to explore the dataset briefly. This will help you understand the structure of the data and identify any preprocessing needs. You can do this with a temporary main block (we'll replace it with the full training pipeline in Step 6):

# Exploring the dataset
if __name__ == "__main__":
    df = get_wine_data()

    # Display the first few rows
    print("Dataset Preview:")
    print(df.head())

    # Check for missing values
    print("\nMissing Values:")
    print(df.isnull().sum())

    # Display basic statistical details
    print("\nDataset Statistics:")
    print(df.describe())

    # Check the distribution of target classes
    print("\nTarget Class Distribution:")
    print(df['target'].value_counts())

 

This exploration provides insights into the dataset's features, the presence of any missing values, statistical summaries, and the distribution of target classes.
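
For the wine dataset specifically, you should find no missing values and a mild class imbalance. The target counts look like this (exact formatting varies with your pandas version):

Target Class Distribution:
1    71
0    59
2    48
Name: target, dtype: int64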

 

Step 4: Data Preprocessing

Next, we will preprocess the dataset by splitting it into training and test sets and scaling the features. 

The preprocess_data function handles this:

# model_training.py
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(df):
    """
    Preprocesses the dataset by splitting it into training and test sets
    and scaling the features for better model performance.

    Parameters:
    - df: pandas DataFrame, the input dataset containing features and target.

    Returns:
    - X_train_scaled: numpy array, scaled training features.
    - X_test_scaled: numpy array, scaled test features.
    - y_train: pandas Series, training target values.
    - y_test: pandas Series, test target values.
    """
    # Separate features and target variable
    X = df.drop(columns=['target'])  # Features
    y = df['target']                # Target (wine cultivar class)

    # Split dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=27
    )

    # Scale features using StandardScaler
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Persist the fitted scaler so the API can later apply the exact
    # same transformation to incoming requests
    with open('scaler.pkl', 'wb') as file:
        pickle.dump(scaler, file)

    return X_train_scaled, X_test_scaled, y_train, y_test

 

Using StandardScaler ensures all features are standardized to zero mean and unit variance, so no single feature dominates training simply because of its scale. We also save the fitted scaler to scaler.pkl; the API will need it later to transform incoming requests exactly as the training data was transformed.
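
If you're curious what the scaler actually does, here is a minimal, self-contained illustration with toy numbers:

import numpy as np
from sklearn.preprocessing import StandardScaler

toy = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler().fit(toy)

print(scaler.mean_)           # [2.] -- the column mean
print(scaler.transform(toy))  # approximately [[-1.22], [0.], [1.22]]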


Step 5: Training the Logistic Regression Model

 
Next, we will train a LogisticRegression model using the preprocessed data and save the trained model to a pickle file. The train_model function handles this process:

# model_training.py
from sklearn.linear_model import LogisticRegression
import pickle

def train_model(X_train, y_train):
    """
    Trains a LogisticRegression model on the given training data 
    and saves the trained model to a pickle file.

    Parameters:
    - X_train: numpy array, the scaled training features.
    - y_train: pandas Series or numpy array, the training target values.

    Returns:
    - model: trained LogisticRegression model.
    """
    # Initialize and train the Logistic Regression model
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)

    # Save the trained model to a pickle file
    with open('classifier.pkl', 'wb') as file:
        pickle.dump(model, file)

    return model
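
As a quick sanity check, you can reload the pickle and confirm the restored model agrees with the in-memory one. A minimal sketch, assuming it runs after train_model with X_train_scaled from Step 4 still in scope:

import pickle

with open('classifier.pkl', 'rb') as file:
    restored_model = pickle.load(file)

# Both lines should print identical class labels
print(restored_model.predict(X_train_scaled[:5]))
print(model.predict(X_train_scaled[:5]))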

 

Step 6: Evaluating the Model

 
After training the model, we evaluate its performance by calculating its accuracy on the test set. The evaluate_model function is defined as follows:

# model_training.py
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):
    """
    Evaluates the performance of a trained model by calculating its accuracy on the test set.

    Parameters:
    - model: trained model to be evaluated.
    - X_test: numpy array, scaled test features.
    - y_test: pandas Series or numpy array, true labels for the test set.

    Returns:
    - accuracy: float, the calculated accuracy of the model.
    """
    # Generate predictions on the test set
    y_pred = model.predict(X_test)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {accuracy:.2f}")
    return accuracy

if __name__ == "__main__":
    # Load and preprocess the data
    df = get_wine_data()
    X_train_scaled, X_test_scaled, y_train, y_test = preprocess_data(df)

    # Train the model
    model = train_model(X_train_scaled, y_train)

    # Evaluate the trained model
    evaluate_model(model, X_test_scaled, y_test)

 

When you run the script, the data is loaded and preprocessed, the model is trained and saved, and its performance is evaluated. The output displays:

Model Accuracy: 0.98
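
Accuracy alone can hide weaknesses on individual classes. For a fuller picture, scikit-learn's classification_report is a natural addition; a minimal sketch, assuming the variables from the main block above are in scope:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 alongside overall accuracy
y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))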

 

Step 7: Setting Up FastAPI

 
Next, we’ll create a basic FastAPI application to serve predictions using the trained model. This application will allow users to send data to the API and receive predictions from the LogisticRegression model.

# app.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def home():
    """
    Root endpoint for the API.
    Returns a welcome message.
    """
    return {"message": "Welcome to the Prediction API - Genius data science"}

 

In this step, we set up a basic FastAPI application and defined a root endpoint, giving us a minimal web server that can respond to HTTP requests.

You can run the FastAPI app by executing the following command:

uvicorn app:app --reload

 

Go to http://127.0.0.1:8000 to see the welcome message. FastAPI also generates interactive API documentation automatically; once the prediction endpoint is in place, you can explore it at http://127.0.0.1:8000/docs.

 

Step 8: Loading the Model in FastAPI

 
We will now load the pre-trained Logistic Regression model, along with the scaler fitted in Step 4, inside the FastAPI application to make predictions.

Let's define functions that load both artifacts into the FastAPI app:

# app.py
import pickle

def load_model():
    """
    Loads the pre-trained Logistic Regression model from a pickle file.

    Returns:
    - model: The loaded Logistic Regression model.
    """
    with open('classifier.pkl', 'rb') as file:
        model = pickle.load(file)
    return model

def load_scaler():
    """
    Loads the StandardScaler fitted during preprocessing.

    Returns:
    - scaler: The loaded StandardScaler.
    """
    with open('scaler.pkl', 'rb') as file:
        scaler = pickle.load(file)
    return scaler

 

With the model and scaler loaded, the app is ready to generate predictions whenever it receives a request.

 

Step 9: Creating the Prediction Endpoint

 
We will define an endpoint that accepts wine features as input and returns the predicted wine class. (A note on terminology: the scikit-learn wine dataset's target is the grape cultivar, labeled 0, 1, or 2, rather than a quality score.)
 

Define Input Data Model

We will create a prediction endpoint that accepts wine feature data in JSON format. The input data model will be defined using Pydantic to validate the incoming data.

# app.py
from pydantic import BaseModel

class WineFeatures(BaseModel):
    """
    Pydantic model to validate the incoming wine feature data.
    Each attribute represents a specific wine feature.
    """
    alcohol: float
    malic_acid: float
    ash: float
    alcalinity_of_ash: float
    magnesium: float
    total_phenols: float
    flavanoids: float
    nonflavanoid_phenols: float
    proanthocyanins: float
    color_intensity: float
    hue: float
    od280_od315_of_diluted_wines: float
    proline: float

 

Prediction Endpoint

When a request is received, the API scales the incoming features with the saved scaler and then uses the loaded model to predict the wine class.

# app.py
@app.post("/predict")
def predict_wine(features: WineFeatures):
    """
    Accepts wine feature data as input and returns the predicted wine quality class.

    Parameters:
    - features: WineFeatures, the input wine feature data validated by Pydantic.

    Returns:
    - prediction: The predicted wine quality class (integer).
    """
    model = load_model()  # Load the pre-trained model
    input_data = [[
        features.alcohol, features.malic_acid, features.ash, features.alcalinity_of_ash,
        features.magnesium, features.total_phenols, features.flavanoids,
        features.nonflavanoid_phenols, features.proanthocyanins, features.color_intensity,
        features.hue, features.od280_od315_of_diluted_wines, features.proline
    ]]
    
    prediction = model.predict(input_data)  # Predict wine quality class
    return {"prediction": int(prediction[0])}  # Return the prediction as an integer

 

Step 10: Testing the Application Locally

 
Restart the app if it isn't already running:

uvicorn app:app --reload

 

To test the application, send a POST request to the /predict endpoint with the wine feature data in JSON format. Here is an example using curl (a tool like Postman works just as well):

curl -X 'POST' \
  'http://127.0.0.1:8000/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "alcohol": 13.2,
  "malic_acid": 2.87,
  "ash": 2.35,
  "alcalinity_of_ash": 19.0,
  "magnesium": 100.0,
  "total_phenols": 2.8,
  "flavanoids": 3.06,
  "nonflavanoid_phenols": 0.3,
  "proanthocyanins": 1.2,
  "color_intensity": 5.2,
  "hue": 1.03,
  "od280_od315_of_diluted_wines": 3.0,
  "proline": 1500
}'

 

If the request is successful, you will receive a JSON response with the predicted wine class (0, 1, or 2), for example:

{"prediction":0}

 

This output indicates the predicted wine class based on the input features.
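
If you'd rather test from Python than the command line, the third-party requests library (installed separately with pip install requests) works just as well; a minimal sketch:

import requests

sample = {
    "alcohol": 13.2, "malic_acid": 2.87, "ash": 2.35,
    "alcalinity_of_ash": 19.0, "magnesium": 100.0, "total_phenols": 2.8,
    "flavanoids": 3.06, "nonflavanoid_phenols": 0.3, "proanthocyanins": 1.2,
    "color_intensity": 5.2, "hue": 1.03,
    "od280_od315_of_diluted_wines": 3.0, "proline": 1500,
}

response = requests.post("http://127.0.0.1:8000/predict", json=sample)
print(response.status_code)  # 200 on success
print(response.json())       # e.g. {"prediction": 0}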


Conclusion

 
We’ve successfully built a simple yet functional data science app! 

Here’s a recap of the process: 

- We trained a machine learning model using scikit-learn. 

- We used FastAPI to create an API that allows users to input wine feature data and receive predictions from the trained model. 

The app is now capable of handling user input, making predictions, and returning results via an HTTP interface. This foundation opens up many possibilities for enhancement: 

- Building more complex models: You can try using more advanced algorithms or deep learning models. 

- Adding additional features: You can integrate more input fields, preprocess the data in different ways, or add more model evaluation metrics. 

- Scaling: Once you're confident the app works, consider deploying it to cloud platforms for scalability and accessibility. 

With this approach, you can easily extend the functionality to fit more advanced use cases. Enjoy exploring and building on this framework!