Welcome to Aklearn's documentation!
========================================

Building a Machine Learning (ML) Library From Scratch: Aklearn
-----------------------------------------------------------------

This repository contains my Summer 2021 project: building a library for machine learning (ML) in Python, in the spirit of Tidymodels in R or scikit-learn (sklearn) in Python. As much as possible, I will implement all algorithms from scratch (no calling sklearn, except perhaps in tests). The algorithms are organized as Python classes to keep the code DRY ("don't repeat yourself"), and are supplemented by classes offering data processing, train-test splitting, cross-validation, and model evaluation.
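
As a rough illustration of the class-based organization described above, the sketch below shows one way a shared base class could keep common logic in a single place. The names ``BaseClassifier``, ``KNNClassifier``, ``fit``, ``predict``, and ``accuracy`` are hypothetical and are not taken from Aklearn's actual API; the code uses only the standard library.

.. code-block:: python

   from collections import Counter
   import math


   class BaseClassifier:
       """Hypothetical base class: stores data and shares evaluation logic."""

       def fit(self, X, y):
           # Keep the training data; subclasses may override to estimate parameters.
           self.X_train = [list(row) for row in X]
           self.y_train = list(y)
           return self

       def predict(self, X):
           raise NotImplementedError("subclasses implement predict")

       def accuracy(self, X, y):
           # Written once here and inherited by every classifier (the DRY part).
           predictions = self.predict(X)
           correct = sum(p == t for p, t in zip(predictions, y))
           return correct / len(y)


   class KNNClassifier(BaseClassifier):
       """Hypothetical k-nearest-neighbours classifier (Euclidean distance)."""

       def __init__(self, k=3):
           self.k = k

       def predict(self, X):
           predictions = []
           for row in X:
               # Sort training points by distance to the query row, keep the k closest.
               neighbours = sorted(
                   zip(self.X_train, self.y_train),
                   key=lambda pair: math.dist(pair[0], row),
               )[: self.k]
               # Majority vote among the k closest labels.
               votes = Counter(label for _, label in neighbours)
               predictions.append(votes.most_common(1)[0][0])
           return predictions


   # Toy data: two well-separated groups labelled "a" and "b".
   model = KNNClassifier(k=3).fit(
       [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]],
       ["a", "a", "a", "b", "b", "b"],
   )
   labels = model.predict([[0.05, 0.05], [5.0, 5.1]])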

My motivation for this project is to upgrade my data science and Python programming skills, to complement what I have already learned in R. In particular, I want to improve my abilities with object-oriented programming, NumPy, Numba, and general data analysis and visualization. I also want to explore the 'ecosystem' of software engineering: version control, good code organization, unit testing, extensive documentation, and usability (for others). I use Git on my computer for version control, with a two-branch (development and main) workflow, and push changes to `GitHub <https://github.com/akprasadan/aklearn>`_. Sphinx then carries my documentation over to this website, which is produced with Sphinx and Read the Docs. I also provide example usage.

I Want to Get Started
-------------------------

- See the :ref:`TOC` 
- Check out :ref:`quickstartlabel` for a brief tutorial
- Head over to `my GitHub repository <https://github.com/akprasadan/aklearn>`_

Progress So Far 
-------------------


The labels below mean the following:

- A single +: the code runs without known bugs
- A pair ++: the code runs and is reasonably tested
- A triple +++: the code runs and is comprehensively tested (agreement with sklearn is sufficient for this rating)
- Where applicable, an \* indicates that the algorithm's output agrees with sklearn

Algorithms
^^^^^^^^^^^^^^^^^

- Classification: KNN (+++, \*), Logistic (+++, \*), QDA (+++, \*), LDA (+)
- Regression: Linear (+++, \*), Poisson (+, \*), KNN (+)
- Clustering: K-Means (++)
- Model evaluation techniques: accuracy (+++, \*), confusion matrix (+++, \*), MSE (+++)
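
For readers unfamiliar with the evaluation metrics listed above, here is a minimal standard-library sketch of accuracy, a confusion matrix, and mean squared error (MSE). The function names and signatures are illustrative assumptions, not Aklearn's actual interface. The confusion matrix follows the common convention that rows index the true class and columns the predicted class.

.. code-block:: python

   def accuracy(y_true, y_pred):
       """Fraction of predictions that match the true labels."""
       return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


   def confusion_matrix(y_true, y_pred):
       """Count matrix: rows are true classes, columns are predicted classes."""
       labels = sorted(set(y_true) | set(y_pred))
       index = {label: i for i, label in enumerate(labels)}
       matrix = [[0] * len(labels) for _ in labels]
       for t, p in zip(y_true, y_pred):
           matrix[index[t]][index[p]] += 1
       return labels, matrix


   def mean_squared_error(y_true, y_pred):
       """Average squared difference between targets and predictions."""
       return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


   y_true = [0, 0, 1, 1, 1]
   y_pred = [0, 1, 1, 1, 0]
   acc = accuracy(y_true, y_pred)                 # 3 of 5 labels match
   labels, cm = confusion_matrix(y_true, y_pred)  # 2 x 2 matrix for classes 0 and 1
   mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])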

Data Engineering and Preparation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Train/test splitting (+++)
- Cross-validation folds (+++)
- Data Standardization (+++, \*)
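
The three preparation steps above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not Aklearn's code: the function names, the ``test_fraction`` and ``seed`` parameters, and the choice of contiguous folds are all hypothetical. Standardization here divides by the population standard deviation (dividing by n), which is also an assumption about the intended convention.

.. code-block:: python

   import random
   import statistics


   def train_test_split(n_samples, test_fraction=0.25, seed=0):
       """Return (train_indices, test_indices) after a seeded shuffle."""
       indices = list(range(n_samples))
       random.Random(seed).shuffle(indices)
       n_test = int(round(n_samples * test_fraction))
       return indices[n_test:], indices[:n_test]


   def k_fold_indices(n_samples, k):
       """Yield (train, validation) index lists for k contiguous folds."""
       # Spread any remainder over the first folds so sizes differ by at most one.
       fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
       start = 0
       for size in fold_sizes:
           validation = list(range(start, start + size))
           train = [i for i in range(n_samples) if i < start or i >= start + size]
           yield train, validation
           start += size


   def standardize(column):
       """Center a column to mean 0 and scale it to unit (population) variance."""
       mean = statistics.fmean(column)
       sd = statistics.pstdev(column)
       return [(x - mean) / sd for x in column]


   train, test = train_test_split(10, test_fraction=0.2)
   folds = list(k_fold_indices(10, 3))
   z = standardize([1.0, 2.0, 3.0])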

Workflow
^^^^^^^^^^^^^^^^^^

- Consistent use of Git/GitHub for version control and website generation.
- Increased familiarity with virtual environments and with organizing Python files into packages (``setup.py``, ``requirements.txt``, ``__init__.py``, et cetera).
- Autogenerated documentation: I write documentation in my code and include text files with additional content. I then run four commands (one to build the HTML website with Sphinx, plus three Git commands), and this website automatically collects all of the documentation into its current form.

To Do
------

By mid-July 2021, I would like to add:

- Linear discriminant analysis (re-using my QDA code) (easy)
- LASSO or other penalized estimators (harder)
- Bootstrapping functionality (easy-ish)
- An abstract tuning class, incorporated into child classes
- More exposition and examples on the website, for (hypothetical) users

Eventually...

- Classification trees, bagging, random forests (the latter two are easy once I do the first)
- Boosting
- Support vector machines
- Stacking functionality (fairly easy)
- A neural network implementation (hard)
- Awesome docs!

.. _TOC:

Table of Contents
===================

.. toctree::

   quickguide.md
   regression.rst
   classification.rst
   clustering.rst
   evaluation_metrics.rst
   preprocessing.rst

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`