Welcome to Aklearn's documentation!
========================================

Building a Machine Learning (ML) Library From Scratch: Aklearn
-----------------------------------------------------------------

This repository contains my Summer 2021 project: building a library for machine learning (ML) in Python, in the spirit of Tidymodels in R or Sklearn in Python. As much as possible, I will implement all algorithms from scratch (no calling sklearn, except maybe for tests). The algorithms will be organized using Python classes to keep the code D-R-Y, and will be supplemented by classes offering data processing, train-test splitting, cross-validation, and model evaluation functionality.

My motivation for this project is to upgrade my skills in data science and Python programming, to complement what I have already learned in R. I particularly want to improve my abilities with object-oriented programming, numpy, numba, and general data analysis/visualization.

In addition, I want to explore the 'ecosystem' of software engineering: version control, good code organization, unit testing, extensive documentation, and usability (for others). I will use Git on my computer for version control, with a two-branch (development and main) workflow, and push changes to `Github `_. Sphinx then automatically carries my documentation over to this website, which is produced by Sphinx and Read the Docs. Additionally, I provide example usage.

I Want to Get Started
-------------------------

- See the :ref:`TOC`
- Check out :ref:`quickstartlabel` for a brief tutorial
- Head over to `my Github repository `_.

Progress So Far
-------------------

The labels below mean the following:

- A single +: code runs without bugs
- A pair of ++: code runs and is reasonably tested
- A triple +++: code runs and is comprehensively tested. Agreement with sklearn is sufficient for this.
- If applicable, an \* indicates the algorithm agrees with sklearn

Algorithms
^^^^^^^^^^^^^^^^^

- Classification: KNN (+++, \*), Logistic (+++, \*), QDA (+++, \*), LDA (+)
- Regression: Linear (+++, \*), Poisson (+, \*), KNN (+)
- Clustering: K-Means (++)
- Model evaluation techniques: accuracy (+++, \*), confusion matrix (+++, \*), MSE (+++)

Data Engineering and Preparation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Train/test splitting (+++)
- Cross-validation folds (+++)
- Data standardization (+++, \*)

Workflow
^^^^^^^^^^^^^^^^^^

- Consistent use of Git/GitHub for version control and website generation.
- Increased familiarity with 'virtual environments' and organizing Python files into packages (``setup.py``, ``requirements.txt``, ``__init__.py``, et cetera).
- Autogenerated documentation: I write documentation in my code and include text files with additional content. I then run four commands (one to build the HTML website with Sphinx, plus three Git commands), and this website automatically collects all of the documentation into this elegant form.

To Do
------

By mid-July 2021, I would like to have:

- Linear discriminant analysis (re-using my QDA code) (easy)
- LASSO or other penalized estimators (harder)
- Bootstrapping functionality (easy-ish)
- Abstract tuning class and incorporation into child classes
- More exposition and examples on the website, for (hypothetical) users

Eventually...

- Classification trees, bagging, random forests (the latter two are easy once I do the first)
- Boosting
- Support vector machines
- Stacking functionality (fairly easy)
- A neural network implementation (hard)
- Awesome docs!

.. _TOC:

Table of Contents
===================

.. toctree::

   quickguide.md
   regression.rst
   classification.rst
   clustering.rst
   evaluation_metrics.rst
   preprocessing.rst

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`