Howdy! 👋
I'm Deny.
I’m Deny Tran, a data professional with a passion for solving complex problems and continuous learning. My career path has been driven by curiosity and a desire to excel, from mastering Python and SQL to navigating the intricacies of machine learning.
- Built from the SEC's Financial Statements & Notes .tsv datasets.
- Environment: Docker on Ubuntu with SQL Server 2019 (a rough loading sketch follows this list).
- Joins/keys were based on guidance in the FSNDS notes, but the final relationships came largely from trial and error, given the quality of the dataset.
- Configurations were based on general SQL best practices, including bits and pieces of ideas from the following fantastic works:
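To make the load step concrete, here is a minimal sketch of pushing one of the FSNDS .tsv files into the Dockerized SQL Server instance. The connection string, credentials, and chunk size are placeholders, not the repo's actual configuration, and the real project tightens types and keys afterwards in T-SQL.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for the Dockerized SQL Server 2019 instance;
# host, port, database name, and credentials are assumptions, not the real setup.
engine = create_engine(
    "mssql+pyodbc://sa:YourStrong%40Passw0rd@localhost:1433/fsnds"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# sub.tsv is the FSNDS submissions file; chunked reads keep memory manageable
# for the larger files (num.tsv, txt.tsv). Everything lands as text first, and
# types/keys are tightened later in T-SQL, which is where the trial and error happened.
for chunk in pd.read_csv("sub.tsv", sep="\t", dtype=str, chunksize=100_000):
    chunk.to_sql("sub", engine, if_exists="append", index=False)
```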
- Based on one of the many parsed SEC ABS-EE (Asset-Backed Securities - Electronic Exhibits) XML datasets.
- The purpose of this illustration was, more or less:
  - testing the parsed datasets,
  - practicing SQL,
  - working with ML models/libraries, and
  - working with PySpark.
- Some key takeaways here were:
  - PySpark is very good at working with large amounts of data, but getting it to use the GPU is a different story (see the sketch after this list).
  - It's nice that PySpark has ML options built in (MLlib), but for serious work you'll likely have to reach for PyTorch or TensorFlow.
  - DuckDB is amazing. It handled the 35M+ rows here with ease, and before that it consumed and queried the 550M+ rows from the FSNDS just as easily.
  - That said, DuckDB only works well if you let it handle the indexing itself. Try to add all the FSNDS primary/foreign keys, for example, and it becomes unresponsive during inserts (see the DuckDB sketch after this list).
  - So, a good quick go-to for big datasets, but not so much if you need enforced constraints or the rest of a full RDBMS on top of them.
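A minimal, CPU-only sketch of the kind of PySpark work described above. The file and column names are assumptions about the parsed ABS-EE layout, and getting any of this onto the GPU would mean adding something like NVIDIA's RAPIDS Accelerator plus a fair amount of extra configuration.

```python
from pyspark.sql import SparkSession

# Local, CPU-only session: Spark won't touch the GPU unless an accelerator
# plugin (e.g., the RAPIDS Accelerator) is installed and configured, which is
# the pain point mentioned above.
spark = SparkSession.builder.appName("absee-exploration").getOrCreate()

# File and column names here are illustrative, not the project's actual schema.
loans = spark.read.csv("absee_loans.tsv", sep="\t", header=True, inferSchema=True)

# PySpark is comfortable at this scale: a full-table aggregation over tens of
# millions of rows runs without ceremony.
loans.groupBy("assetTypeNumber").count().orderBy("count", ascending=False).show(10)

spark.stop()
```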
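And the DuckDB counterpart: point it at the raw file, let it manage its own storage, and skip the explicit key constraints. Again, the file and column names are assumptions.

```python
import duckdb

# On-disk database; no PRIMARY/FOREIGN KEY constraints are declared, since
# letting DuckDB handle its own indexing is what kept inserts from hanging.
con = duckdb.connect("absee.duckdb")

# read_csv infers the schema straight from the TSV; the CTAS materializes it.
con.execute("""
    CREATE TABLE IF NOT EXISTS loans AS
    SELECT * FROM read_csv('absee_loans.tsv', delim='\t', header=true)
""")

# Quick sanity checks over the 35M+ rows.
print(con.execute("SELECT COUNT(*) FROM loans").fetchone())
print(con.execute("""
    SELECT assetTypeNumber, COUNT(*) AS n
    FROM loans
    GROUP BY assetTypeNumber
    ORDER BY n DESC
    LIMIT 10
""").fetchall())

con.close()
```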
- This was an extraction of the FSNDS to test the quality of the data after it was exported from SQL Server.
- Similarly, an extraction of the FSNDS, but used here to create training data for a tabular LLM (i.e., TaBERT).
- A financial model I created back in 2017 from scraped data.
- The way the data is stored is somewhat naive, but the project as a whole was nonetheless enlightening.