TranDenyDFW/README.md

Howdy! 👋
I'm Deny.

LinkedInBadge    MicrosoftLearningBadge    HuggingFaceBadge    KaggleBadge

About Me

I’m Deny Tran, a data professional with a passion for solving complex problems and continuous learning. My career path has been driven by curiosity and a desire to excel, from mastering Python and SQL to navigating the intricacies of machine learning.


Self-Concocted SQL-/ETL-/ML-/Analyst-Related Stuff

fsnds_graph
  • Based on one of the many parsed SEC ABS-EE (Asset-Backed Securities - Electronic Exhibits) XML datasets.
  • The purpose of the illustration was, more or less, to:
    • test the parsed datasets,
    • practice SQL,
    • work with ML models/libraries, and
    • work with PySpark.
  • Some key takeaways:
    • PySpark is very good at working with large amounts of data, but getting it to use the GPU is a different story.
      • It's nice that PySpark ships with ML options (MLlib), but for serious model training you'll likely have to move to PyTorch or TensorFlow.
    • DuckDB is amazing. It handled the 35M+ rows here easily, and before that it consumed and queried the 550M+ rows of FSNDS data without trouble.
      • That said, DuckDB works best when you let it handle indexing itself. For example, declaring all of the FSNDS primary/foreign keys up front made it unresponsive during data inserts.
      • So it's a good quick go-to for big datasets, but less so if you need full constraint enforcement.
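Parsing the ABS-EE exhibits boils down to flattening repeated XML elements into rows. A minimal stdlib sketch of that step is below; the tag names (`asset`, `assetNumber`, `originalLoanAmount`) are illustrative placeholders, not the actual ABS-EE schema.

```python
# Sketch: flatten an ABS-EE-style XML exhibit into row dicts using only
# the standard library. Tag names here are hypothetical placeholders.
import xml.etree.ElementTree as ET

sample = """
<assetData>
  <asset><assetNumber>1</assetNumber><originalLoanAmount>25000.00</originalLoanAmount></asset>
  <asset><assetNumber>2</assetNumber><originalLoanAmount>18750.50</originalLoanAmount></asset>
</assetData>
"""

def parse_assets(xml_text):
    """Flatten each <asset> element into a plain dict (tag -> text)."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in asset}
        for asset in root.iter("asset")
    ]

rows = parse_assets(sample)
print(rows[0])  # {'assetNumber': '1', 'originalLoanAmount': '25000.00'}
```

For the real multi-gigabyte exhibits, the same idea applies with `ET.iterparse` so the whole tree never sits in memory at once.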
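The DuckDB takeaway above can be sketched as follows: bulk-load with a plain `CREATE TABLE ... AS SELECT` and let DuckDB's automatic min-max (zonemap) indexes do the work, rather than declaring keys before a huge insert. The table and column names here are toy stand-ins for the FSNDS extracts, not the real schema.

```python
# Sketch: let DuckDB index for itself. Table/column names are hypothetical.
import duckdb

con = duckdb.connect()  # in-memory; pass a file path to persist
# Toy stand-in for a large FSNDS table; in practice this would be
# read_csv / read_parquet over the exported files.
con.execute("""
    CREATE TABLE num AS
    SELECT range AS id, range % 100 AS cik
    FROM range(1000000)
""")
# No declared primary/foreign keys: filters stay fast via DuckDB's
# automatic zonemaps, with no per-row index upkeep during the insert.
matches = con.execute(
    "SELECT count(*) FROM num WHERE cik = 42"
).fetchone()[0]
print(matches)  # 10000
```

If you do need a key, the less painful pattern is to declare it on an empty table and insert afterward in large batches, accepting the slower load.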

  • This was an extraction of the FSNDS to test the quality of the data after it was exported from SQL Server.

  • Similarly, an extraction of the FSNDS, but used here to create training data for a tabular language model (TaBERT).

  • A financial model I created back in 2017 from scraped data.
  • The way the data is stored is somewhat naive, but the project as a whole was nonetheless enlightening.

operating_model


cmsdata

Popular repositories

  1. SEC_Full_Index_To_Docker_SQL_Server (Public archive) · Python

  2. SEC_Financial_Statement_Data_Sets_To_Docker_MySQL (Public archive) · Python

  3. SEC_Financial_Statement_Data_Set_To_SQLite3 (Public archive) · Python

  4. haystack (Public, forked from deepset-ai/haystack) · Python

     🔍 LLM orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your d…

  5. chatgpt-retrieval-plugin (Public, forked from openai/chatgpt-retrieval-plugin) · Python

     The ChatGPT Retrieval Plugin lets you easily find personal or work documents by asking questions in natural language.

  6. openai-cookbook (Public, forked from openai/openai-cookbook) · MDX

     Examples and guides for using the OpenAI API