Learn/'24_Fall_(EE599) DataScience

(Final Project) Machine Learning-based Intraday Stock Price Prediction with high-frequency data analysis

QBBong 2024. 12. 22. 22:25


Introduction

Stock market prediction is a complex but fascinating area in financial technology. Our project aimed to develop a machine-learning-based intraday stock price prediction model using high-frequency data. By progressing through three milestones, we refined our approach, integrated meaningful features, and tackled challenges in data handling and model optimization.

This blog post summarizes the journey through each milestone, highlighting our methods, findings, and lessons learned.


Milestone 1: Project Foundation

Objective

Define the scope and framework for the project, set up the foundational tools, and collect preliminary data.

  1. Problem Definition
    The project focused on predicting stock price movements using machine learning. Specifically, we aimed to:
    • Analyze high-frequency intraday data.
    • Incorporate external macroeconomic indicators for enhanced prediction.
  2. Data Collection
    We used yfinance to gather historical minute-by-minute price data for S&P 500 companies, including AAPL. Data fields included:
    • Open, High, Low, Close, Volume, and Ticker.
  3. Initial Setup
    Tools such as Python, pandas, and scikit-learn were used for data preprocessing and exploratory analysis.
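
The data collection step could be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the helper names fetch_minute_bars and tidy_bars, the "5d" period, and the de-duplication rule are assumptions. The yfinance call needs network access, so it is imported only inside the function that uses it.

```python
import pandas as pd

# Price fields named in the post; "Ticker" is added as a column afterwards.
FIELDS = ["Open", "High", "Low", "Close", "Volume"]


def tidy_bars(raw: pd.DataFrame, ticker: str) -> pd.DataFrame:
    """Keep the OHLCV fields, drop repeated timestamps, sort by time, tag the ticker."""
    # Assumption: when a timestamp appears twice, the later row wins.
    bars = raw.loc[~raw.index.duplicated(keep="last"), FIELDS].sort_index()
    return bars.assign(Ticker=ticker)


def fetch_minute_bars(ticker: str, period: str = "5d") -> pd.DataFrame:
    """Download 1-minute bars for one ticker (hypothetical helper, needs network)."""
    import yfinance as yf  # imported lazily so tidy_bars can be used offline

    raw = yf.Ticker(ticker).history(period=period, interval="1m")
    return tidy_bars(raw, ticker)


# Example call (not executed here because it requires network access):
# aapl = fetch_minute_bars("AAPL")
```

Separating the download from the tidying step keeps the cleaning logic testable on any DataFrame that has the same columns.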

Outcome

We laid the groundwork for future milestones by ensuring reliable data access and defining the problem in terms of machine learning.


Milestone 2: Feature Engineering and Labeling

Objective

Engineer meaningful features and label the data for supervised learning.

  1. Feature Engineering
    Introduced Bollinger Bands as a technical indicator to capture stock price volatility:
    • SMA: Simple Moving Average of the price over a rolling window.
    • Upper Band and Lower Band: SMA ± 2 × the rolling standard deviation over the same window.
  2. Triple Barrier Labeling
    Data was labeled using the triple-barrier method:
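    • Profit Target and Stop Loss: The upper and lower barriers, defined based on standard deviation and correction factors.
    • Labels: 1 when the profit target is reached first (upward movement), -1 when the stop loss is reached first (downward movement), and 0 when neither is reached within the time limit (no significant change).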
  3. External Indicators
    Added macroeconomic factors, including:
    • Fed Rate: The Federal Reserve's policy rate, a proxy for the interest-rate environment.
    • Crude Oil Price: Reflects global economic health.
    • VIX: Captures expected market volatility.
  4. Challenges
    • Handling high-frequency data required robust cleaning and interpolation.
    • Balancing labels involved experimenting with correction factors to address class imbalance.
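
The Bollinger Band and triple-barrier steps above might look like the following sketch. The post does not give the window length, the holding horizon, or the exact barrier formula, so the 20-bar window, the 30-bar horizon, and the names up_factor and down_factor (standing in for the correction factors) are assumptions made for illustration. The price series is synthetic.

```python
import numpy as np
import pandas as pd


def bollinger_bands(close: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.DataFrame:
    """SMA plus/minus num_std rolling standard deviations."""
    sma = close.rolling(window).mean()
    std = close.rolling(window).std()
    return pd.DataFrame({"sma": sma, "upper": sma + num_std * std, "lower": sma - num_std * std})


def triple_barrier_labels(close: pd.Series, sigma: pd.Series, horizon: int = 30,
                          up_factor: float = 1.0, down_factor: float = 1.0) -> pd.Series:
    """Label each bar 1, -1, or 0 by which barrier is touched first.

    sigma is a per-bar volatility estimate in return terms. The factors are
    hypothetical stand-ins for the correction factors mentioned in the post.
    """
    prices = close.to_numpy(dtype=float)
    vol = sigma.to_numpy(dtype=float)
    labels = np.full(len(prices), np.nan)  # bars without a full horizon stay NaN
    for i in range(len(prices) - horizon):
        if np.isnan(vol[i]):
            continue
        upper = prices[i] * (1 + up_factor * vol[i])    # profit target
        lower = prices[i] * (1 - down_factor * vol[i])  # stop loss
        label = 0  # time barrier: neither level touched within the horizon
        for p in prices[i + 1 : i + 1 + horizon]:
            if p >= upper:
                label = 1
                break
            if p <= lower:
                label = -1
                break
        labels[i] = label
    return pd.Series(labels, index=close.index, name="label")


# Synthetic random-walk prices, only to show the calls end to end.
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.001, 500))))
bands = bollinger_bands(close)
sigma = close.pct_change().rolling(20).std()
labels = triple_barrier_labels(close, sigma, up_factor=2.0, down_factor=2.0)
print(labels.value_counts(normalize=True))
```

Printing the label proportions while varying up_factor and down_factor is one simple way to see how the correction factors shift the class balance.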

Outcome

This milestone concluded with a fully labeled dataset and a well-engineered feature set for model training.
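
One way the external indicators could be attached to minute bars is a backward as-of join, sketched below. The post does not say how the series were aligned, so this assumes the indicators arrive as daily values and lags them by one calendar day so that a bar never sees a value published later the same day. The function name attach_daily_indicators, the column names, and all numbers are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic minute bars for two sessions (stand-in for the real price data).
idx = pd.date_range("2024-01-02 09:30", periods=5, freq="min").append(
    pd.date_range("2024-01-03 09:30", periods=5, freq="min")
)
bars = pd.DataFrame({"Close": np.linspace(100.0, 101.0, len(idx))}, index=idx)
bars.index.name = "timestamp"

# Synthetic daily indicators; values are made up for the example.
macro = pd.DataFrame(
    {"fed_rate": [5.33, 5.33, 5.33], "crude_oil": [71.0, 70.4, 72.7], "vix": [13.2, 14.0, 13.1]},
    index=pd.date_range("2024-01-01", periods=3, freq="D"),
)
macro.index.name = "date"


def attach_daily_indicators(bars: pd.DataFrame, macro: pd.DataFrame) -> pd.DataFrame:
    """Give every minute bar the most recent daily values from an earlier day."""
    lagged = macro.copy()
    # Stamp each daily row with the following calendar day to avoid look-ahead.
    lagged.index = lagged.index + pd.Timedelta(days=1)
    right = lagged.sort_index().reset_index().rename(columns={"date": "macro_date"})
    merged = pd.merge_asof(
        bars.sort_index().reset_index(),
        right,
        left_on="timestamp",
        right_on="macro_date",
        direction="backward",  # latest daily row at or before the bar's timestamp
    )
    return merged.drop(columns="macro_date").set_index("timestamp")


features = attach_daily_indicators(bars, macro)
print(features.head())
```

Because the join is backward-looking, gaps such as weekends are filled with the last available value rather than with a future one.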


Milestone 3: Model Training and Finalization

Objective

Train machine learning models and evaluate their performance using the engineered features.

  1. Model Selection
    • Experimented with Random Forest Classifier and Support Vector Machines (SVM) for their robustness and interpretability.
  2. Data Splitting
    Divided the dataset into training, validation, and testing sets to ensure unbiased evaluation.
  3. Optimization
    • Correction factors were fine-tuned to achieve balanced labels.
    • Feature importance analysis indicated that the external indicators contributed substantially to the model's predictions.
  4. Performance Evaluation
    Metrics such as precision, recall, and F1-score highlighted strengths in predicting upward movements. Downward movement predictions posed challenges due to class imbalance.
  5. Conclusions
    • Successfully captured market trends using high-frequency data and macroeconomic indicators.
    • Insights gained could be extended to real-time trading systems with further refinements.
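
The splitting, training, and evaluation steps above can be sketched with scikit-learn as follows. Several choices here are assumptions, not details from the project: the split is chronological with a 70/15/15 ratio, class_weight="balanced" is used as one possible response to class imbalance, only the random forest is shown, and the data and feature names are synthetic placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report


def chronological_split(frame: pd.DataFrame, train_frac: float = 0.70, val_frac: float = 0.15):
    """Split rows in time order: earliest rows train, latest rows test."""
    n = len(frame)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]


# Synthetic stand-in for the engineered dataset (rows assumed to be in time order).
rng = np.random.default_rng(42)
n = 1500
data = pd.DataFrame({
    "bb_width": rng.normal(size=n),
    "fed_rate": rng.normal(size=n),
    "crude_oil": rng.normal(size=n),
    "vix": rng.normal(size=n),
})
signal = data["vix"] + 0.5 * data["bb_width"] + rng.normal(scale=0.5, size=n)
data["label"] = np.select([signal > 0.8, signal < -0.8], [1, -1], default=0)

train, val, test = chronological_split(data)
feature_cols = [c for c in data.columns if c != "label"]

# class_weight="balanced" reweights classes inversely to their frequency.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
model.fit(train[feature_cols], train["label"])

# The validation set is where settings would be compared; the test set is used once at the end.
val_pred = model.predict(val[feature_cols])
test_pred = model.predict(test[feature_cols])
print(classification_report(val["label"], val_pred, zero_division=0))
print(classification_report(test["label"], test_pred, zero_division=0))

# Impurity-based importances, largest first.
importances = pd.Series(model.feature_importances_, index=feature_cols).sort_values(ascending=False)
print(importances)
```

A chronological split is used in this sketch because shuffling time-ordered bars would let the model train on observations that come after the ones it is evaluated on.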

Conclusion

This project demonstrated the potential of combining technical and macroeconomic features to predict stock price movements. Each milestone addressed critical aspects of the problem, from foundational setup to feature engineering and final model training.

While the results were promising, challenges such as data imbalance and computational efficiency remain areas for future exploration. This journey reinforced the importance of data preprocessing, feature selection, and iterative optimization in developing predictive models.


What’s Next?

For future work, integrating deep learning models like LSTMs could capture temporal dependencies in high-frequency data. Additionally, experimenting with real-time data pipelines may bring the project closer to deployment in live trading scenarios.

 

(Milestone3)Presentation.pdf
1.8 MB
(Final)Project_Report.pdf
1.7 MB

 
