toyeade1 · October 31, 2024 01:51
diff --git a/firecrawl_funding_db_demo.ipynb b/firecrawl_funding_db_demo.ipynb
 {
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/alexfazio/2671a628e4b10e08974aea4f561981ae/firecrawl_funding_db_demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "intro-section"
      },
      "source": [
        "# Firecrawl // VC Investment Data Extractor\n",
        "This notebook implements a structured scraper to extract startup investment data from TechCrunch articles using Firecrawl's Map and LLM Extract features. The scraper follows a three-step process:\n",
        "1. Find and retrieve article URLs from only the first three pagination pages in the TechCrunch startups category\n",
        "2. Extract all article URLs found within these three pagination pages\n",
        "3. Extract investment data from all the individual articles collected from those pages\n",
        "Each step includes validation and error handling for reliable data collection. While the scraper processes all articles found within the first three pagination pages, this focused approach ensures we capture recent startup investment data while maintaining efficient execution."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "setup-section"
      },
      "source": [
        "## Setup\n",
        "\n",
        "First, let's install required packages and set up our environment:\n",
        "from copy import deepcopy"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "install-packages"
      },
      "outputs": [],
      "source": [
        "%pip install firecrawl-py pydantic pandas --quiet"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "89881013"
      },
      "outputs": [],
      "source": [
        "\n",
        "from pydantic import BaseModel\n",
        "\n",
        "# Assuming ArticleURLSchema is a Pydantic model based on the context\n",
        "class ArticleURLSchema(BaseModel):\n",
        "    article_url: str\n",
        "    article_title: str\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "imports"
      },
      "outputs": [],
      "source": [
        "from firecrawl import FirecrawlApp\n",
        "from pydantic import BaseModel\n",
        "from getpass import getpass\n",
        "import re\n",
        "from typing import List, Optional, Set\n",
        "from urllib.parse import urljoin"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "api-key-section"
      },
      "source": [
        "## API Key Configuration\n",
        "\n",
        "Set up your Firecrawl API key:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "api-key-setup"
      },
      "outputs": [],
      "source": [
        "api_key = getpass(\"Enter your Firecrawl API key: \")\n",
        "app = FirecrawlApp(api_key=api_key)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "schemas-section"
      },
      "source": [
        "## Data Schemas\n",
        "\n",
        "Define our data extraction schemas using Pydantic:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "schema-definitions"
      },
      "outputs": [],
      "source": [
        "from typing import Dict, Any, Optional\n",
        "\n",
        "class PriceNumber(BaseModel):\n",
        "    format: str = \"dollar\"\n",
        "    value: float\n",
        "\n",
        "class Price(BaseModel):\n",
        "    name: str = \"Price\"\n",
        "    type: str = \"number\"\n",
        "    number: PriceNumber\n",
        "\n",
        "class InvestmentSchema(BaseModel):\n",
        "    \"\"\"Schema for extracting investment data from articles\"\"\"\n",
        "    investor: Optional[str] = None\n",
        "    investee: Optional[str] = None\n",
        "    investment_amount: Optional[Dict[str, Price]] = None\n",
        "    funding_stage: Optional[str] = None\n",
        "    focus_area: Optional[str] = None\n",
        "    publication_date: Optional[str] = None\n",
        "    source_article_url: Optional[str] = None"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "validation-section"
      },
      "source": [
        "## URL Validation Functions\n",
        "\n",
        "Define helper functions for URL validation:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "validation-functions"
      },
      "outputs": [],
      "source": [
        "def is_pagination_url(url: str) -> bool:\n",
        "    \"\"\"Validate if a URL is a TechCrunch startups category pagination page\"\"\"\n",
        "    # Handle special cases for page 1\n",
        "    if url in [\"https://techcrunch.com/category/startups/\",\n",
        "              \"https://techcrunch.com/category/startups\"]:\n",
        "        return True\n",
        "\n",
        "    # Match pagination URL pattern\n",
        "    return bool(re.match(r\"^https://techcrunch\\.com/category/startups/page/[0-9]+/?$\", url))\n",
        "\n",
        "def is_valid_article_url(url: str) -> bool:\n",
        "    \"\"\"Validate if a URL is a TechCrunch article URL\"\"\"\n",
        "    return bool(re.match(r\"^https://techcrunch\\.com/[0-9]{4}/[0-9]{2}/[0-9]{2}/\", url))\n",
        "\n",
        "def extract_page_number(url: str) -> Optional[int]:\n",
        "    \"\"\"Extract the page number from a pagination URL\"\"\"\n",
        "    # Handle special cases for page 1\n",
        "    if url in [\"https://techcrunch.com/category/startups/\",\n",
        "              \"https://techcrunch.com/category/startups\"]:\n",
        "        return 1\n",
        "\n",
        "    match = re.match(r\"^https://techcrunch\\.com/category/startups/page/([0-9]+)/?$\", url)\n",
        "    return int(match.group(1)) if match else None\n",
        "\n",
        "def is_first_three_pages(url: str) -> bool:\n",
        "    \"\"\"Check if URL is one of the first three pagination pages\"\"\"\n",
        "    page_num = extract_page_number(url)\n",
        "    return page_num is not None and 1 <= page_num <= 3"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "step1-section"
      },
      "source": [
        "## Step 1: Find Pagination Pages\n",
        "\n",
        "Function to discover and validate pagination pages using Firecrawl Map:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "find-pagination"
      },
      "outputs": [],
      "source": [
        "def find_pagination_pages(app: FirecrawlApp) -> List[str]:\n",
        "    \"\"\"Find and validate the first three pagination pages\"\"\"\n",
        "    print(\"Finding pagination pages...\")\n",
        "    base_url = \"https://techcrunch.com/category/startups\"\n",
        "    initial_map = app.map_url(base_url)\n",
        "\n",
        "    if not initial_map.get('success'):\n",
        "        print(\"Failed to map startups category page\")\n",
        "        return []\n",
        "\n",
        "    # Use a set to avoid duplicates\n",
        "    seen_urls = set()\n",
        "    pagination_urls = []\n",
        "\n",
        "    # Add base URL (normalize to version without trailing slash)\n",
        "    base_url = base_url.rstrip('/')\n",
        "    seen_urls.add(base_url)\n",
        "    pagination_urls.append(base_url)\n",
        "\n",
        "    # Process other pagination pages\n",
        "    for url in initial_map['links']:\n",
        "        # Normalize URL by removing trailing slash\n",
        "        url = url.rstrip('/')\n",
        "        if (url not in seen_urls and\n",
        "            is_pagination_url(url) and\n",
        "            is_first_three_pages(url)):\n",
        "            page_num = extract_page_number(url)\n",
        "            if page_num is not None and 1 <= page_num <= 3:\n",
        "                seen_urls.add(url)\n",
        "                pagination_urls.append(url)\n",
        "\n",
        "    # Sort by page number\n",
        "    pagination_urls.sort(key=lambda x: extract_page_number(x) or float('inf'))\n",
        "\n",
        "    print(f\"Found {len(pagination_urls)} pagination pages\")\n",
        "    return pagination_urls"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "step2-section"
      },
      "source": [
        "## Step 2: Extract Article URLs\n",
        "\n",
        "Function to extract and validate article URLs from pagination pages:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "extract-articles"
      },
      "outputs": [],
      "source": [
        "def extract_article_urls(app: FirecrawlApp, pagination_urls: List[str]) -> List[dict]:\n",
        "    \"\"\"Extract article URLs and titles from pagination pages\"\"\"\n",
        "    all_articles = []\n",
        "\n",
        "    print(\"\\nExtracting article URLs and titles from pagination pages...\")\n",
        "    for page_url in pagination_urls:\n",
        "        print(f\"\\nProcessing pagination page: {page_url}\")\n",
        "\n",
        "        try:\n",
        "            response = app.scrape_url(\n",
        "                page_url,\n",
        "                {\n",
        "                    'formats': ['extract'],\n",
        "                    'extract': {\n",
        "                        'schema': {\n",
        "                            'type': 'object',\n",
        "                            'properties': {\n",
        "                                'articles': {\n",
        "                                    'type': 'array',\n",
        "                                    'items': ArticleURLSchema.model_json_schema()\n",
        "                                }\n",
        "                            }\n",
        "                        },\n",
        "                        'prompt': \"\"\"Extract all article URLs and their titles from this page.\n",
        "                                   Only include URLs that follow the pattern techcrunch.com/YYYY/MM/DD/.\n",
        "                                   Return them as a list in the articles field, where each item has\n",
        "                                   article_url and article_title.\"\"\"\n",
        "                    }\n",
        "                }\n",
        "            )\n",
        "\n",
        "            print(\"Response received:\", response)\n",
        "\n",
        "            if 'extract' in response and 'articles' in response['extract']:\n",
        "                extracted_articles = response['extract']['articles']\n",
        "                print(f\"Extracted {len(extracted_articles)} articles\")\n",
        "\n",
        "                valid_articles = [\n",
        "                    article for article in extracted_articles\n",
        "                    if is_valid_article_url(article['article_url'])\n",
        "                ]\n",
        "                all_articles.extend(valid_articles)\n",
        "                print(f\"Added {len(valid_articles)} valid articles to processing list\")\n",
        "            else:\n",
        "                print(\"No valid articles found in response\")\n",
        "                print(\"Response content:\", response)\n",
        "\n",
        "        except Exception as e:\n",
        "            print(f\"Error processing page {page_url}: {str(e)}\")\n",
        "\n",
        "    return all_articles"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "step3-section"
      },
      "source": [
        "## Step 3: Extract Investment Data\n",
        "\n",
        "Function to extract investment data from individual articles:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "extract-investments"
      },
      "outputs": [],
      "source": [
        "def extract_investment_data(app: FirecrawlApp, articles: List[dict]) -> List[dict]:\n",
        "    \"\"\"Extract investment data from articles with improved validation\"\"\"\n",
        "    processed_articles = []\n",
        "\n",
        "    print(f\"\\nProcessing {len(articles)} articles...\")\n",
        "    for article in articles:\n",
        "        print(f\"\\nExtracting investment data from: {article['article_url']}\")\n",
        "        print(f\"Article title: {article['article_title']}\")\n",
        "\n",
        "        try:\n",
        "            response = app.scrape_url(\n",
        "                article['article_url'],\n",
        "                {\n",
        "                    'formats': ['extract'],\n",
        "                    'extract': {\n",
        "                        'schema': InvestmentSchema.model_json_schema(),\n",
        "                        'prompt': \"\"\"Extract investment information from this TechCrunch article.\n",
        "                                   For the investor field, extract the name of the investing entity (company, VC firm, or individual).\n",
        "                                   For the investee field, extract the name of the company receiving the investment.\n",
        "                                   For investment_amount, extract the dollar amount in the following format:\n",
        "                                   {\n",
        "                                       \"Price\": {\n",
        "                                           \"name\": \"Price\",\n",
        "                                           \"type\": \"number\",\n",
        "                                           \"number\": {\n",
        "                                               \"format\": \"dollar\",\n",
        "                                               \"value\": <numerical_value>\n",
        "                                           }\n",
        "                                       }\n",
        "                                   }\n",
        "                                   For funding_stage, extract the stage of funding (e.g., Seed, Series A, Series B, etc.).\n",
        "                                   For focus_area, extract a brief description of what the investee company does or their main business area.\n",
        "                                   For publication_date, extract the article's publication date in YYYY-MM-DD format.\n",
        "                                   Return null for any fields where the information is not clearly stated.\n",
        "                                   Only extract actual investment amounts, not revenue, valuations, or other financial metrics.\"\"\"\n",
        "                    }\n",
        "                }\n",
        "            )\n",
        "\n",
        "            if response.get('extract'):\n",
        "                investment_data = response['extract']\n",
        "                print(\"\\nExtracted data:\", investment_data)\n",
        "\n",
        "                # Check for investment amount in the required format\n",
        "                amount = investment_data.get('investment_amount')\n",
        "                has_valid_amount = False\n",
        "\n",
        "                if amount and isinstance(amount, dict):\n",
        "                    price_info = amount.get('Price')\n",
        "                    if (isinstance(price_info, dict) and\n",
        "                        price_info.get('number', {}).get('value') is not None):\n",
        "                        has_valid_amount = True\n",
        "                        print(f\"Found investment amount: {price_info['number']['value']}\")\n",
        "\n",
        "                if has_valid_amount:\n",
        "                    # Create processed record with new fields\n",
        "                    processed_record = {\n",
        "                        'investor': investment_data.get('investor', 'Not specified'),\n",
        "                        'investee': investment_data.get('investee', 'Not specified'),\n",
        "                        'investment_amount': amount,\n",
        "                        'funding_stage': investment_data.get('funding_stage', 'Not specified'),\n",
        "                        'focus_area': investment_data.get('focus_area', 'Not specified'),\n",
        "                        'publication_date': investment_data.get('publication_date', 'Not specified'),\n",
        "                        'article_title': article['article_title'],\n",
        "                        'article_url': article['article_url']\n",
        "                    }\n",
        "                    processed_articles.append(processed_record)\n",
        "                    print(\"✓ Successfully extracted and validated investment data\")\n",
        "                else:\n",
        "                    print(\"✗ No valid investment amount found in required format\")\n",
        "\n",
        "                # Show all extracted fields for debugging\n",
        "                for field in ['investor', 'investee', 'funding_stage', 'focus_area', 'publication_date']:\n",
        "                    if investment_data.get(field):\n",
        "                        print(f\"Found {field}: {investment_data[field]}\")\n",
        "            else:\n",
        "                print(\"✗ No data extracted from article\")\n",
        "                print(\"Raw response:\", response)\n",
        "\n",
        "        except Exception as e:\n",
        "            print(f\"✗ Error processing article: {str(e)}\")\n",
        "\n",
        "        print(\"-\" * 80)\n",
        "\n",
        "    print(f\"\\nSummary: Successfully processed {len(processed_articles)} articles with valid investment data\")\n",
        "    if processed_articles:\n",
        "        print(\"\\nExtracted investments:\")\n",
        "        for article in processed_articles:\n",
        "            print(f\"\\n• Amount: {article['investment_amount']['Price']['number']['value']}\")\n",
        "            print(f\"  Investor: {article['investor']}\")\n",
        "            print(f\"  Investee: {article['investee']}\")\n",
        "            print(f\"  Funding Stage: {article['funding_stage']}\")\n",
        "            print(f\"  Focus Area: {article['focus_area']}\")\n",
        "            print(f\"  Publication Date: {article['publication_date']}\")\n",
        "            print(f\"  Article: {article['article_title']}\")\n",
        "            print(f\"  URL: {article['article_url']}\")\n",
        "\n",
        "    return processed_articles\n",
        "\n",
        "def parse_investment_amount(amount_dict):\n",
        "    \"\"\"Parse investment amount and currency from the amount dictionary\"\"\"\n",
        "    if not amount_dict or 'Price' not in amount_dict:\n",
        "        return None, None\n",
        "\n",
        "    try:\n",
        "        value = amount_dict['Price']['number']['value']\n",
        "        currency = 'USD' if amount_dict['Price']['number']['format'] == 'dollar' else None\n",
        "        return value, currency\n",
        "    except (KeyError, TypeError):\n",
        "        return None, None"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "main-section"
      },
      "source": [
        "## Main Execution Function\n",
        "\n",
        "Combine all steps into a single function:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "main-function"
      },
      "outputs": [],
      "source": [
        "def scrape_techcrunch_investments(max_articles_to_process) -> List[dict]:\n",
        "    \"\"\"Main function to orchestrate the TechCrunch investment data scraping process\"\"\"\n",
        "    # Step 1: Find pagination pages\n",
        "    pagination_urls = find_pagination_pages(app)\n",
        "    if not pagination_urls:\n",
        "        return []\n",
        "\n",
        "    # Step 2: Extract article URLs\n",
        "    article_urls = extract_article_urls(app, pagination_urls)\n",
        "    if not article_urls:\n",
        "        return []\n",
        "\n",
        "    # Step 3: Extract investment data\n",
        "    total_articles = len(article_urls)\n",
        "\n",
        "    if max_articles_to_process > 0:\n",
        "        if max_articles_to_process > total_articles:\n",
        "            print(f\"\\nRequested to process {max_articles_to_process} articles but only {total_articles} articles available.\")\n",
        "            print(f\"Will process all {total_articles} available articles...\")\n",
        "        else:\n",
        "            print(f\"\\nLimiting processing to {max_articles_to_process} articles...\")\n",
        "        articles_to_process = article_urls[:max_articles_to_process]\n",
        "    else:\n",
        "        print(f\"\\nNo limit set - processing all {total_articles} found articles...\")\n",
        "        articles_to_process = article_urls\n",
        "\n",
        "    return extract_investment_data(app, articles_to_process)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "execution-section"
      },
      "source": [
        "## Execute the Scraper\n",
        "\n",
        "Run the scraper and process the results:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "run-scraper"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "\n",
        "# Set the maximum number of articles to process (change as needed)\n",
        "# If set to 0, no limit will be imposed on the number of articles to process\n",
        "max_articles_to_process = 0  # Example default limit\n",
        "\n",
        "# Run the scraper (API calls happen only here)\n",
        "print(f\"Starting scraper with{'out a limit' if max_articles_to_process == 0 else f' a limit of {max_articles_to_process}'} articles...\")\n",
        "results = scrape_techcrunch_investments(max_articles_to_process)\n",
        "\n",
        "# Helper function to extract amount and currency\n",
        "def parse_investment_amount(amount_dict):\n",
        "    if not amount_dict or 'Price' not in amount_dict:\n",
        "        return None, None\n",
        "\n",
        "    try:\n",
        "        value = amount_dict['Price']['number']['value']\n",
        "        currency = 'USD' if amount_dict['Price']['number']['format'] == 'dollar' else None\n",
        "        return value, currency\n",
        "    except (KeyError, TypeError):\n",
        "        return None, None\n",
        "\n",
        "# Process the results (no API calls)\n",
        "processed_results = [\n",
        "    {\n",
        "        'amount': parse_investment_amount(article['investment_amount'])[0],\n",
        "        'currency': parse_investment_amount(article['investment_amount'])[1],\n",
        "        'investor': article.get('investor', 'Not specified'),\n",
        "        'investee': article.get('investee', 'Not specified'),\n",
        "        'funding_stage': article.get('funding_stage', 'Not specified'),\n",
        "        'focus_area': article.get('focus_area', 'Not specified'),\n",
        "        'publication_date': article.get('publication_date', 'Not specified'),\n",
        "        'article_title': article['article_title'],\n",
        "        'article_url': article['article_url']\n",
        "    }\n",
        "    for article in results\n",
        "]\n",
        "\n",
        "# Create DataFrame (no API calls)\n",
        "df = pd.DataFrame(processed_results)\n",
        "\n",
        "# Print results summary (no API calls)\n",
        "print(f\"\\nFound {len(results)} articles with investment data\")\n",
        "for result in processed_results:\n",
        "    print(f\"\\nInvestment: {result['amount']} {result['currency']}\")\n",
        "    print(f\"Investor: {result['investor']}\")\n",
        "    print(f\"Investee: {result['investee']}\")\n",
        "    print(f\"Funding Stage: {result['funding_stage']}\")\n",
        "    print(f\"Focus Area: {result['focus_area']}\")\n",
        "    print(f\"Publication Date: {result['publication_date']}\")\n",
        "    print(f\"Article: {result['article_title']}\")\n",
        "    print(f\"URL: {result['article_url']}\")\n",
        "\n",
        "# Display the DataFrame\n",
        "print(\"\\nDataFrame of results:\")\n",
        "display(df)"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Exporting Extracted Content\n",
        "\n",
        "After extracting the content, we have several options for exporting and storing the data. In this notebook, we'll demonstrate two export methods:\n",
        "\n",
        "1. Exporting to Rentry.co, a simple pastebin-like service\n",
        "2. Exporting to Google Docs"
      ],
      "metadata": {
        "id": "zFyMkqoob6bL"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Export to Google Docs\n",
        "\n",
        "Google Docs offers a versatile platform for storing, sharing, and formatting scraped investment data, ideal for:\n",
        "\n",
        "1. Long-term storage  \n",
        "2. Collaborative analysis  \n",
        "3. Professional reporting  \n",
        "4. Controlled stakeholder sharing  \n",
        "\n",
        "### Advantages:\n",
        "- No character limits  \n",
        "- Rich formatting capabilities  \n",
        "- Access control and sharing  \n",
        "- Version tracking  \n",
        "- Collaboration features  \n",
        "\n",
        "### Data Exported:\n",
        "- Investment amount and currency  \n",
        "- Investor and investee details  \n",
        "- Funding stage  \n",
        "- Company focus  \n",
        "- Publication dates  \n",
        "- Article titles and URLs  \n",
        "\n",
        "The export process will:  \n",
        "1. Authenticate with the Google Docs API  \n",
        "2. Create a timestamped document  \n",
        "3. Format content for readability  \n",
        "4. Provide a document link  \n",
        "\n",
        "Note: This method requires Google Cloud credentials and API setup but offers greater flexibility and professional presentation compared to Rentry.co."
      ],
      "metadata": {
        "id": "-MGNFZn8cxyN"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wDJGT00nbGQb"
      },
      "outputs": [],
      "source": [
        "# Install required packages\n",
        "%pip install gspread oauth2client pandas --quiet\n",
        "\n",
        "import gspread\n",
        "from oauth2client.service_account import ServiceAccountCredentials\n",
        "from google.colab import files\n",
        "import pandas as pd\n",
        "from datetime import datetime\n",
        "import traceback  # Added for better error tracking\n",
        "\n",
        "def export_df_to_gsheets(df, spreadsheet_name=None):\n",
        "    \"\"\"Export DataFrame to Google Sheets using service account\"\"\"\n",
        "    try:\n",
        "        print(\"Please upload your Google Service Account JSON key file...\")\n",
        "        uploaded = files.upload()\n",
        "\n",
        "        if not uploaded:\n",
        "            raise ValueError(\"No file was uploaded\")\n",
        "\n",
        "        # Get the filename of the uploaded credentials\n",
        "        creds_filename = list(uploaded.keys())[0]\n",
        "        print(f\"Using credentials file: {creds_filename}\")\n",
        "\n",
        "        # Define the scope\n",
        "        scope = [\n",
        "            'https://spreadsheets.google.com/feeds',\n",
        "            'https://www.googleapis.com/auth/spreadsheets',\n",
        "            'https://www.googleapis.com/auth/drive'\n",
        "        ]\n",
        "\n",
        "        try:\n",
        "            # Authenticate using service account\n",
        "            print(\"Attempting to authenticate...\")\n",
        "            creds = ServiceAccountCredentials.from_json_keyfile_name(creds_filename, scope)\n",
        "            client = gspread.authorize(creds)\n",
        "            print(\"Authentication successful!\")\n",
        "        except Exception as auth_error:\n",
        "            print(f\"Authentication failed: {str(auth_error)}\")\n",
        "            print(traceback.format_exc())\n",
        "            return False\n",
        "\n",
        "        # Generate default spreadsheet name if none provided\n",
        "        if not spreadsheet_name:\n",
        "            timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
        "            spreadsheet_name = f\"TechCrunch_Investment_Data_{timestamp}\"\n",
        "\n",
        "        print(f\"Creating spreadsheet: {spreadsheet_name}\")\n",
        "\n",
        "        try:\n",
        "            # Create new spreadsheet\n",
        "            spreadsheet = client.create(spreadsheet_name)\n",
        "            print(\"Spreadsheet created successfully!\")\n",
        "\n",
        "            # Share with anyone who has the link\n",
        "            spreadsheet.share(None, perm_type='anyone', role='reader')\n",
        "            print(\"Sharing settings updated!\")\n",
        "\n",
        "            # Select the first sheet\n",
        "            worksheet = spreadsheet.get_worksheet(0)\n",
        "\n",
        "            # Convert DataFrame to list of lists\n",
        "            print(\"Preparing data for upload...\")\n",
        "            data = [df.columns.values.tolist()] + df.fillna('Not specified').values.tolist()\n",
        "\n",
        "            # Update the sheet\n",
        "            print(\"Uploading data...\")\n",
        "            worksheet.update(data)\n",
        "\n",
        "            # Format header row\n",
        "            print(\"Formatting spreadsheet...\")\n",
        "            worksheet.format('A1:Z1', {\n",
        "                'textFormat': {'bold': True},\n",
        "                'backgroundColor': {'red': 0.8, 'green': 0.8, 'blue': 0.8}\n",
        "            })\n",
        "\n",
        "            # Auto-resize columns\n",
        "            worksheet.columns_auto_resize(0, len(df.columns))\n",
        "\n",
        "            print(f\"\\nSuccessfully exported data to Google Sheets!\")\n",
        "            print(f\"Spreadsheet URL: https://docs.google.com/spreadsheets/d/{spreadsheet.id}\")\n",
        "            return True\n",
        "\n",
        "        except Exception as sheet_error:\n",
        "            print(f\"Error during spreadsheet operations: {str(sheet_error)}\")\n",
        "            print(traceback.format_exc())\n",
        "            return False\n",
        "\n",
        "    except Exception as e:\n",
        "        print(f\"\\nGeneral error during export: {str(e)}\")\n",
        "        print(traceback.format_exc())\n",
        "        return False\n",
        "\n",
        "# Execute the export\n",
        "print(\"Exporting DataFrame to Google Sheets...\")\n",
        "timestamp = datetime.now().strftime(\"%Y%m%d_%H%M%S\")\n",
        "spreadsheet_name = f\"TechCrunch_Investment_Data_{timestamp}\"\n",
        "success = export_df_to_gsheets(df, spreadsheet_name)\n",
        "\n",
        "if success:\n",
        "    print(\"\\nExport completed successfully!\")\n",
        "else:\n",
        "    print(\"\\nExport failed. Please check the error messages above.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vlVXZw2nbGQa"
      },
      "source": [
        "## Export to Rentry.co\n",
        "\n",
        "Rentry.co is a simple, markdown-friendly pastebin service that allows us to quickly share our scraped investment data. This export method is ideal for:\n",
        "\n",
        "1. Quick sharing of results\n",
        "2. Markdown-formatted data viewing\n",
        "3. Temporary storage of scraped data\n",
        "\n",
        "### Limitations:\n",
        "- Maximum content length: 200,000 characters\n",
        "- Basic formatting only (markdown)\n",
        "- Public access (anyone with the URL can view)\n",
        "\n",
        "### What gets exported:\n",
        "- Investment amount and currency\n",
        "- Investor and investee details\n",
        "- Funding stage information\n",
        "- Company focus areas\n",
        "- Publication dates\n",
        "- Article titles and URLs\n",
        "\n",
        "The export process will:\n",
        "1. Clean and format the data as markdown\n",
        "2. Create a new Rentry post with a random URL\n",
        "3. Return both the viewing URL and an edit code for future updates\n",
        "\n",
        "Note: If your dataset exceeds 200,000 characters, consider using alternative export methods like Google Docs or splitting the data into multiple Rentry posts."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "p32ZCM-lbGQa"
      },
      "outputs": [],
      "source": [
        "# @title Export to Rentry.com\n",
        "\n",
        "# Install required package\n",
        "%pip install tabulate --quiet\n",
        "\n",
        "import requests\n",
        "import os\n",
        "import re\n",
        "from datetime import datetime\n",
        "\n",
        "def format_markdown_content(df):\n",
        "    \"\"\"Format DataFrame as markdown with title and timestamp\"\"\"\n",
        "    current_time = datetime.now().strftime(\"%Y-%m-%d %H:%M:%S\")\n",
        "\n",
        "    markdown_content = f\"\"\"# TechCrunch Investment Data\n",
        "Generated on: {current_time}\n",
        "\n",
        "## Investment Summary\n",
        "Total articles analyzed: {len(df)}\n",
        "Total investments found: {len(df[df['amount'].notna()])}\n",
        "\n",
        "## Full Dataset\n",
        "\"\"\"\n",
        "    # Convert DataFrame to markdown, handling NaN values\n",
        "    table_markdown = df.fillna('Not specified').to_markdown(index=False)\n",
        "    markdown_content += table_markdown\n",
        "\n",
        "    return markdown_content\n",
        "\n",
        "def post_to_rentry(content):\n",
        "    \"\"\"Post content to Rentry and return URL and edit code\"\"\"\n",
        "    base_url = 'https://rentry.co'\n",
        "    api_url = f\"{base_url}/api/new\"\n",
        "\n",
        "    # Create session and get CSRF token\n",
        "    session = requests.Session()\n",
        "    response = session.get(base_url)\n",
        "    csrf_token = session.cookies.get('csrftoken')\n",
        "\n",
        "    # Prepare payload\n",
        "    payload = {\n",
        "        'csrfmiddlewaretoken': csrf_token,\n",
        "        'url': '',  # Empty for random URL\n",
        "        'edit_code': '',  # Empty for random edit code\n",
        "        'text': content\n",
        "    }\n",
        "\n",
        "    headers = {\n",
        "        \"Referer\": base_url,\n",
        "        \"X-CSRFToken\": csrf_token\n",
        "    }\n",
        "\n",
        "    try:\n",
        "        response = session.post(api_url, data=payload, headers=headers)\n",
        "        result = response.json()\n",
        "\n",
        "        if result.get('status') == '200':\n",
        "            return result.get('url'), result.get('edit_code')\n",
        "        else:\n",
        "            print(f\"Error posting to Rentry: {result}\")\n",
        "            return None, None\n",
        "\n",
        "    except Exception as e:\n",
        "        print(f\"Exception while posting to Rentry: {e}\")\n",
        "        return None, None\n",
        "\n",
        "def export_df_to_rentry(df):\n",
        "    \"\"\"Main function to export DataFrame to Rentry\"\"\"\n",
        "    # Format content\n",
        "    content = format_markdown_content(df)\n",
        "\n",
        "    # Check content length\n",
        "    if len(content) > 200000:\n",
        "        print(\"Content exceeds Rentry's 200,000 character limit!\")\n",
        "        return None, None\n",
        "\n",
        "    # Post to Rentry\n",
        "    url, edit_code = post_to_rentry(content)\n",
        "\n",
        "    if url and edit_code:\n",
        "        print(\"\\nSuccessfully exported to Rentry!\")\n",
        "        print(f\"View URL: {url}\")\n",
        "        print(f\"Edit code: {edit_code}\")\n",
        "        print(\"\\nSave these details if you want to update the document later.\")\n",
        "        return url, edit_code\n",
        "    else:\n",
        "        print(\"\\nFailed to export to Rentry.\")\n",
        "        return None, None\n",
        "\n",
        "# Execute the export\n",
        "rentry_url, rentry_edit_code = export_df_to_rentry(df)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.9.19"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
 }
No results found