Dragonfruit.ai Challenge Directory Structure and Execution Guide

This challenge directory contains the following files and directories:

  • challenge.py
  • RajMaharajwala_Dragonfruit_ai.md
  • requirements.txt
  • data_1000 (created after execution of challenge.py)

Directory Description:

  • challenge.py: The main Python script for the project. It contains the functions for image generation, processing, and analysis described in the challenge.
  • RajMaharajwala_Dragonfruit_ai.md: This file contains the answers to the challenge questions.
  • requirements.txt: This file lists the dependencies and packages required to run the project successfully, ensuring that the necessary libraries are installed in the environment.
  • data_1000: If you run challenge.py with an image size of 1000x1000 pixels, a directory named data_1000 is created. It contains the images generated by challenge.py, along with a subdirectory named final where the final merged microscope and cancer images are stored. The number in the directory name indicates the image size used for the run.

Execution Guide:

To execute the project or run the provided code, follow these steps:

  1. Ensure that Python is installed on your system. You can download and install Python from the official Python website.

  2. Install the required dependencies listed in requirements.txt. You can do this using the following command in your terminal or command prompt:

    pip install -r requirements.txt
    
  3. Once the dependencies are installed, you can run the challenge.py script to perform image processing, generation, and analysis. Use the following command:

    python challenge.py
    
  4. The script may take some time to execute, depending on the size of the images and the complexity of the operations performed. For faster execution, the image size is currently set to 1000x1000; per the project specification, the actual images would be 100,000x100,000 pixels.

  5. After execution, you can explore the generated output files and directories to analyze the results or further investigate the project's outcomes. The output directory is named data_<size> (for example, data_1000).

Answers:

For more detailed explanations and solutions related to the challenge, refer to the document RajMaharajwala_Dragonfruit_ai.md.

challenge.py:

import numpy as np
import random
from PIL import Image
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time
import os


def create_data_folders(size):
    # Create the output folder f'data_{size}' and its 'final' subfolder
    data_folder = f'data_{size}'
    final_folder = os.path.join(data_folder, 'final')
    if not os.path.exists(data_folder):
        os.makedirs(data_folder)
        print(f"Folder '{data_folder}' created successfully.")
    else:
        print(f"Folder '{data_folder}' already exists.")
    if not os.path.exists(final_folder):
        os.makedirs(final_folder)
        print(f"Folder '{final_folder}' created successfully.")
    else:
        print(f"Folder '{final_folder}' already exists.")


def generate_image(height, width):
    # Calculate centroid
    centroid_x = width // 2
    centroid_y = height // 2
    # Calculate side length so the three squares together cover ~25% of the image
    side = int(np.sqrt(height * width * 0.25 / 3))
    # Create an empty image
    image = np.zeros((height, width), dtype=np.uint8)
    # Calculate the position of the central square blob
    start_x = centroid_x - side // 2
    end_x = start_x + side
    start_y = centroid_y - side // 2
    end_y = start_y + side
    # Add the square blob in the center
    image[start_y:end_y, start_x:end_x] = 255
    # Randomly select two of the four sides
    selected_sides = random.sample(['top', 'right', 'bottom', 'left'], 2)
    # Add the adjacent square blobs
    for selected_side in selected_sides:
        if selected_side == 'top':
            adjacent_start_x = centroid_x - side // 2
            adjacent_end_x = centroid_x + side // 2
            adjacent_start_y = centroid_y - side - side // 2
            adjacent_end_y = centroid_y - side // 2
        elif selected_side == 'right':
            adjacent_start_x = centroid_x + side // 2
            adjacent_end_x = centroid_x + side + side // 2
            adjacent_start_y = centroid_y - side // 2
            adjacent_end_y = centroid_y + side // 2
        elif selected_side == 'bottom':
            adjacent_start_x = centroid_x - side // 2
            adjacent_end_x = centroid_x + side // 2
            adjacent_start_y = centroid_y + side // 2
            adjacent_end_y = centroid_y + side + side // 2
        else:  # selected_side == 'left'
            adjacent_start_x = centroid_x - side - side // 2
            adjacent_end_x = centroid_x - side // 2
            adjacent_start_y = centroid_y - side // 2
            adjacent_end_y = centroid_y + side // 2
        # Add the adjacent square blob
        image[adjacent_start_y:adjacent_end_y, adjacent_start_x:adjacent_end_x] = 255
    return image


def dye_image(height, width):
    # Create an empty binary image
    image = np.zeros((height, width), dtype=np.uint8)
    # Pick a random starting point
    x = random.randint(0, width - 1)
    y = random.randint(0, height - 1)
    # Generate the lines (a random walk of short strokes)
    for _ in range(2 * height):
        direction = random.choice(['up', 'down', 'left', 'right',
                                   'diagonal_up_right', 'diagonal_up_left',
                                   'diagonal_down_right', 'diagonal_down_left'])
        length = random.randint(10, 20)
        if direction == 'up':
            y -= length
            y = max(y, 0)
            image[y:y + length, x] = 255
        elif direction == 'down':
            y += length
            y = min(y, height - 1)
            image[y - length:y, x] = 255
        elif direction == 'left':
            x -= length
            x = max(x, 0)
            image[y, x:x + length] = 255
        elif direction == 'right':
            x += length
            x = min(x, width - 1)
            image[y, x - length:x] = 255
        elif direction == 'diagonal_up_right':
            for i in range(length):
                y -= 1
                x += 1
                y = max(y, 0)
                x = min(x, width - 1)
                image[y, x] = 255
        elif direction == 'diagonal_up_left':
            for i in range(length):
                y -= 1
                x -= 1
                y = max(y, 0)
                x = max(x, 0)
                image[y, x] = 255
        elif direction == 'diagonal_down_right':
            for i in range(length):
                y += 1
                x += 1
                y = min(y, height - 1)
                x = min(x, width - 1)
                image[y, x] = 255
        elif direction == 'diagonal_down_left':
            for i in range(length):
                y += 1
                x -= 1
                y = min(y, height - 1)
                x = max(x, 0)
                image[y, x] = 255
    return image


def count_white_pixels(image):
    # Calculate the number of white (255) pixels in the image
    return np.sum(image == 255)


def subtract_images(image1, image2):
    # Cast to a signed type first so the subtraction cannot wrap around in uint8
    result = image1.astype(np.int16) - image2.astype(np.int16)
    result = np.clip(result, 0, 255)  # Clip values to the range [0, 255]
    return result.astype(np.uint8)


def merge_chunks(image_chunks, height, width, chunk_height, chunk_width):
    # Merge image chunks (in row-major order) into a single image
    merged_image = np.zeros((height, width), dtype=np.uint8)
    row_idx = 0
    col_idx = 0
    for chunk in image_chunks:
        merged_image[row_idx:row_idx + chunk_height, col_idx:col_idx + chunk_width] = chunk
        col_idx += chunk_width
        if col_idx >= width:
            col_idx = 0
            row_idx += chunk_height
    return merged_image


def main(height, width, num_samples=3):
    create_data_folders(height)
    chunk_size = 100  # Chunk size for height and width
    num_chunks = height // chunk_size  # Number of chunks in each dimension
    for i in range(num_samples):
        # Initialize empty lists to collect micro and cancer image chunks
        micro_chunks = []
        cancer_chunks = []
        for row in range(num_chunks):
            for col in range(num_chunks):
                # Generate micro image chunk
                micro_chunk = generate_image(chunk_size, chunk_size)
                micro_chunks.append(micro_chunk)
                # Generate dye image chunk
                dye_chunk = dye_image(chunk_size, chunk_size)
                # Subtract dye from micro image chunk to get cancer image chunk
                cancer_chunk = subtract_images(micro_chunk, dye_chunk)
                cancer_chunks.append(cancer_chunk)
                # Save micro and cancer images (folder name matches create_data_folders)
                micro_img = Image.fromarray(micro_chunk, 'L')
                micro_img.save(f'data_{height}/micro_image_{i}_{row}_{col}.png')
                cancer_img = Image.fromarray(cancer_chunk, 'L')
                cancer_img.save(f'data_{height}/cancer_image_{i}_{row}_{col}.png')
        # Merge micro and cancer image chunks
        merged_micro_image = merge_chunks(micro_chunks, height, width, chunk_size, chunk_size)
        merged_cancer_image = merge_chunks(cancer_chunks, height, width, chunk_size, chunk_size)
        # Count white pixels for micro and cancer images
        micro_pixel_count = count_white_pixels(merged_micro_image)
        cancer_pixel_count = count_white_pixels(merged_cancer_image)
        # Calculate the percentage of dye present in the parasite
        if micro_pixel_count != 0:
            cancer_threshold = (1 - (cancer_pixel_count / micro_pixel_count)) * 100
        else:
            cancer_threshold = float('inf')  # Handle division by zero
        print(f"Sample {i + 1}:")
        print("Micro image white pixel count:", micro_pixel_count)
        print("Cancer image white pixel count:", cancer_pixel_count)
        print("Dye present in parasite (%):", cancer_threshold)
        if cancer_threshold > 10:
            print("Cancer detected")
        else:
            print("Cancer not detected")
        print()
        # Save final merged micro and cancer images
        final_merged_micro_img = Image.fromarray(merged_micro_image, 'L')
        final_merged_micro_img.save(f'data_{height}/final/merged_micro_image_{i}.png')
        final_merged_cancer_img = Image.fromarray(merged_cancer_image, 'L')
        final_merged_cancer_img.save(f'data_{height}/final/merged_cancer_image_{i}.png')


if __name__ == "__main__":
    start_time = time.time()  # Record the start time
    main(1000, 1000)  # Call the main function
    # main(100000, 100000)  # Full-size run per the project specification
    end_time = time.time()  # Record the end time
    # Calculate the elapsed time
    elapsed_time = end_time - start_time
    print(f"Execution time: {elapsed_time} seconds")

Q1. Come up with efficient data structures to represent both types of images: those generated by the microscope, and those generated by the dye sensor. These need not have the same representation; the only requirement is that they be compact and take as little storage space as possible. Explain why you picked the representation you did for each image type, and if possible estimate how much storage would be taken by the images. What is the worst-case storage size in bytes for each image representation you chose?

To represent the images efficiently, we need to consider the following factors:

  • Compactness: The data structures should use as little memory as possible to store the images.
  • Ease of manipulation: The data structures should allow for easy manipulation and processing of the images.
  • Compatibility with existing libraries: The chosen data structures should be compatible with libraries like NumPy and PIL for image processing.

Based on these considerations, here are the data structures and explanations for representing the microscope and dye sensor images:

Microscope Images: For microscope images, the key characteristics are that they are grayscale and binary (each pixel has values of either 0 or 255).

Data Structure: NumPy arrays with data type np.uint8 (8-bit unsigned integers).

Explanation:

  • NumPy arrays are efficient for handling large datasets in Python.
  • The np.uint8 data type represents each pixel with 8 bits (1 byte), the minimum per-pixel memory for a grayscale image.
  • Grayscale images need only one channel, so they can be represented as 2D arrays.
  • This data structure enables easy manipulation and processing of the images using NumPy's built-in functions.

Estimated Storage: For a 100,000 x 100,000 grayscale image:
Total number of pixels = 100,000 x 100,000 = 10,000,000,000 pixels
Storage required = 10,000,000,000 pixels x 1 byte/pixel = 10,000,000,000 bytes
Converting to gigabytes: 10,000,000,000 bytes / (1024 x 1024 x 1024) ≈ 9.31 GB

Since the uint8 representation costs 1 byte per pixel regardless of image content, this is also the worst-case storage size.

Dye Sensor Images: Dye sensor images are also grayscale and binary, similar to microscope images. However, they contain random patterns generated by the dye sensor.

Data Structure: Same as microscope images - NumPy arrays with data type np.uint8.

Explanation:

  • Since dye sensor images share similar characteristics with microscope images, using the same data structure (NumPy arrays with np.uint8 dtype) is appropriate.
  • The data structure allows for easy storage, manipulation, and processing of the dye sensor images.

Estimated Storage: Same as microscope images - approximately 9.31 GB for a 100,000 x 100,000 grayscale image.
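As a quick sanity check on these estimates (a sketch, not part of the submitted script), the per-image cost follows directly from the array shape and dtype:

    import numpy as np

    h, w = 100_000, 100_000
    # Actually allocating np.zeros((h, w), dtype=np.uint8) would need ~10 GB
    # of RAM, so only the size is computed here.
    size_bytes = h * w * np.dtype(np.uint8).itemsize
    print(size_bytes)                # 10000000000 bytes
    print(size_bytes / (1024 ** 3))  # ~9.31, the GB figure above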



Q2. Before the researchers give you real images to work with, you would like to test out any code you write. To this end, you would like to create “fake” simulated images and pretend they were captured by the microscope and the dye sensor. Using the data structures you chose in (1) above, write code to create such simulated images. Try and be as realistic in the generated images as possible.

Q3. Using the simulated images generated by the code you wrote for (2) above as input, write a function to compute whether a parasite has cancer or not.

For Question 2 and Question 3, I came up with the idea of using square blocks to build a magnified image of a parasite. I start with a block in the center and randomly add two more adjacent to it, sized so that the blocks fill at least a quarter of the image (the side length is calculated to guarantee this). For the dye image, I use a similar logic: I pick a random spot and draw a line in a random direction, 10 to 20 pixels long, and repeat this 2x the image height times so the strokes cover a meaningful area. Then I subtract the dye image from the parasite image to see where the dye stained the parasite. Finally, I check whether the stained pixels cover more than 10% of the parasite area; if so, the parasite probably has cancer. I used libraries like NumPy, Pillow, and matplotlib, and it's easy to run: just install the requirements and use python challenge.py. Here's a breakdown of how the program, which simulates microscope and dye sensor images for detecting cancerous regions, works:
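For reference, a single simulated image pair can be produced directly from the functions in challenge.py (a sketch, assuming challenge.py is importable from the working directory):

    from challenge import generate_image, dye_image, subtract_images

    micro = generate_image(1000, 1000)    # simulated microscope image
    dye = dye_image(1000, 1000)           # simulated dye-sensor image
    cancer = subtract_images(micro, dye)  # parasite body with stained pixels removed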

Folder Creation: I've set up a function to create folders to organize the generated images.

Image Generation:

  • For microscope images, I've implemented a method that generates square blobs in the center and randomly places adjacent square blobs around it.
  • As for dye sensor images, I've created a function that draws random lines in various directions to simulate dye concentration.

Image Manipulation:

  • To isolate the dye regions within the parasite, I subtract the dye sensor image from the microscope image.

Image Analysis:

  • I've included a function to count the number of white pixels in an image. This helps in calculating the percentage of dye present in the parasite.

Image Saving: Once the images are generated and analyzed, I save them in the appropriate folders.

Detection of Cancerous Regions:

  • Based on the percentage of dye present in the parasite, the program determines whether cancerous regions are detected. If the percentage exceeds 10%, it flags it as cancer.

Execution Time Calculation: Lastly, I measure the execution time for generating and analyzing the images.

I've used common libraries like NumPy, Pillow, and matplotlib for image manipulation and analysis. I believe this program could be helpful for simulating cancer detection scenarios.
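Condensing the Question 3 logic into one standalone function (a sketch; has_cancer is a helper name introduced here, while count_white_pixels comes from challenge.py):

    from challenge import count_white_pixels

    def has_cancer(micro_image, cancer_image, threshold_pct=10):
        # cancer_image = micro - dye, so the white pixels lost in the
        # subtraction are exactly the dye-stained part of the parasite body.
        micro_white = count_white_pixels(micro_image)
        if micro_white == 0:
            return False  # no parasite body in view
        stained_pct = (1 - count_white_pixels(cancer_image) / micro_white) * 100
        return stained_pct > threshold_pct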



Q4. You give your code from (3) to the researchers, who run it and find that it is running too slowly for their liking. What can you do to improve the execution speed? Write the code to implement the fastest possible version you can think of for the function in (3).

To optimize speed, I represented images as NumPy arrays and relied on NumPy slicing, which is efficient in most cases. Rather than a sliding-window technique, I devised a subtraction-based method, with pixel calculations executed in NumPy and white pixels counted using the sum() function. I also tried a MATLAB-style approach of storing pixels as either 1 or NaN, so that NaN pixels would be excluded automatically during processing; the aim was to concentrate solely on the parasite region (marked as white) and to conserve space by storing only pixels with a value of 1. Despite my efforts, this approach did not yield the desired results and ran into unexpected delays.
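Here is a minimal sketch of a faster detection path, assuming the uint8 representation from Question 1 (dye_fraction_fast is a hypothetical helper, not from the submitted script). Boolean masks make the explicit subtraction, clipping, and intermediate cancer image unnecessary, and np.count_nonzero is generally faster than summing equality comparisons:

    import numpy as np

    def dye_fraction_fast(micro, dye):
        # Boolean views instead of a full subtracted image
        parasite = micro == 255
        stained = parasite & (dye == 255)  # dye pixels inside the parasite body
        body = np.count_nonzero(parasite)
        if body == 0:
            return 0.0
        return 100.0 * np.count_nonzero(stained) / body

    # Cancer detected if the stained fraction exceeds the 10% threshold:
    # dye_fraction_fast(micro, dye) > 10

This computes the same percentage as the script's (1 - cancer_count / micro_count) * 100, since the white pixels removed by the subtraction are exactly the dye pixels inside the parasite. For chunked inputs, the per-chunk work could additionally be spread across the multiprocessing.Pool that challenge.py already imports.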



Q5. What other compression techniques can you suggest for both types of images (parasite and dye)? How would they impact runtime? Can you compute actual runtime and storage costs for typical images (not oversimplified images such as a circle for the parasite, or simple straight lines or random points for dye) in your code? The measurements should be done on your computer with an actual image size of 100,000x100,000 pixels (and not a scaled down version).

I explored some additional compression techniques for both parasite and dye images, considering how they might impact performance and storage space.

  • Lossless techniques:

    • Run-length encoding (RLE): Useful for large areas of uniform color within the parasite, but less effective for complex patterns.
    • Huffman coding: Adapts to pixel value frequency, potentially achieving good compression for skewed color distributions.
    • Dictionary-based: Techniques like LZMA or Zstd can outperform RLE/Huffman, especially for repeated patterns.
  • Lossy techniques:

    • JPEG: Classic technique balancing quality and compression. It might introduce blurring, but it significantly reduces file size. Downsampling via max pooling (the strategy used when training CNN models) could also serve as an aggressive lossy reduction.

Runtime impact:

  • Lossless techniques generally have lower runtime overhead due to simpler decompression.
  • Among lossy techniques, JPEG is usually faster than wavelet or fractal compression.
  • Dictionary-based techniques have varying runtime costs depending on the specific algorithm and dictionary.

Storage costs:

  • Lossless techniques typically achieve lower compression ratios compared to lossy techniques, resulting in larger file sizes.
  • Among lossy techniques, the compression ratio and resulting storage size depend on the chosen quality level. Higher quality settings typically lead to larger file sizes.
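As a rough measurement harness (a sketch only; I did not run it at the full 100,000x100,000 scale), the lossless path can be timed by packing the binary image to 1 bit per pixel with np.packbits and compressing with zlib, whose DEFLATE algorithm combines LZ77 with Huffman coding. generate_image is imported from challenge.py, assumed to be on the import path:

    import time
    import zlib
    import numpy as np
    from challenge import generate_image

    img = generate_image(1000, 1000)  # binary uint8 test image

    bits = np.packbits(img == 255)    # 1 bit/pixel: 8x smaller before compression
    t0 = time.time()
    compressed = zlib.compress(bits.tobytes(), 6)
    t1 = time.time()

    print("raw uint8 bytes:  ", img.nbytes)
    print("packed-bit bytes: ", bits.nbytes)
    print("zlib bytes:       ", len(compressed))
    print(f"compress time:     {t1 - t0:.3f} s")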


Q6. Describe what tools you used to solve the challenge, particularly any LLM techniques.

  • Utilized Stack Overflow for troubleshooting and accessing community-driven solutions to coding challenges.
  • Leveraged ChatGPT to refine understanding of concepts, brainstorm potential solutions, and seek validation for ideas.
  • Explored Google Bard for reference code structures and boilerplate templates, facilitating the structuring of code in an effective manner.
  • Conducted comprehensive searches using Google to verify the validity of approaches and gather additional insights.
requirements.txt:

numpy==1.24.3
Pillow==10.1.0
matplotlib==3.8.2