We propose a tool called BigHASH for efficiently detecting tampering of big data programs (e.g., by malware) when they are executed in a private cluster or a public cloud environment. BigHASH produces execution metadata for a program that precisely captures the program's critical internal data structures and content at runtime using graph algorithms and homomorphic hashing. Homomorphic hashing provides two key benefits: (a) it enables parallel hash computation for efficiency, and (b) it provides the ability to cope with cluster environments containing different numbers of servers when executing the program. BigHASH stores the execution metadata of programs on a blockchain network, which provides decentralized, secure, tamper-proof storage. To detect whether a program was tampered with during execution, BigHASH compares the execution metadata published by the owner (in a trusted environment) on the blockchain network against the metadata produced by a user in their cluster environment. BigHASH is simple to use and provides automatic code instrumentation, so a programmer is not burdened with writing any extra code to use it.
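The abstract does not spell out the hash construction, but the following Python sketch shows one standard multiplicative (multiset-homomorphic) hash that illustrates both claimed benefits: per-server partial hashes computed in parallel combine into the same digest regardless of how chunks were split across servers. All names and the choice of modulus are hypothetical, not BigHASH's actual design.

```python
import hashlib

P = 2**127 - 1  # hypothetical prime modulus for the multiplicative group

def chunk_hash(chunk: bytes) -> int:
    # Map a data chunk to a nonzero group element mod P.
    digest = int.from_bytes(hashlib.sha256(chunk).digest(), "big")
    return (digest % (P - 1)) + 1

def partial_hash(chunks) -> int:
    # Hash the chunks assigned to one server.
    acc = 1
    for c in chunks:
        acc = (acc * chunk_hash(c)) % P
    return acc

def combine(*partials) -> int:
    # Homomorphic property: combining per-server partial hashes yields
    # the same digest no matter how many servers the work was split over.
    acc = 1
    for p in partials:
        acc = (acc * p) % P
    return acc

chunks = [b"vertex:1", b"vertex:2", b"edge:1-2", b"edge:2-3"]
assert combine(partial_hash(chunks[:2]), partial_hash(chunks[2:])) == partial_hash(chunks)
```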
The U.S. Food and Drug Administration (FDA) has approved two digital pathology systems for primary diagnosis. These systems produce and consume whole slide images (WSIs) constructed from glass slides using advanced digital slide scanners. WSIs can greatly improve the workflow of pathologists through the development of novel image analytics software for automatic detection of cellular and morphological features and for disease diagnosis using histopathology slides. However, the gigabyte size of a WSI poses a serious challenge for the storage and retrieval of millions of WSIs. In this paper, we propose a system for scalable storage of WSIs and fast retrieval of image tiles using DRAM. A WSI is partitioned into tiles and sub-tiles using a combination of a space-filling curve, recursive partitioning, and Dewey numbering. The tiles are then stored as a collection of key-value pairs in DRAM. During retrieval, a tile is fetched using key-value lookups from DRAM. Through performance evaluation on a 24-node cluster using 100 WSIs, we observed that, compared to Apache Spark, our system was three times faster at storing the 100 WSIs and 1,000 times faster at accessing a single tile, achieving millisecond latency. Such fast access to tiles is highly desirable when developing deep learning-based image analytics solutions on millions of WSIs.
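As a rough illustration of the indexing scheme, here is a minimal Python sketch that keys tiles by a Z-order (Morton) space-filling-curve index and stores them in an in-memory dictionary standing in for the DRAM key-value store. The Dewey-style sub-tile suffix and all names are hypothetical simplifications of the paper's scheme.

```python
def morton_index(row: int, col: int, bits: int = 16) -> int:
    # Interleave the bits of (row, col) so spatially nearby tiles
    # receive nearby one-dimensional indices (Z-order curve).
    idx = 0
    for i in range(bits):
        idx |= ((col >> i) & 1) << (2 * i)
        idx |= ((row >> i) & 1) << (2 * i + 1)
    return idx

store = {}  # stand-in for the DRAM key-value store

def put_subtile(wsi_id: str, row: int, col: int, dewey: str, data: bytes):
    # The Dewey-style suffix (e.g., "1.3") addresses a sub-tile produced
    # by recursively partitioning the tile at (row, col).
    store[f"{wsi_id}:{morton_index(row, col)}:{dewey}"] = data

def get_subtile(wsi_id: str, row: int, col: int, dewey: str) -> bytes:
    return store[f"{wsi_id}:{morton_index(row, col)}:{dewey}"]

put_subtile("wsi-001", row=5, col=9, dewey="1.3", data=b"...tile bytes...")
print(len(get_subtile("wsi-001", 5, 9, "1.3")))
```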
Convolutional neural networks (CNNs) have been widely used to solve the problem of cell/nuclei classification and segmentation in histopathology images. Despite their pervasiveness, CNNs must typically be fine-tuned on specific, large, labeled datasets, and because such datasets are hard to collect and annotate, this is not a scalable approach. In this work, we aim to gain deeper insights into the nature of the problem. We used a cervical cancer dataset with cells labeled into four classes by an expert pathologist. By employing pre-training on this dataset, we propose a one-shot learning model for cervical cell classification in histopathology tissue images. We extract regional maximum activation of convolutions (R-MAC) global descriptors and train a one-shot learning memory module, with the goal of applying it to various cancer types and eliminating the need for expensive, difficult-to-collect, large, labeled whole slide image (WSI) datasets. Our model achieved 94.6% accuracy in detecting the four cell classes on the test dataset. Further, we present our analysis of the dataset and features to better understand and visualize the problem in general.
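To make the descriptor concrete, below is a simplified NumPy sketch of R-MAC pooling over a convolutional feature map. The published method uses overlapping regions at multiple scales and typically PCA whitening, so treat this (and the function names) as an illustrative approximation rather than the paper's pipeline.

```python
import numpy as np

def l2n(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-12)

def rmac_descriptor(fmap: np.ndarray, levels=(1, 2, 3)) -> np.ndarray:
    # fmap: (C, H, W) activations from the last conv layer of a CNN.
    C, H, W = fmap.shape
    agg = np.zeros(C)
    for l in levels:
        hs, ws = H // l, W // l  # simplified non-overlapping grid per scale
        for i in range(l):
            for j in range(l):
                region = fmap[:, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
                agg += l2n(region.max(axis=(1, 2)))  # regional max-activation
    return l2n(agg)  # single global descriptor for nearest-neighbor matching

desc = rmac_descriptor(np.random.rand(512, 7, 7))
print(desc.shape)  # (512,)
```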
Whole slide images (WSIs) can greatly improve the workflow of pathologists through the development of software for automatic detection and analysis of cellular and morphological features. However, the gigabyte size of a WSI poses a serious challenge for scalable storage and fast retrieval, which are essential for next-generation image analytics. In this paper, we propose a system for scalable storage of WSIs and fast retrieval of image tiles using Apache Spark, a space-filling curve, and popular data storage formats. We investigate two schemes for storing the tiles of WSIs. In the first scheme, all the WSIs are stored in a single table (partitioned by certain table attributes for fast retrieval). In the second scheme, each WSI is stored in a separate table. The records in each table are sorted using the index values assigned by the space-filling curve. We also study two data storage formats for storing WSIs: Parquet and ORC (Optimized Row Columnar). Through performance evaluation on a 16-node cluster in CloudLab, we observed that ORC enables faster retrieval of tiles than Parquet and requires 6 times less storage space. We also observed that the two schemes for storing WSIs achieved comparable performance. On average, our system took 2 seconds to retrieve a single tile and less than 6 seconds for 8 tiles on up to 80 WSIs. We also report the tile retrieval performance of our system on Microsoft Azure to gain insight into how the underlying computing platform can affect performance.
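The exact table layouts are not given in the abstract, but the following PySpark sketch conveys the flavor of the first scheme: one table of tiles for all WSIs, sorted within partitions by a space-filling-curve index and written in ORC (swap .orc for .parquet to compare formats). Paths, schema, and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wsi-tile-store").getOrCreate()

# Hypothetical rows: one record per tile, with the index assigned by the
# space-filling curve used to sort records within each table partition.
rows = [("wsi-001", 0, bytearray(b"...")),
        ("wsi-001", 3, bytearray(b"...")),
        ("wsi-002", 1, bytearray(b"..."))]
tiles = spark.createDataFrame(rows, "wsi_id STRING, curve_index LONG, tile BINARY")

# Scheme 1: a single table for all WSIs, partitioned by wsi_id and sorted
# by the curve index so spatially nearby tiles are stored contiguously.
(tiles.repartition("wsi_id")
      .sortWithinPartitions("curve_index")
      .write.mode("overwrite")
      .partitionBy("wsi_id")
      .orc("/data/wsi_tiles"))        # use .parquet(...) to compare formats

# Retrieval of a single tile by key-based filtering.
hit = (spark.read.orc("/data/wsi_tiles")
            .filter("wsi_id = 'wsi-001' AND curve_index = 3"))
hit.show()
```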
Kobus Barnard, Andrew Connolly, Larry Denneau, Alon Efrat, Tommy Grav, Jim Heasley, Robert Jedicke, Jeremy Kubica, Bongki Moon, Scott Morris, Praveen Rao
We describe a proposed architecture for the Large Synoptic Survey Telescope (LSST) moving object processing pipeline, based on a similar system under development for the Pan-STARRS project. This pipeline is responsible for identifying and discovering fast-moving objects such as asteroids, updating information about them, generating appropriate alerts, and supporting queries about moving objects. Of particular interest are potentially hazardous asteroids (PHAs).
We consider the system as being composed of two interacting components. First, candidate linkages corresponding to moving objects are found by linking detections into "tracklets". To achieve this in reasonable time, we have developed specialized data structures and algorithms that efficiently evaluate the possibilities using quadratic fits of the detections over a modest time scale.
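The specialized data structures that prune candidate pairs are beyond a short sketch, but the quadratic-fit test itself can be illustrated in a few lines of Python: fit quadratic motion models in each sky coordinate and score the linkage by its residual. Names and the synthetic data are purely illustrative.

```python
import numpy as np

def linkage_residual(times, ra, dec):
    # Fit quadratic motion models independently in each coordinate and
    # return the RMS residual; a small value means the detections are
    # consistent with a single smoothly moving object over this window.
    mse = 0.0
    for coord in (ra, dec):
        coeffs = np.polyfit(times, coord, deg=2)
        mse += np.mean((np.polyval(coeffs, times) - coord) ** 2)
    return np.sqrt(mse)

t = np.array([0.0, 1.0, 2.0, 3.0])
ra = 10.0 + 0.03 * t + 0.001 * t**2      # synthetic, smoothly moving source
dec = -5.0 + 0.02 * t
print(linkage_residual(t, ra, dec))       # ~0: a plausible candidate linkage
```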
For the second component, we take a Bayesian approach to validating, refining, and merging linkages over time. Thus, new detections increase our belief that an orbit is correct and contribute to better orbital parameters. Conversely, missed expected detections reduce the probability that the orbit exists. Finally, new candidate linkages are confirmed or refuted based on previous images.
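A minimal sketch of this belief update, assuming simple fixed detection likelihoods (the actual model is richer, folding in orbital-parameter refinement), is the following single-step Bayes rule; the probability values are hypothetical:

```python
def update_orbit_belief(prior: float, detected: bool,
                        p_det_real: float = 0.9,      # hypothetical: detection rate if orbit is real
                        p_det_spurious: float = 0.01  # hypothetical: rate if orbit is spurious
                        ) -> float:
    # One Bayes update on P(orbit is real) after checking whether the
    # predicted detection appears in a new image.
    l_real = p_det_real if detected else 1.0 - p_det_real
    l_spur = p_det_spurious if detected else 1.0 - p_det_spurious
    num = l_real * prior
    return num / (num + l_spur * (1.0 - prior))

belief = 0.5
belief = update_orbit_belief(belief, detected=True)    # new detection raises belief
belief = update_orbit_belief(belief, detected=False)   # missed detection lowers it
print(round(belief, 3))
```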
In order to assign new detections to existing orbits, we propose bipartite graph matching to find a maximum-likelihood assignment subject to the constraint that each detection matches at most one orbit and vice versa. We describe how to construct this matching process to properly handle false detections and missed detections.
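One standard way to realize such a constrained maximum-likelihood assignment is the Hungarian algorithm over a matrix of negative log-likelihood costs, padded with dummy columns so a detection may instead be declared false. The sketch below uses SciPy; the cost values and the flat miss cost are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical negative log-likelihoods: rows = detections, cols = orbits.
cost = np.array([[0.2, 2.5, 3.0],
                 [2.8, 0.4, 2.9]])
miss_cost = 1.0  # cost of declaring a detection false (orbit stays unmatched)

# Pad with one dummy column per detection so each detection may remain
# unassigned, enforcing the at-most-one constraint in both directions.
padded = np.hstack([cost, np.full((cost.shape[0], cost.shape[0]), miss_cost)])
rows, cols = linear_sum_assignment(padded)
for r, c in zip(rows, cols):
    if c < cost.shape[1]:
        print(f"detection {r} -> orbit {c}")
    else:
        print(f"detection {r} -> false detection")
```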