Why does a multithreaded task take longer in Java?
I need to search for a string in 10 large zip files (about 70 MB each) and write the matching lines to 10 corresponding output files (i.e. matches from file1 go to output_file1, from file2 to output_file2, and so on). The same program takes 15 minutes for one file. I expected that using 10 threads to read the 10 files and write to 10 different output files would also finish in roughly 15 minutes, but it takes 40 minutes.
How can I fix this? Or is multithreading simply expected to take longer here?
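Roughly, the setup looks like this (just a minimal sketch, not my real code; the file names and the search string "needle" are placeholders):

    import java.io.*;
    import java.util.concurrent.*;
    import java.util.zip.*;

    public class ParallelSearch {
        public static void main(String[] args) throws Exception {
            // One task per zip file; each task writes its matches to its own output file.
            ExecutorService pool = Executors.newFixedThreadPool(10);
            for (int i = 1; i <= 10; i++) {
                final int n = i;
                pool.submit(() -> searchFile("file" + n + ".zip", "needle", "output_file" + n + ".txt"));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        static Void searchFile(String zipName, String needle, String outName) throws IOException {
            try (ZipInputStream zin = new ZipInputStream(new FileInputStream(zipName));
                 PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(outName)))) {
                if (zin.getNextEntry() != null) {                  // assumes one entry per archive
                    BufferedReader r = new BufferedReader(new InputStreamReader(zin));
                    String line;
                    while ((line = r.readLine()) != null) {
                        if (line.contains(needle)) {
                            out.println(line);
                        }
                    }
                }
            }
            return null;
        }
    }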
File access generally gets slower beyond 2-3 concurrent threads, because the hard drive ends up seeking back and forth between all the files at once, much like reading a heavily fragmented file.
To avoid this, split the work into file readers and file parsers. The file readers fetch (and unzip) the data from a file, and the file parsers search the data. You can use a PipedInputStream / PipedOutputStream pair to forward data from the reader threads to the parser threads.
Since your files are zipped, reading involves both I/O and CPU, and 2-4 reader threads working through all the files can easily keep that busy. The simplest way to parse is to have each parser thread read from a single PipedInputStream, so you end up with one parser thread per file. Using multiple threads per file would require splitting the stream and handling matches at block boundaries, which complicates things and is unnecessary here, since you already have plenty of parallelism with 10 parser threads and 2-4 readers.
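For illustration, here is a minimal sketch of the pipe wiring for a single file; the file names, the search string and the one-entry-per-archive assumption are placeholders, and in the real setup a small pool of 2-4 reader threads would feed one parser thread per file:

    import java.io.*;
    import java.util.zip.*;

    public class PipedSearch {
        // Reader thread: unzips one file and pushes the uncompressed bytes into the pipe.
        static Thread startReader(File zipFile, PipedOutputStream out) {
            Thread t = new Thread(() -> {
                try (ZipInputStream zin = new ZipInputStream(new FileInputStream(zipFile));
                     PipedOutputStream pipe = out) {
                    if (zin.getNextEntry() != null) {              // assumes one entry per archive
                        byte[] buf = new byte[64 * 1024];
                        int n;
                        while ((n = zin.read(buf)) != -1) {
                            pipe.write(buf, 0, n);
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            t.start();
            return t;
        }

        // Parser thread: reads lines from the pipe and writes matching lines to the output file.
        static Thread startParser(PipedInputStream in, String needle, File outFile) {
            Thread t = new Thread(() -> {
                try (BufferedReader r = new BufferedReader(new InputStreamReader(in));
                     PrintWriter w = new PrintWriter(new BufferedWriter(new FileWriter(outFile)))) {
                    String line;
                    while ((line = r.readLine()) != null) {
                        if (line.contains(needle)) {
                            w.println(line);
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            t.start();
            return t;
        }

        public static void main(String[] args) throws Exception {
            PipedOutputStream out = new PipedOutputStream();
            PipedInputStream in = new PipedInputStream(out, 64 * 1024); // 64 KB pipe buffer
            Thread reader = startReader(new File("file1.zip"), out);
            Thread parser = startParser(in, "needle", new File("output_file1.txt"));
            reader.join();
            parser.join();
        }
    }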
I assume you do not have a 10-core CPU, so your threads are not actually running in parallel, which means the job takes somewhat longer than it mathematically should. You should also be aware that thread management itself costs a little time (though that hardly matters here). Perhaps the file search itself can be sped up; for that you would need to post the source code. But some advice:
- try to keep the number of file accesses as low as possible, since this is the slowest operation
- try to use as little memory as possible; if your computer starts swapping memory pages, speed will also drop dramatically
- since you are doing this in Java, you could use regular expressions to find a string within strings, because (as far as I remember) this is the fastest way to search strings in Java (see the sketch at the end of this answer)
But keep in mind that these measures can make the code very hard to read for another person, or for yourself six months from now, because you will not remember everything you did and why you did it (so write comments ;))
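As a minimal sketch of the regex point: compile the pattern once and reuse it for every line instead of recompiling it inside the loop (the file names and the pattern here are only placeholders):

    import java.io.*;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PrecompiledSearch {
        public static void main(String[] args) throws IOException {
            Pattern needle = Pattern.compile("ERROR \\d+");   // compiled once, outside the loop
            Matcher m = needle.matcher("");                   // one Matcher, reset for each line

            try (BufferedReader in = new BufferedReader(new FileReader("file1.txt"));
                 PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output_file1.txt")))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (m.reset(line).find()) {
                        out.println(line);
                    }
                }
            }
        }
    }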
You probably have hard drive contention that multithreading won't help with. In your case, you want just enough threads to keep the hard drive busy at 100%, and no more.
I am assuming the hard drive is your bottleneck, not the processor. Multithreading only makes things "faster" if the threads are not fighting over the same hardware, so with multiple cores (CPUs) and multiple hard drives you would see better multithreaded performance.
I'm surprised it takes 15 minutes for one file.
This is how I would do it. 70 MB is small. Perhaps you can load each 70 MB compressed file into memory, one per thread. Then unzip the data on the fly while searching the decompressed stream, keeping only a small amount of uncompressed data in memory at a time. (Once you've searched it, throw it away.) This avoids the hard drive and lets your processor reach 100% utilization.
If memory is an issue, load several MB at a time from disk.
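A minimal sketch of this for a single file, assuming one entry per archive (the file names and the search string are placeholders):

    import java.io.*;
    import java.nio.file.*;
    import java.util.zip.*;

    public class InMemorySearch {
        public static void main(String[] args) throws IOException {
            // Read the whole ~70 MB archive into RAM, then decompress it on the fly while searching.
            byte[] compressed = Files.readAllBytes(Paths.get("file1.zip"));

            try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(compressed));
                 PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("output_file1.txt")))) {
                if (zin.getNextEntry() != null) {                 // assumes one entry per archive
                    BufferedReader r = new BufferedReader(new InputStreamReader(zin));
                    String line;
                    while ((line = r.readLine()) != null) {       // only a small window is uncompressed at a time
                        if (line.contains("needle")) {
                            out.println(line);
                        }
                    }
                }
            }
        }
    }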
When you run this, is your CPU already at 100%? If not, it's one of two things:
- if the hard drive is the bottleneck, you can try upgrading to a faster hard drive, a RAID 0 stripe (with the risk of data loss), or RAID 5.
- you have a multi-core processor and for some reason the work isn't spread across all cores. You can check this on Windows by pressing CTRL-ALT-DEL, opening Task Manager and looking at the Performance tab. If the CPU usage history is maxed out on only one graph, you are underutilizing your processors and more threads may help. If CPU usage is not maxed out anywhere, the bottleneck is your hard drive and no amount of multithreading will do much good. If CPU usage is maxed out across all cores, then more threads will only make it slower; you need a faster processor, or a better algorithm, to speed this task up.
I'm going to guess this is a GC issue. My guess is that you are reading the files one line at a time into a String. You might even be recompiling the regex for every line. That's a lot of memory allocation, albeit of short-lived objects. Multiple threads can tip this over the edge so that objects get copied into the survivor spaces (in a typical Sun GC implementation). My suggestion is to use VisualVM, or one of the more obscure command-line flags, to monitor how hard the GC is working.
It could also be a locking issue, but the task looks embarrassingly parallel.
You might want to check out Tim Bray's "Wide Finder" project. It is very similar to what you are doing, and I think it addresses most, if not all, of the problems you will encounter. HTH