Slowdown when reading from URLConnection input stream (even with byte[] and buffers)

OK, so after spending two days trying to figure out the problem and reading various articles, I finally decided to ask for advice (my first time here).

Now, before the problem: I am writing a program that will analyze API data from a game, namely the battle logs. There will be many records in the database (20+ million), so the speed of parsing each battle log page matters a lot.

The pages to be analyzed look like this: http://api.erepublik.com/v1/feeds/battle_logs/10000/0 (view the source when using Chrome; it doesn't render the page properly). Each page has 1000 hit records followed by a little battle info (the last page will have <1000, obviously). On average, a page contains 175,000 characters, UTF-8 encoded, XML format (v 1.0). The program will run locally on a good PC and memory is practically unlimited (so creating a byte[250000] is perfectly fine).

The format never changes, which is very convenient.

Now I started as usual:

//global vars, class declaration skipped

public WebObject(String url_string, int connection_timeout, int read_timeout,
        boolean redirects_allowed, String user_agent)
        throws java.net.MalformedURLException, java.io.IOException {
    // Open a URL connection
    java.net.URL url = new java.net.URL(url_string);
    java.net.URLConnection uconn = url.openConnection();
    if (!(uconn instanceof java.net.HttpURLConnection)) {
        throw new java.lang.IllegalArgumentException("URL protocol must be HTTP");
    }
    conn = (java.net.HttpURLConnection) uconn;
    conn.setConnectTimeout(connection_timeout);
    conn.setReadTimeout(read_timeout);
    conn.setInstanceFollowRedirects(redirects_allowed);
    conn.setRequestProperty("User-agent", user_agent);
}

public void executeConnection() throws IOException {
    try {
        is = conn.getInputStream(); // global var
        l = conn.getContentLength(); // global var
    } catch (Exception e) {
        // handling code skipped
    }
}

//getContentStream and getLength methods, which just return 'is' and 'l', are skipped

This is where the fun part began. I did some profiling (using System.currentTimeMillis()) to see what takes a long time and what doesn't. Calling this method takes only ~200ms on average:

public InputStream getWebPageAsStream(int battle_id, int page) throws Exception {
    String url = "http://api.erepublik.com/v1/feeds/battle_logs/" + battle_id + "/" + page;
    WebObject wobj = new WebObject(url, 10000, 10000, true, "Mozilla/5.0 "
            + "(Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)");
    wobj.executeConnection();
    l = wobj.getContentLength(); // global variable
    return wobj.getContentStream(); //returns 'is' stream
}

I expected ~200ms for the network operation and I'm fine with that. BUT when I parse the input stream in any way (read it line by line / run it through a Java XML parser / copy it into a ByteArrayInputStream) the process takes over 1000ms!
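To put numbers on it, the measurement was along these lines (a minimal sketch; convertToXML() is the method shown right below, and the battle_id/page values are arbitrary):

long t0 = System.currentTimeMillis();
InputStream is = getWebPageAsStream(10000, 0);  // ~200ms on average
long t1 = System.currentTimeMillis();
Document doc = convertToXML(is);                // ~1000ms, the surprising part
long t2 = System.currentTimeMillis();
System.out.println("fetch: " + (t1 - t0) + "ms, parse: " + (t2 - t1) + "ms");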

For example, this code takes 1000ms if I pass the stream ('is') I got from getContentStream() above directly to this method:

public static Document convertToXML(InputStream is) throws ParserConfigurationException, IOException, SAXException {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.parse(is);
    doc.getDocumentElement().normalize();
    return doc;
}
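A small aside, not the cause of the 1000ms but relevant at millions of pages: DocumentBuilderFactory.newInstance() performs a service lookup on every call, so the factory can be hoisted. A sketch of a hypothetical variant (single-threaded use assumed, since builders are not thread-safe):

private static final DocumentBuilderFactory DBF = DocumentBuilderFactory.newInstance(); // created once

public static Document convertToXML2(InputStream is) throws ParserConfigurationException, IOException, SAXException {
    DocumentBuilder db = DBF.newDocumentBuilder(); // cheap compared to the factory lookup
    Document doc = db.parse(is);
    doc.getDocumentElement().normalize();
    return doc;
}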

This code also takes about 920ms if the initial 'is' stream is passed in (don't mind the parsing code itself; it just extracts the data I need by counting characters directly, which works thanks to the rigid API feed format):

public static parsedBattlePage convertBattleToXMLWithoutDOM(InputStream is) throws IOException {
    // Point A
    BufferedReader br = new BufferedReader(new InputStreamReader(is));
    LinkedList<String> ll = new LinkedList<String>();
    String str = br.readLine();
    while (str != null) {
        ll.add(str);
        str = br.readLine();
    }
    if (ll.get(1).indexOf("error") != -1) {
        return new parsedBattlePage(null, null, true, -1);
    }
    // Point B
    Iterator<String> it = ll.iterator();
    // skip the first four lines of the feed
    it.next();
    it.next();
    it.next();
    it.next();
    String[][] hits_arr = new String[1000][4];
    String t_str = it.next();
    String tmp = null;
    int j = 0;
    // each hit record spans 6 lines; the substring offsets rely on the fixed feed format
    for (int i = 0; t_str.indexOf("time") != -1; i++) {
        hits_arr[i][0] = t_str.substring(12, t_str.length() - 11);
        tmp = it.next();
        hits_arr[i][1] = tmp.substring(14, tmp.length() - 9);
        tmp = it.next();
        hits_arr[i][2] = tmp.substring(15, tmp.length() - 10);
        tmp = it.next();
        hits_arr[i][3] = tmp.substring(18, tmp.length() - 13);
        it.next();
        it.next();
        t_str = it.next();
        j++;
    }
    String[] b_info_arr = new String[9];
    int[] space_nums = {13, 10, 13, 11, 11, 12, 5, 10, 13};
    for (int i = 0; i < space_nums.length; i++) {
        tmp = it.next();
        b_info_arr[i] = tmp.substring(space_nums[i] + 4, tmp.length() - space_nums[i] - 1);
    }
    // Point C
    return new parsedBattlePage(hits_arr, b_info_arr, false, j);
}

I tried to replace the default BufferedReader with

BufferedReader br = new BufferedReader(new InputStreamReader(is), 250000);

It didn't change much. My second attempt was to replace the code between points A and B with:

Iterator it = IOUtils.lineIterator(is, "UTF-8");

Same result, except that A-B now took 0ms and B-C took 1000ms, so each call to next() must have been consuming significant time (IOUtils is from the Apache commons-io library).

And here is the culprit: in ALL cases the time spent reading the stream line by line, whether through an iterator or a BufferedReader, was about 1000ms, while the rest of the code took 0ms (i.e., was irrelevant). That means reading the stream into the LinkedList, or iterating over it, was somehow eating up all the time. The question was why. Is that just how Java was made? No... that would be silly, so I did another experiment.

In my main method, I added this after getWebPageAsStream():

//Point A
ba = new byte[l]; // 'l' comes from wobj.getContentLength() above
int offset = 0;
int bytesRead;
// 'is' is the original URLConnection InputStream; keep reading until
// the buffer is full or the stream ends
while (offset < l && (bytesRead = is.read(ba, offset, l - offset)) != -1) {
    offset += bytesRead;
}
//Point B
InputStream is2 = new ByteArrayInputStream(ba);
//Now just working with 'is2' - the "copied" stream

The InputStream -> byte[] conversion took 1000ms again. This is the way many people suggest reading an InputStream, and it is still slow. And guess what: both parser methods above (convertToXML() and convertBattleToXMLWithoutDOM()), when passed 'is2' instead of 'is', took at most 50ms to complete.
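Incidentally, since commons-io is already on the classpath, the same copy can be written more compactly; a sketch (behavior is equivalent: it still blocks until the whole response has arrived, and it doesn't depend on Content-Length):

byte[] ba2 = IOUtils.toByteArray(is);             // reads until EOF
InputStream is2 = new ByteArrayInputStream(ba2);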

I read a suggestion that the stream might be waiting for the connection to close before unblocking, so I tried HttpClient 4.0 from HttpComponents ( http://hc.apache.org/httpcomponents-client/index.html ), but the initial InputStream took just as long to parse. For example, this code:

public InputStream getWebPageAsStream2(int battle_id, int page) throws Exception {
    String url = "http://api.erepublik.com/v1/feeds/battle_logs/" + battle_id + "/" + page;
    HttpClient httpclient = new DefaultHttpClient();
    HttpGet httpget = new HttpGet(url);
    HttpParams p = new BasicHttpParams();
    HttpConnectionParams.setSocketBufferSize(p, 250000);
    HttpConnectionParams.setStaleCheckingEnabled(p, false);
    HttpConnectionParams.setConnectionTimeout(p, 5000);
    httpget.setParams(p);
    HttpResponse response = httpclient.execute(httpget);
    HttpEntity entity = response.getEntity();
    l = (int) entity.getContentLength();
    return entity.getContent();
}

took longer overall (more than 50ms extra on the network side), and the parsing time remained the same. Granted, this version creates the HttpClient and the parameters on every call (reusing them would speed up the network part), but the stream issue is independent of that.
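For what it's worth, hoisting the client out of the method would look like this sketch (getWebPageAsStream3 is a hypothetical name; with HttpClient 4.0's default connection manager, each response must be fully consumed before the next request):

private final HttpClient httpclient = new DefaultHttpClient(); // created once, reused

public InputStream getWebPageAsStream3(int battle_id, int page) throws Exception {
    String url = "http://api.erepublik.com/v1/feeds/battle_logs/" + battle_id + "/" + page;
    HttpGet httpget = new HttpGet(url); // params omitted for brevity
    HttpResponse response = httpclient.execute(httpget);
    HttpEntity entity = response.getEntity();
    l = (int) entity.getContentLength();
    return entity.getContent();
}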

So we come to the central question: why does the initial URLConnection InputStream (or the HttpClient InputStream) take so long to process, while a stream of the same size and content created locally is an order of magnitude faster? I mean, the initial response should already be somewhere in RAM, and I can't see any good reason why it is processed so slowly compared to when the same stream is just created from a byte[].

Considering that I have to analyze millions of records and thousands of such pages, a total processing time of almost 1.5 seconds per page seems WAY too long.

Any ideas?

P.S. Please ask for any other code you need. The only thing I do after parsing is build a PreparedStatement and insert the entries into JavaDB in batches of 1000+; performance is about 200ms per 1000 inserts, which could probably be optimized with more caching, but I haven't looked into that much.
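For completeness, the insert stage looks roughly like this (a sketch; 'dbconn', 'num_hits', the table name, and the columns are hypothetical, since the real schema isn't shown):

// 'dbconn' is a java.sql.Connection; 'num_hits' is the 'j' from parsedBattlePage
PreparedStatement ps = dbconn.prepareStatement(
        "INSERT INTO hits (hit_time, f2, f3, f4) VALUES (?, ?, ?, ?)"); // hypothetical schema
for (int i = 0; i < num_hits; i++) {
    ps.setString(1, hits_arr[i][0]);
    ps.setString(2, hits_arr[i][1]);
    ps.setString(3, hits_arr[i][2]);
    ps.setString(4, hits_arr[i][3]);
    ps.addBatch();
}
ps.executeBatch(); // ~200ms per 1000 rows, as measured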

1 Answer


It takes longer because it is reading from the remote server. Your executeConnection() method just establishes the connection; it doesn't actually read the entire response from the server. That only happens once you start reading from the stream, which is why the "parsing" appears slow: it includes the time of the network transfer itself.
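A quick way to confirm this is to separate the transfer from the parse, e.g. with something like this sketch (reusing the question's methods plus commons-io):

long t0 = System.currentTimeMillis();
InputStream is = getWebPageAsStream(10000, 0);                // connection + headers
byte[] body = IOUtils.toByteArray(is);                        // the network transfer happens HERE
long t1 = System.currentTimeMillis();
Document doc = convertToXML(new ByteArrayInputStream(body));  // pure parsing, from RAM
long t2 = System.currentTimeMillis();
System.out.println("transfer: " + (t1 - t0) + "ms, parse: " + (t2 - t1) + "ms");

Based on the measurements in the question, expect roughly 1200ms for the transfer (200ms connect plus ~1000ms body) and ~50ms for the parse.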


